PATTERN Cited by 1 source
Durability review¶
Pattern¶
A durability review is a structured gate applied to any code change that could affect a service's durability posture. The artifact is modeled after a security threat model (see concepts/threat-modeling):
- Summary of the change under review.
- Comprehensive list of threats — "what are all the things that might go wrong?"
- Resilience narrative — for each threat, how the change is resilient (or explicit acknowledgement that it isn't).
The author is encouraged to be "creatively critical of their own code."
Context¶
- Problem: Durability bugs at the bottom of a storage stack cause silent corruption. They're expensive both in data loss and in customer trust, and they're hard to catch late.
- Why not only code review? Standard code review looks at the delta against the local module. Durability issues often hide in the interactions between the change and the broader failure-model of the system — a threat not visible in a diff alone.
- Why not only testing? You can't test exhaustively against a real drive running at 120 IOPS (see concepts/hard-drive-physics).
Mechanism¶
The process does two things well (Warfield, 2025):
- It encourages authors and reviewers to really think critically about the risks we should be protecting against.
- It separates risk from countermeasures, and lets us have separate discussions about the two sides.
The separation is the less-obvious power: risks are enumerated once and carried forward across reviews, while countermeasures can be debated on their own merits (and sometimes replaced by a broader mechanism that kills many risks at once).
Preference for coarse-grained guardrails¶
When identifying protections, the preference is not to write one mitigation per threat, but to find guardrails — simple mechanisms that defeat whole classes of risks.
Rather than nitpicking through each risk and identifying individual mitigations, we like simple and broad strategies that protect against a lot of stuff.
Canonical example: ShardStore's executable specification (see systems/shardstore, concepts/lightweight-formal-verification). Rather than list "disk corruption bug A, B, C…" and mitigate each, the team invested in one structural property — "production behavior must match specification behavior" — and built a verification pipeline to enforce it on every commit.
Pairs with¶
- patterns/executable-specification — the output of a durability review frequently looks like "build this guardrail," and an executable spec is a canonical guardrail.
- concepts/threat-modeling — the structural framework the review inherits.
- concepts/lightweight-formal-verification — the "industrialized verification" flavor of guardrail.
Organizational positioning¶
Durability reviews are called out as a human mechanism that sits alongside the statistical durability model:
It's a human mechanism that's not in the statistical 11 9s model, but it's every bit as important.
The 11 9s number comes from probabilistic durability math on hardware failure rates. Durability reviews are how you keep software changes from being the dominant source of data loss — something the hardware model cannot cover.
Seen in¶
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — origin and description of the S3 durability-review process.