Incremental blast-radius validation
Problem
Validating resilience to a catastrophic failure (e.g., region-wide power loss) carries significant risk — you need to take risk to address risk. Testing at full scale from day one could cause the very disaster you're trying to become resilient against. But never testing at full scale leaves the organisation unprepared.
Solution
Escalate the scope of destructive testing through a sequence of environments with increasing blast radius, learning and hardening at each stage before graduating to the next:
- Pre-production / turn-up regions — validate self-contained problems (e.g., dependency bootstrapping) during initial region bring-up
- Shadow environments — replicate production topology and workloads without serving live traffic
- Smallest production regions — limited live blast-radius; real workloads, real consequences, bounded scope
- Large production regions — full-scale validation with critical workloads (storage, AI, data warehouse)
Each stage's lessons harden the infrastructure iteratively "towards the long-term goal of handling loss of a region as seamlessly as loss of a sub-regional fault domain."
Key Properties
- No preemptive actions — don't drain traffic or pre-position replicas before the test; that defeats the zero-notice validation goal
- Mirror real MTTR — use remediation timelines observed in actual incidents
- Gate graduation — only escalate after prior-stage confidence is established
- Architect for repeatability — the validation becomes a recurring Storm exercise, not a one-time event
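One way to encode the graduation gate, as an added example rather than code from the source or from Meta. The stage order comes from the list above; the field names, the stage identifiers and the two-consecutive-pass threshold are assumptions made for illustration.

```python
from dataclasses import dataclass

# Stage order from the page; identifiers and everything below are assumptions.
STAGES = ["pre-production", "shadow", "small-production", "large-production"]

@dataclass
class Exercise:
    stage: str
    preemptive_actions: bool  # traffic drained or replicas pre-positioned beforehand
    recovered: bool           # services returned to a healthy state
    open_findings: int        # hardening items from this run still unresolved

def counts_as_pass(ex: Exercise) -> bool:
    """A run builds confidence only if it was zero-notice, recovered, and fully remediated."""
    return not ex.preemptive_actions and ex.recovered and ex.open_findings == 0

def highest_allowed_stage(history: list[Exercise], required_passes: int = 2) -> str:
    """Return the widest blast radius the next test may use.

    A stage is cleared only when its most recent `required_passes` runs all pass,
    so one invalid or failed run resets confidence at that stage.
    """
    for ex in history:
        if ex.stage not in STAGES:
            raise ValueError(f"unknown stage: {ex.stage!r}")
    allowed = STAGES[0]
    for i, stage in enumerate(STAGES[:-1]):
        recent = [ex for ex in history if ex.stage == stage][-required_passes:]
        if len(recent) < required_passes or not all(map(counts_as_pass, recent)):
            break  # gate closed: do not look at later stages
        allowed = STAGES[i + 1]
    return allowed

# Constructed sample history (chronological), not real exercise data.
HISTORY = [
    Exercise("pre-production", False, True, 0),
    Exercise("pre-production", False, True, 0),
    Exercise("shadow", True, True, 0),   # traffic was drained first: does not count
    Exercise("shadow", False, True, 0),
]

if __name__ == "__main__":
    print(highest_allowed_stage(HISTORY))
```

The pre-drained shadow run keeps the gate closed even though recovery succeeded, which is the "no preemptive actions" property expressed as a check.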
Canonical Instances
| System | Scope escalation | Source |
|---|---|---|
| Meta Instantaneous PowerLoss Storm | pre-production → shadow → small production → large production regions | sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness |
Distinction from Staged Rollout
Staged rollout escalates deployment of new code. Incremental blast-radius validation escalates the scope of destructive testing against existing infrastructure — the tested system doesn't change between stages, only the blast radius of the test environment.