Meta Instantaneous PowerLoss Storm¶
Instantaneous PowerLoss Storm is a testing paradigm within Meta's long-established Disaster Readiness (DR) "Storm" program. It validates that Meta's infrastructure can handle a zero-notice, complete power loss of an entire data center region with minimal impact on overall fleet availability.
Architecture¶
From 10,000 feet, the Storm consists of:
- A power supply fault is injected, causing immediate de-energization of the entire region
- After a short MTTR, remedial "drain" actions cordon off the impacted region from global controllers/schedulers
- No preemptive actions are taken before the test, so it truly represents an unexpected power loss
The chosen MTTR mirrors the typical MTTR seen during real incidents.
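The ordering of these steps can be sketched as a small simulation. This is an illustrative model only, not Meta's tooling: `Region`, `run_storm`, `visible_to_schedulers`, and the event names are all hypothetical. The point it shows is that between fault injection and the drain, the region is dark but still visible to global schedulers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """Hypothetical model of a data center region (not a real API)."""
    name: str
    energized: bool = True
    drained: bool = False

    @property
    def visible_to_schedulers(self) -> bool:
        # Assumption: a region stays schedulable until it is drained,
        # regardless of whether it still has power.
        return not self.drained


def run_storm(region: Region, mttr_seconds: float) -> List[Tuple[float, str, bool]]:
    """Return a timeline of (time offset, event, visible_to_schedulers)."""
    timeline = []

    # No preemptive drain: the fault is the first thing that happens.
    region.energized = False
    timeline.append((0.0, "power_fault_injected", region.visible_to_schedulers))

    # Remediation only starts after the chosen MTTR has elapsed.
    region.drained = True
    timeline.append((mttr_seconds, "region_drained", region.visible_to_schedulers))
    return timeline
```

Calling `run_storm(Region("hypothetical-region"), mttr_seconds=300.0)` yields two events. The first still reports the region as visible to schedulers, which is the exposure window the test is meant to exercise. The 300-second value is an arbitrary placeholder, not a figure from the source.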
Validation Strategy¶
Validation follows a canonical incremental blast-radius approach, moving to the next stage only after the previous one succeeds:
- Validate self-contained problems (e.g., dependencies) in new/pre-production regions
- Run tests in shadow regions replicating production
- Test in smallest production regions with limited blast-radius
- Power off large production regions housing critical storage, AI, and data warehouse workloads
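The progression above can be expressed as a simple gate that refuses to skip stages. This is a minimal sketch under stated assumptions: the stage names and the `next_stage` helper are hypothetical labels for the four steps listed, and the source does not describe how Meta actually tracks stage results.

```python
from typing import Dict, Optional

# Stages ordered from smallest to largest blast radius (names are illustrative).
STAGES = [
    "pre_production",    # self-contained problems in new/pre-production regions
    "shadow",            # shadow regions replicating production
    "small_production",  # smallest production regions
    "large_production",  # large regions with storage, AI, data warehouse workloads
]


def next_stage(passed: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that has not passed, or None if all have.

    A later stage is never returned while an earlier one is unpassed,
    so the blast radius only grows after smaller tests succeed.
    """
    for stage in STAGES:
        if not passed.get(stage, False):
            return stage
    return None
```

For example, with only the pre-production and shadow stages recorded as passed, `next_stage` returns `"small_production"`. A recorded pass for a later stage does not let the gate skip an earlier stage that has not passed.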
Scope¶
Previous Storms validated storage and database backends. Future work extends validation to regions serving live client traffic, testing them against instantaneous failures.
Lineage¶
Instantaneous PowerLoss Storm is part of Meta's long-established DR Storm program, which is itself an evolution of the Disaster Readiness program first publicly discussed at @Scale conferences.
Seen in¶
- sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness — canonical disclosure