Lights Out, Systems On: Validating Instant Power Loss Readiness
Summary
Meta introduces Instantaneous PowerLoss Storm, a new testing paradigm within Meta's long-established Disaster Readiness (DR) "Storm" program that validates the infrastructure's ability to handle zero-notice complete power loss of an entire data center region. The post describes how Meta built power-loss tolerance from the ground up into the DC stack (mechanical/electrical → server racks → storage → compute → Twine orchestrator), then validated that tolerance by actually de-energizing large production regions housing critical storage, AI, and data warehouse workloads. Two critical bootstrapping problems and their solutions are detailed: circular dependencies in control-plane startup and a "boomerang effect" where shutdown signals kill the orchestrator itself.
Key Takeaways
- Power-loss tolerance is built bottom-up, not retrofitted. Each layer of the DC stack (mechanical/electrical facilities, server racks, storage, compute, and the Twine orchestrator) was developed with power-loss tolerance as an integral component. The capability to persist in-memory data via batteries and the Power Loss Siren (PLS) is one such primitive. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Region-wide asynchronous signaling via unavailability events (UEs) provides the coordination mechanism for Twine services during power events. A DC region is a site where multiple DC buildings are co-located and share common network and power connectivity; it is typically 50–60× the size of a typical sub-regional fault domain. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Bootstrapping circular dependencies are the primary region-recovery risk. The Twine control plane (Scheduler, Allocator, Broker, Zelos coordinator) must be running before it can start any other service, yet it is itself a service that needs to be started. During regular operations this risk is low; during full-region bootstrap it becomes existential. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Two-pronged solution for circular dependencies: (a) Belljar tests in CI/CD continuously detect critical startup dependencies early and often, eliminating most dependency risks before production deployment; (b) a purpose-built Twine Recovery Kit (Twrko) provides "jumpstart" capability to break any unexpected circular dependencies at boot. Canonical belt-and-braces approach: prevent and recover. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- The "boomerang effect": UE shutdown signals ended up shutting down the orchestrator control-plane services themselves, orphaning services that could never be reaped. Solved by letting control-plane services ignore shutdown signals associated with power-related UEs, a simpler and more sustainable approach than maintaining complex exclusion lists. Canonical ignore-self-generated-shutdown-signal pattern. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Tradeoffs are explicit and bounded. Meta draws a hard line between unacceptable impacts (data loss, permanent DC damage, sustained cross-region impact) and tolerable impacts (transient service errors, rack failures within threshold, bounded staleness in routing tables / region-unavailability detection). Only issues that cannot be resolved by post-incident remediation within a reasonable MTTR fall outside the tolerance boundary. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Validation followed an incremental blast-radius approach: new/pre-production regions → shadow regions replicating production → smallest production regions → large production regions with critical workloads. Canonical patterns/incremental-blast-radius-validation instance: take risk to address risk, but escalate the risk gradually. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- No preemptive actions are taken before tests: the Storm exercises aim to truly represent unexpected power loss, and the chosen MTTR mirrors that of typical real incident scenarios. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Reliability and velocity are "two facets of the same coin": the ability to recover from instantaneous failure laid the foundation for faster DC design innovation, capacity deployment, and pushing the envelope further. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
- Future work: extending the same incremental validation strategy to test regions with live client traffic against instantaneous failures (previous Storms validated storage and database backends). (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
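The circular-dependency takeaways above can be illustrated with a minimal sketch of startup-cycle detection. This is not Belljar or Twrko, whose internals the source does not describe. The dependency edges, the `find_startup_cycle` function, and the `already_running` parameter are all hypothetical; only the service names come from the source. The `already_running` set loosely models the idea of jumpstarting a service out of band so that the remaining graph can boot.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical startup dependencies: each service lists what must already be
# running before it can start. The edges are invented for illustration.
EXAMPLE_DEPS: Dict[str, List[str]] = {
    "scheduler": ["allocator"],
    "allocator": ["broker"],
    "broker": ["coordinator"],
    "coordinator": ["scheduler"],  # closes the loop
}


def find_startup_cycle(
    deps: Dict[str, List[str]],
    already_running: FrozenSet[str] = frozenset(),
) -> Optional[List[str]]:
    """Return one startup cycle as a list of service names, or None.

    Services in `already_running` count as satisfied dependencies, which
    loosely models jumpstarting them out of band (an assumption).
    """
    done = set(already_running)  # fully explored or externally started
    path: List[str] = []         # services on the current DFS path

    def visit(service: str) -> Optional[List[str]]:
        path.append(service)
        for dep in deps.get(service, []):
            if dep in done:
                continue
            if dep in path:
                # Back edge: dep is (transitively) waiting on itself.
                return path[path.index(dep):] + [dep]
            cycle = visit(dep)
            if cycle is not None:
                return cycle
        path.pop()
        done.add(service)
        return None

    for service in deps:
        if service not in done:
            cycle = visit(service)
            if cycle is not None:
                return cycle
    return None
```

With the example graph, `find_startup_cycle(EXAMPLE_DEPS)` reports the loop from `scheduler` back to `scheduler`, while passing `already_running=frozenset({"coordinator"})` returns `None` because the jumpstarted service no longer blocks the others. A check of this kind could run in CI to catch cycles before deployment, with the jumpstart path kept as the recovery fallback.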
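The fix for the "boomerang effect" described above can be sketched as a small decision rule: control-plane services ignore shutdown signals tied to power-related UEs, and every other combination is honored. The types and names below (`UECause`, `ShutdownSignal`, `Service`, `should_honor_shutdown`) are hypothetical and do not reflect Twine's actual API, which the source does not show.

```python
from dataclasses import dataclass
from enum import Enum


class UECause(Enum):
    """Hypothetical causes attached to an unavailability event (UE)."""
    POWER = "power"
    MAINTENANCE = "maintenance"


@dataclass(frozen=True)
class ShutdownSignal:
    cause: UECause
    region: str


@dataclass(frozen=True)
class Service:
    name: str
    is_control_plane: bool


def should_honor_shutdown(service: Service, signal: ShutdownSignal) -> bool:
    """Decide whether a service should act on a shutdown signal.

    Control-plane services skip power-related shutdowns so the orchestrator
    stays up to reap and later restart everything else. The rule keys off a
    single service property rather than a per-service exclusion list.
    """
    if service.is_control_plane and signal.cause is UECause.POWER:
        return False
    return True
```

For example, a `Service("scheduler", is_control_plane=True)` would ignore a `ShutdownSignal(UECause.POWER, "region-a")` but still honor a maintenance-related one, while an ordinary workload would honor both. The region name here is a placeholder.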
Operational Numbers
- A typical DC region is 50–60× the size of a typical sub-regional fault domain
- Storm exercises culminated in powering off large production regions housing critical storage, AI, and data warehouse workloads
- MTTR used in exercises mirrors typical MTTR seen during real incident scenarios
Caveats
- Architecture-and-discipline voice: no specific region count, region capacity (MW/GPUs), failure count, exact MTTR targets, or post-storm availability numbers are disclosed
- Timeline not disclosed beyond "long-established" DR Storm program
- Specific Twrko architecture not detailed
- Belljar detection completeness not quantified
- Tolerable rack-failure threshold not specified
- Shadow-region fidelity-to-production not characterised
Source
- Original: https://engineering.fb.com/2026/06/03/data-center-engineering/lights-out-systems-on-validating-instant-power-loss-readiness/
- Raw markdown: raw/meta/2026-06-03-lights-out-systems-on-validating-instant-power-loss-readines-9a68295a.md
Related
- companies/meta — Tier-1 source
- concepts/circular-dependency — bootstrapping variant at region scale
- concepts/blast-radius — region as fault domain
- concepts/chaos-engineering — Storm as production chaos test
- concepts/defense-in-depth — layered power-loss tolerance
- systems/meta-twine — container orchestrator central to recovery
- patterns/incremental-blast-radius-validation — staged escalation of test scope