Lights Out, Systems On: Validating Instant Power Loss Readiness

Summary

Meta introduces Instantaneous PowerLoss Storm, a new testing paradigm within Meta's long-established Disaster Readiness (DR) "Storm" program that validates the infrastructure's ability to handle zero-notice complete power loss of an entire data center region. The post describes how Meta built power-loss tolerance from the ground up into the DC stack (mechanical/electrical → server racks → storage → compute → Twine orchestrator), then validated that tolerance by actually de-energizing large production regions housing critical storage, AI, and data warehouse workloads. Two critical bootstrapping problems and their solutions are detailed: circular dependencies in control-plane startup and a "boomerang effect" where shutdown signals kill the orchestrator itself.

Key Takeaways

Power-loss tolerance is built bottom-up, not retrofitted. Each layer of the DC stack (mechanical/electrical facilities, server racks, storage, compute, and the Twine orchestrator) was developed with power-loss tolerance as an integral component. One such primitive is the ability to persist in-memory data using batteries and the Power Loss Siren (PLS). (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Region-wide asynchronous signaling via unavailability events (UEs) provides the coordination mechanism for Twine services during power events. A DC region is a set of co-located DC buildings that share common network and power connectivity, typically 50–60× the size of a sub-regional fault domain. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Bootstrapping circular dependencies are the primary region-recovery risk. The Twine control plane (Scheduler, Allocator, Broker, Zelos coordinator) must be running before any other service can be started, yet it is itself a set of services that something has to start. During regular operations this risk is low; during a full-region bootstrap it becomes existential. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Two-pronged solution for circular dependencies: (a) Belljar tests in CI/CD continuously detect critical startup dependencies early and often, eliminating most dependency risks before production deployment; (b) a purpose-built Twine Recovery Kit (Twrko) provides "jumpstart" capability to break any unexpected circular dependencies at boot. Canonical belt-and-braces approach — prevent and recover. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
The "boomerang effect": UE shutdown signals ended up shutting down the orchestrator control-plane services themselves, orphaning services that could never be reaped. Solved by letting control-plane services ignore shutdown signals associated with power-related UEs — a simpler, more sustainable approach than maintaining complex exclusion lists. Canonical ignore-self-generated-shutdown-signal pattern. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Tradeoffs are explicit and bounded. Meta draws a hard line between unacceptable impacts (data loss, permanent DC damage, sustained cross-region impact) and tolerable impacts (transient service errors, rack failures within threshold, bounded staleness in routing tables / region-unavailability detection). An issue falls outside the tolerance boundary only if post-incident remediation within a reasonable MTTR cannot resolve it. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Validation followed an incremental blast-radius approach: new/pre-production regions → shadow regions replicating production → smallest production regions → large production regions with critical workloads. Canonical patterns/incremental-blast-radius-validation instance — take risk to address risk, but escalate the risk gradually. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
No preemptive actions are taken before tests: the Storm exercises aim to faithfully represent unexpected power loss, and the MTTR chosen mirrors that of typical real incidents. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Reliability and velocity are "two facets of the same coin": the ability to recover from instantaneous failure laid the foundation for faster DC design innovation, capacity deployment, and pushing the envelope further. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
Future work: extending the same incremental validation strategy to instantaneous failures in regions serving live client traffic (previous Storms validated storage and database backends). (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
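
The circular-dependency takeaway above can be made concrete with a small sketch. The source does not describe how Belljar or Twrko work internally, so this is an illustration only: it assumes startup dependencies can be modeled as a graph, uses Python's standard `graphlib` to detect a cycle, and models a "jumpstart" as starting one service without its dependencies. The service names come from the summary; the edges and the functions `find_startup_cycle`, `boot_order`, and `jumpstart` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical startup dependencies: service -> services that must already run.
# Names are from the summary; the edges are invented for illustration.
STARTUP_DEPS = {
    "zelos": set(),
    "broker": {"zelos"},
    "allocator": {"broker"},
    "scheduler": {"allocator", "broker"},
}


def find_startup_cycle(deps):
    """Return the services forming a dependency cycle, or None if a boot order exists."""
    try:
        # prepare() raises CycleError when the graph contains a cycle.
        TopologicalSorter(deps).prepare()
    except CycleError as err:
        # args[1] lists the nodes in the cycle; first and last are the same node.
        return list(err.args[1])
    return None


def boot_order(deps):
    """Return one valid cold-start order, dependencies first."""
    return list(TopologicalSorter(deps).static_order())


def jumpstart(deps, service):
    """Model a manual jumpstart: treat `service` as startable with no dependencies."""
    patched = {name: set(needs) for name, needs in deps.items()}
    patched[service] = set()
    return patched
```

A check like `find_startup_cycle` fits the "prevent" half (run on every change), while `jumpstart` stands in for the "recover" half: if an unexpected cycle appears at boot, forcing one service up breaks it so the rest can follow in `boot_order`.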
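
The fix for the "boomerang effect" described above is a rule rather than a list, which a few lines can illustrate. This is a minimal sketch, not Twine's actual signal handling: the `Service` type, the `UEKind` enum (including the `MAINTENANCE` kind), and `should_shut_down` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class UEKind(Enum):
    """Hypothetical categories of unavailability event (UE)."""
    POWER = "power"
    MAINTENANCE = "maintenance"


@dataclass(frozen=True)
class Service:
    name: str
    is_control_plane: bool


def should_shut_down(service, ue_kind):
    """Decide whether a service honors a shutdown signal tied to a UE."""
    # Control-plane services ignore power-related UEs so the orchestrator
    # stays up to manage everything else; no per-service exclusion list.
    if service.is_control_plane and ue_kind is UEKind.POWER:
        return False
    return True
```

Because the decision depends only on a service's role and the event type, adding a new control-plane service needs no update to an exclusion list, which is the sustainability argument the summary attributes to the source.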

Operational Numbers

A typical DC region is 50–60× the size of a typical sub-regional fault domain
Storm exercises culminated in powering off large production regions housing critical storage, AI, and data warehouse workloads
MTTR used in exercises mirrors typical MTTR seen during real incident scenarios

Caveats

Architecture-and-discipline voice — no specific region count, region capacity (MW/GPUs), failure count, exact MTTR targets, or post-storm availability numbers disclosed
Timeline not disclosed beyond "long-established" DR Storm program
Specific Twrko architecture not detailed
Belljar detection completeness not quantified
Tolerable rack-failure threshold not specified
Shadow-region fidelity-to-production not characterised

Source

companies/meta — Tier-1 source
concepts/circular-dependency — bootstrapping variant at region scale
concepts/blast-radius — region as fault domain
concepts/chaos-engineering — Storm as production chaos test
concepts/defense-in-depth — layered power-loss tolerance
systems/meta-twine — container orchestrator central to recovery
patterns/incremental-blast-radius-validation — staged escalation of test scope