META

Lights Out, Systems On: Validating Instant Power Loss Readiness

Summary

Meta introduces Instantaneous PowerLoss Storm, a new testing paradigm within Meta's long-established Disaster Readiness (DR) "Storm" program that validates the infrastructure's ability to handle zero-notice complete power loss of an entire data center region. The post describes how Meta built power-loss tolerance from the ground up into the DC stack (mechanical/electrical → server racks → storage → compute → Twine orchestrator), then validated that tolerance by actually de-energizing large production regions housing critical storage, AI, and data warehouse workloads. Two critical bootstrapping problems and their solutions are detailed: circular dependencies in control-plane startup and a "boomerang effect" where shutdown signals kill the orchestrator itself.

Key Takeaways

  1. Power-loss tolerance is built bottom-up, not retrofitted. Each layer of the DC stack (mechanical/electrical facilities, server racks, storage, compute, and the Twine orchestrator) was developed with power-loss tolerance as an integral component. The ability to persist in-memory data using batteries and the Power Loss Siren (PLS) is one such primitive. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  2. Region-wide asynchronous signaling via unavailability events (UEs) is the coordination mechanism for Twine services during power events. A DC region is a set of co-located DC buildings that share common network and power connectivity; it is typically 50–60× the size of a typical sub-regional fault domain. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  3. Bootstrapping circular dependencies are the primary region-recovery risk. The Twine control plane (Scheduler, Allocator, Broker, Zelos coordinator) must be running before it can start any other service, yet it is itself made up of services that need to be started. During regular operations this risk is low; during a full-region bootstrap it becomes existential. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  4. Two-pronged solution for circular dependencies: (a) Belljar tests in CI/CD continuously detect critical startup dependencies early and often, eliminating most dependency risks before production deployment; (b) a purpose-built Twine Recovery Kit (Twrko) provides "jumpstart" capability to break any unexpected circular dependencies at boot. Canonical belt-and-braces approach — prevent and recover. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  5. The "boomerang effect": UE shutdown signals ended up shutting down the orchestrator control-plane services themselves, orphaning services that could never be reaped. Solved by letting control-plane services ignore shutdown signals associated with power-related UEs — a simpler, more sustainable approach than maintaining complex exclusion lists. Canonical ignore-self-generated-shutdown-signal pattern. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  6. Tradeoffs are explicit and bounded. Meta draws a hard line between unacceptable impacts (data loss, permanent DC damage, sustained cross-region impact) and tolerable impacts (transient service errors, rack failures within threshold, bounded staleness in routing tables / region-unavailability detection). An issue falls outside the tolerance boundary only if it needs more than post-incident remediation within a reasonable MTTR. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  7. Validation followed an incremental blast-radius approach: new/pre-production regions → shadow regions replicating production → smallest production regions → large production regions with critical workloads. Canonical patterns/incremental-blast-radius-validation instance — take risk to address risk, but escalate the risk gradually. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  8. No preemptive actions are taken before tests: the Storm exercises aim to truly represent unexpected power loss, and the MTTR chosen for them mirrors the MTTR typically seen in real incidents. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  9. Reliability and velocity are "two facets of the same coin": the ability to recover from instantaneous failure laid the foundation for faster DC design innovation, capacity deployment, and pushing the envelope further. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

  10. Future work: extending the same incremental validation strategy to instantaneous failures in regions serving live client traffic; previous Storms validated storage and database backends. (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)
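
To make takeaways 3 and 4 concrete, the sketch below models startup dependencies as a graph, detects a cycle (the kind of check a CI test could run), and then computes a boot order once some services are treated as already running (the "jumpstart" idea). This is an illustration only, not how Belljar or Twrko work; the source does not describe their internals. The component names are taken from the post, but the dependency edges, `STARTUP_DEPS`, `find_startup_cycle`, and `boot_order` are all hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical startup dependencies: service -> services that must be up first.
# Names echo the post; the edges are invented purely for illustration.
STARTUP_DEPS = {
    "scheduler": {"allocator", "coordinator"},
    "allocator": {"broker"},
    "broker": {"scheduler"},  # closes a cycle: scheduler -> allocator -> broker -> scheduler
    "coordinator": set(),
}


def find_startup_cycle(deps):
    """Return one dependency cycle as a list of services, or None if a boot order exists."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as exc:
        # graphlib reports the cycle in args[1]; its first and last entries are the same node.
        return list(exc.args[1])
    return None


def boot_order(deps, jumpstarted=frozenset()):
    """Compute a start order, treating `jumpstarted` services as already running.

    Raises CycleError if a cycle remains after removing the jumpstarted services.
    """
    remaining = {
        svc: {d for d in needs if d not in jumpstarted}
        for svc, needs in deps.items()
        if svc not in jumpstarted
    }
    return list(TopologicalSorter(remaining).static_order())


if __name__ == "__main__":
    print("cycle:", find_startup_cycle(STARTUP_DEPS))
    print("order after jumpstarting broker:", boot_order(STARTUP_DEPS, {"broker"}))
```

Running the detection step continuously corresponds to the "prevent" half of the approach; supplying a jumpstarted set corresponds to the "recover" half, where a cycle that slipped through is broken by bringing one member up out of band.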

Operational Numbers

  • A typical DC region is 50–60× the size of a typical sub-regional fault domain
  • Storm exercises culminated in powering off large production regions housing critical storage, AI, and data warehouse workloads
  • MTTR used in exercises mirrors typical MTTR seen during real incident scenarios
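
The escalation path toward those large production regions can be pictured as a gated ladder: an exercise advances to the next, larger stage only if the previous run stayed within the tolerance boundary. The sketch below is a minimal illustration under stated assumptions. The stage names and the three hard-line impacts come from the takeaways above; `ExerciseResult`, `within_tolerance`, `next_stage`, and the idea of a numeric MTTR budget in minutes are hypothetical, since the source discloses no MTTR targets.

```python
from dataclasses import dataclass

# Stages in escalation order, as listed in the takeaways.
STAGES = [
    "pre-production region",
    "shadow region",
    "smallest production region",
    "large production region",
]


@dataclass
class ExerciseResult:
    """Outcome of one power-loss exercise (hypothetical fields)."""
    data_loss: bool
    permanent_dc_damage: bool
    sustained_cross_region_impact: bool
    recovery_minutes: float


def within_tolerance(result: ExerciseResult, mttr_budget_minutes: float) -> bool:
    """True if no hard-line impact occurred and recovery fit the assumed MTTR budget."""
    hard_line_crossed = (
        result.data_loss
        or result.permanent_dc_damage
        or result.sustained_cross_region_impact
    )
    return not hard_line_crossed and result.recovery_minutes <= mttr_budget_minutes


def next_stage(current: str, result: ExerciseResult, mttr_budget_minutes: float) -> str:
    """Advance one stage on a passing run; repeat the current stage otherwise."""
    if not within_tolerance(result, mttr_budget_minutes):
        return current
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

For example, a shadow-region run with no hard-line impact and recovery inside the budget would move on to the smallest production region, while a run that lost data would repeat the shadow-region stage after remediation.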

Caveats

  • Architecture-and-discipline voice — no specific region count, region capacity (MW/GPUs), failure count, exact MTTR targets, or post-storm availability numbers disclosed
  • Timeline not disclosed beyond "long-established" DR Storm program
  • Specific Twrko architecture not detailed
  • Belljar detection completeness not quantified
  • Tolerable rack-failure threshold not specified
  • Shadow-region fidelity-to-production not characterised
