CONCEPT Cited by 1 source

Fault domain scaling

Definition

Fault domain scaling is the challenge that arises when the unit of failure being tested or tolerated grows from a sub-regional fault domain (a rack, a row, a building) to an entire region, and the engineering assumptions that held at the smaller scale stop holding at the larger one.

Meta's canonical framing: a typical data center region is 50–60× the size of a typical sub-regional fault domain. Systems that had been battle-tested against the loss of the smaller unit still turned out to have "outstanding vulnerabilities" at region scale. These were not only problems of scale (more services to restart) but also problems of replica placement and of autonomous bootstrapping (services must discover each other without manual intervention).

Why It Matters

The naive assumption ("if we can tolerate losing one fault domain, we can tolerate losing N") breaks because:

  1. Bootstrapping becomes a cold-start problem when all services in a region need to start simultaneously and discover each other autonomously
  2. Replica placement strategies designed for single-fault-domain loss may leave insufficient replicas outside the region
  3. Control-plane self-dependencies that are safely latent under partial failure become existential during full-region recovery (concepts/bootstrapping-circular-dependency)
  4. Signaling mechanisms that orchestrate shutdown/recovery can become victims of the very event they're coordinating (boomerang effect)
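
Point 2 above can be made concrete with a small check. The following is a minimal sketch, not Meta's tooling: the `Replica` type, the `survives_loss` helper, the region and fault-domain names, and the quorum of 2 are all hypothetical, chosen only to illustrate how a placement that tolerates the loss of any single fault domain can still fail when a whole region is lost.

```python
from collections import defaultdict
from typing import NamedTuple


class Replica(NamedTuple):
    region: str        # hypothetical region name
    fault_domain: str  # hypothetical sub-regional fault domain inside that region


def survives_loss(replicas, quorum, level):
    """Return True if at least `quorum` replicas remain after losing
    the single worst-case unit at `level` ("fault_domain" or "region")."""
    per_unit = defaultdict(int)
    for r in replicas:
        # A fault domain is only unique within its region, so key on both.
        key = (r.region, r.fault_domain) if level == "fault_domain" else r.region
        per_unit[key] += 1
    # Worst case: the unit holding the most replicas is the one that fails.
    worst_loss = max(per_unit.values(), default=0)
    return len(replicas) - worst_loss >= quorum


# Illustrative placement: three replicas, two of them in the same region.
placement = [
    Replica("region-a", "fd-1"),
    Replica("region-a", "fd-2"),
    Replica("region-b", "fd-1"),
]

print(survives_loss(placement, quorum=2, level="fault_domain"))  # True
print(survives_loss(placement, quorum=2, level="region"))        # False
```

In this illustration the placement spreads replicas across three distinct fault domains, so any single fault-domain loss leaves two replicas. Losing region-a removes two replicas at once and leaves only one, which is the "insufficient replicas outside the region" case.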

Seen in
