
META 2024-06-16


Meta — Maintaining large-scale AI capacity at Meta

Summary

A Meta Production Engineering post describing how Meta maintains — patches, upgrades, verifies — the GPU training fleet that runs "thousands of training jobs every day from hundreds of different Meta teams" and is on a trajectory to 600,000 GPUs. The hard problem: AI training jobs are interruption-sensitive, hardware-coupled, and capacity-hungry, but the fleet needs continuous firmware/driver/OS updates and per-host verification. Meta's answer is a stack of four primitives: maintenance trains (cyclic small-batch drains), maintenance domains sized to minimise capacity loss while keeping interruption cost acceptable, sliding-window component upgrades that keep the job-facing stack consistent while lower-level components drift, and a unified orchestrator OpsPlanner that serialises overlapping operations at "a million operations per day."

Key takeaways

  1. GPU training imposes five demanding properties on maintenance. Meta names them: (a) capacity guarantees — many jobs are time-critical or online, so large drains are structurally unacceptable; (b) bad hosts are very bad — synchronisation means a single slow/faulty host damages the whole job, far out of proportion to its share; (c) low interruption rate: "many hosts work with each other on a shared problem, AI training jobs are sensitive to interruptions"; (d) rollout safety: "the AI software stack is deep, and problems are often hard to pinpoint"; (e) host consistency: "cluster consistency is highly important for debugging and SEV avoidance." These five jointly motivate every design below. (Source text)
  2. Operational scale: >30 maintenance operations, >50 updated components, 3 verification tasks, thousands of disruptive AI host tasks per day. Meta enumerates these as concrete fleet-level load. "We need to do them safely, while guaranteeing capacity." (Source text)
  3. Overlapping rollouts are mandatory at Meta's scale. Keeping clusters in lock-step — all components on the same version in the same window — is "operationally infeasible" in a diverse fleet. Meta explicitly accepts overlapping rollouts: component upgrades roll out "in a sliding fashion" with inter-component compatibility ensured. This is an explicit Meta statement that consistent cluster state is not achievable at this scale; the architectural response is compatibility-over-synchrony. (Source text)
  4. Maintenance trains: a capacity-predictability primitive. "A small number of servers are taken out of production and maintained with all applicable upgrades." The contract: "all capacity minus one maintenance domain is up and running 24/7." Trains pick up any new upgrade and guarantee a full-visit cycle in a bounded timeframe; longer upgrades can be scheduled across multiple cycles; upgrades can be aligned across trains when beneficial. This is the specific operational mechanism Meta uses for all capacity — compute and storage, not only AI. (Source text)
  5. AI-specific maintenance-train variants. For AI capacity, Meta has "optimized domains that allow for different kinds of AI capacity, very strict SLOs, and a contract with services that allows them to avoid maintenance-train interruptions, if possible." Concretely: some AI jobs can negotiate train-avoidance; the train mechanism respects that. (Source text)
  6. Gradual rollout with two-layer consistency. Meta's core trick for squaring "different servers on different versions" with "AI jobs hate inconsistency": the AI job container (including the CUDA library) is kept consistent across the cluster; only the lower-level components (below the job) drift during a rollout. "Lower-level components often require hours to install and configure or require rebooting the host, while higher-level components in the job container itself can be restarted fluidly." The distinction is load-bearing: upper-layer restart is cheap, so that layer stays locked; lower-layer upgrade is expensive, so it is allowed to drift, as long as it remains compatible with the pinned upper layer. (Source text; see concepts/host-consistency-sliding-upgrade)
  7. Maintenance domain sizing is a cost-of-interruption vs. capacity-buffer trade-off. A maintenance domain is the fraction of capacity taken down in one go. Meta says: "selecting the optimal size is a function of both the cost of interruptions and the capacity that is lost during the maintenance duration. Since interruption costs are high for AI jobs, optimizing this relationship allowed us to significantly reduce the maintenance overhead for AI capacity." The public text doesn't disclose the actual domain sizes, but names the axes explicitly: interruptions per drain ↑ vs reserved buffer capacity ↓. (Source text)
  8. OpsPlanner serialises a million operations/day. Meta's unified "disruptive-work orchestrator" owns (a) serialisation of overlapping operations; (b) safe drain and return to production, with a handover flow that "ensures correct escalation behavior and avoids overlaps and deadlocks"; (c) pre-return verification: "OpsPlanner can ensure upgrades are applied to hosts before they are returned to production"; and (d) planned-maintenance and failure buffers, which it owns and safeguards. Datapoint: "OpsPlanner currently handles a million operations per day." (Source text)
  9. Safety features: autostop, auto-offboard, phased rollout, emergency trains. Meta names four concrete safety primitives: (a) autostop of maintenance trains if maintenance or failure buffers are exhausted, so trains don't starve recovery capacity; (b) automatic offboarding of failing upgrades, so a bad patch is removed from the rollout automatically; (c) rollout phases: "only well-tested changes reach global systems"; and (d) emergency trains and large-scale maintenance for compatibility-breaking upgrades, for when something goes wrong. (Source text)
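The serialisation-plus-verification contract in takeaway 8 can be sketched as a toy per-host operation queue. Everything here (the class name, method names, the verify-callback idea) is hypothetical; the post discloses OpsPlanner's responsibilities, not its implementation.

```python
from collections import deque

class HostOpSerializer:
    """Toy model of takeaway 8: at most one stream of disruptive work owns a
    host at a time, and a host returns to production only after its pending
    upgrades pass verification."""

    def __init__(self):
        self.pending = {}     # host -> deque of (op_name, verify_fn)
        self.in_prod = set()  # hosts currently serving production traffic

    def submit(self, host, op_name, verify_fn):
        # Overlapping requests are queued, never run concurrently on one host.
        self.pending.setdefault(host, deque()).append((op_name, verify_fn))

    def drain_and_run(self, host):
        self.in_prod.discard(host)            # drain: host leaves production
        applied = []
        queue = self.pending.get(host, deque())
        while queue:
            op_name, verify_fn = queue.popleft()
            applied.append(op_name)           # apply the disruptive operation
            if not verify_fn():               # pre-return verification gate
                return applied, False         # failed: host stays out of prod
        self.in_prod.add(host)                # verified: safe to return
        return applied, True
```

In this toy, a failed verification simply leaves the host drained; the real system additionally escalates and protects failure buffers.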

Systems / hardware extracted

  • None: OpsPlanner is the only named system, and its internals are not disclosed (see Caveats).

Concepts extracted

  • concepts/maintenance-domain — the sized unit of capacity taken down in one maintenance action, sized by the interruption-cost vs buffer-cost trade-off. New.
  • concepts/host-consistency-sliding-upgrade — the two-layer discipline: pin the job-container layer; slide lower-level components gradually. New.
  • concepts/overlapping-rollouts — the architectural acceptance that at hyperscale, rollouts cannot be serialised into non-overlapping windows, so upgrades are instead engineered for per-component compatibility and run concurrently. New.
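A minimal sketch of concepts/host-consistency-sliding-upgrade: gate a host's return to the job pool on its drifted lower-level versions being compatible with the cluster-wide pinned container. The component names, versions, and compatibility matrix below are illustrative, not Meta's.

```python
# Cluster-wide pinned job-facing layer (includes the CUDA library in Meta's
# description); the identifier is made up for illustration.
PINNED_CONTAINER = "job-container-2024.06"

# Hypothetical compatibility matrix: for the pinned layer, which lower-level
# component versions may coexist with it during a sliding rollout.
COMPAT = {
    "job-container-2024.06": {
        "gpu-driver": {"535.104", "535.129"},  # sliding window of two versions
        "host-os": {"os-v41", "os-v42"},
    },
}

def host_eligible(host_versions):
    """A host may rejoin the job pool only if every lower-level component is
    within the pinned layer's allowed version set."""
    allowed = COMPAT[PINNED_CONTAINER]
    return all(host_versions.get(component) in versions
               for component, versions in allowed.items())
```

The design point: the matrix only ever needs rows for the single pinned upper layer, which is what makes overlapping lower-level rollouts tractable.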

Existing concepts reinforced:

  • concepts/fleet-patching — Meta's maintenance-train + OpsPlanner substrate is a compute-fleet counterpart to the MongoDB-Atlas managed-database entry already on the wiki; it adds a capacity-guarantee contract that MongoDB's post does not.
  • concepts/maintenance-window — Meta's "contract with services that allows them to avoid maintenance-train interruptions, if possible" is the internal-multi-tenant variant of the MongoDB-Atlas per-customer maintenance window; applies at the service-to-service layer rather than customer-to-vendor.
  • concepts/bad-host-detection — Meta's "bad hosts are very bad" is a GPU-training-specific reinforcement of the existing Presto-Gateway bad-host-detection wiki entry, naming an extended cost model: slow hosts damage synchronised jobs, not just their own query share.
  • concepts/blast-radius — maintenance domain is explicitly a blast-radius sizing decision: a smaller domain means a smaller blast and a smaller reserved buffer per drain, but more drains (and so more interruptions) per full-fleet cycle.
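The sizing axes named in takeaway 7 (interruption cost per drain vs. buffer capacity reserved) can be written as a toy cost model. Meta discloses the axes but no numbers, so the costs below are arbitrary illustrations.

```python
def maintenance_overhead(domain_frac, interrupt_cost, buffer_cost_rate):
    """Toy overhead for one full-fleet maintenance cycle.
    domain_frac: fraction of capacity drained at once (the maintenance domain).
    Smaller domains -> more drains per cycle -> more job interruptions;
    larger domains -> more capacity reserved as buffer while drained."""
    drains_per_cycle = 1.0 / domain_frac
    return drains_per_cycle * interrupt_cost + buffer_cost_rate * domain_frac

def best_domain_size(candidates, interrupt_cost, buffer_cost_rate):
    # Pick the candidate domain size minimising total overhead.
    return min(candidates,
               key=lambda d: maintenance_overhead(d, interrupt_cost,
                                                  buffer_cost_rate))
```

With interrupt_cost = 1 and buffer_cost_rate = 400, the model's optimum over {1%, 5%, 10%, 25%} is the 5% domain: high interruption cost pushes toward larger domains until buffer cost pushes back.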

Patterns extracted

  • patterns/maintenance-train — the cyclic small-batch drain pattern with capacity guarantee ("all capacity minus one maintenance domain"), full-visit-cycle time bound, and multi-upgrade pick-up semantics. New canonical wiki entry.
  • patterns/gradual-rollout-layered-by-stack-depth — the pattern of partitioning the software stack into pinned job-facing layer (kept consistent across the fleet) and drifting lower-level layer (upgraded in sliding fashion with compatibility guarantees). New.
  • patterns/staged-rollout — existing pattern; reinforced via Meta's "rollout phases for upgrades, so that only well-tested changes reach global systems."
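The patterns/maintenance-train contract ("all capacity minus one maintenance domain" up at all times) plus the autostop safety primitive can be sketched as a single loop. The callables are hypothetical stand-ins for the real drain/upgrade/verify machinery.

```python
def run_train_cycle(domains, apply_upgrades, buffers_ok):
    """One full-visit cycle of a maintenance train: domains are drained one
    at a time (so the rest of the fleet stays in production), and the train
    autostops if maintenance/failure buffers are exhausted."""
    visited = []
    for domain in domains:         # cyclic visit, bounded by len(domains) steps
        if not buffers_ok():       # autostop: never starve recovery capacity
            return visited, "autostopped"
        apply_upgrades(domain)     # only this one domain is out of production
        visited.append(domain)
    return visited, "cycle-complete"
```

Multi-cycle upgrades fall out naturally: an upgrade too long for one visit just resumes on the train's next pass through the same domain.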

Operational / architectural numbers

  • Current fleet scale — "dozens of AI clusters of varying sizes"
  • Target scale — 600,000 GPUs (next-year plan from the 2024-06 post)
  • Training jobs per day — "thousands", across "hundreds of different Meta teams"
  • Maintenance operation classes — >30
  • Components updated — >50 different components
  • Host-verification task classes — 3
  • Disruptive AI host tasks per day — "thousands"
  • OpsPlanner throughput — ~1,000,000 operations/day
  • Maintenance-train capacity contract — "all capacity minus one maintenance domain" up 24/7
  • Largest individual training-job class — trillions of parameters, spanning thousands of hosts

Not disclosed: actual maintenance-domain percentages; per-upgrade soak times; job-interruption rate; rollback-success statistics; per-cluster capacity buffer.

Caveats

  • The post is light on mechanism: OpsPlanner is named and its throughput is disclosed, but the internal implementation (queue, scheduler, consensus substrate, state store) is not described. Treat OpsPlanner's wiki entry as a capability description, not an architecture description.
  • No numeric disclosures for the interruption-cost / buffer-cost trade-off that drives maintenance-domain sizing. Meta names the axes but not the numbers, so downstream quantitative analysis is not reproducible from this post.
  • The two-layer job/low-level distinction is stated textually ("the AI job itself, which includes the CUDA library, is always consistent") but the enforcement mechanism — how Meta prevents in-flight jobs from landing on a host with a drifted lower-level component incompatible with their CUDA version — is not described. Presumed to be OpsPlanner's pre-return-verification gate plus a compatibility matrix, but this is inferred rather than stated.
  • Tooling for rare compatibility-breaking upgrades is mentioned but not described. Absent a mechanism, these remain an escape hatch rather than a repeatable pattern.
  • The post's closing paragraphs are promotional ("move fast and learn by doing", "pioneer tomorrow's possibilities") — the architectural substance is in the first ~80% of the post and the diagrams.

Source

Meta Production Engineering blog, "Maintaining large-scale AI capacity at Meta" (2024-06).