CONCEPT Cited by 1 source

Maintenance domain

Definition

A maintenance domain is the fraction of fleet capacity taken offline together in a single maintenance action. It is the unit of blast radius for a routine fleet operation (firmware flash, driver upgrade, OS patch, reboot) and the unit of capacity buffer the rest of the fleet must reserve to absorb the drain.

A maintenance domain is distinct from a failure domain, which is the set of hosts a single unplanned fault can reach. A failure domain describes what breaks without warning; a maintenance domain is its planned counterpart. It is also distinct from a maintenance window: the domain is what is taken down, the window is when.

The sizing trade-off

From sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta, Meta names the two axes explicitly:

"Maintenance domains are selected based on the amount of buffer-reserved capacity (the smaller the better) and the amount of interruptions we cause to training jobs (the bigger the better)."

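Reading that carefully: the goal is a small buffer (reserve no more than needed) and infrequent interruptions (avoid churning through jobs). The two pull in opposite directions, because shrinking the domain shrinks the buffer but multiplies the drains needed to cover the fleet: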

| Choice | Buffer cost | Interruption rate | When this wins |
| --- | --- | --- | --- |
| Small domain (e.g. 0.5% of fleet) | Low: a small buffer covers one domain | High: covering the full fleet takes many small drains | When interruption cost per host is low and buffer cost per host is high |
| Large domain (e.g. 10% of fleet) | High: the reserved buffer must cover one domain | Low: few drains per fleet-wide cycle | When interruption cost per host is high and buffer cost per host is low |

For AI training, interruption cost is very high because a drain fails the whole synchronised job, so Meta tunes toward larger domains drained less often. For stateless serving, interruption cost is low, so the better fit is smaller domains drained more often, with a minimal buffer.
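The trade-off in the table can be sketched with a toy cost model. This is an illustration only, not Meta's method: the linear cost function, the names `DomainPlan` and `cycle_cost`, the 10,000-host fleet, and the cost weights are all assumptions chosen to show how the winning domain size flips with the cost of an interruption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainPlan:
    fleet_hosts: int   # total hosts in the fleet (hypothetical figure)
    domain_hosts: int  # hosts drained together in one maintenance action

    def __post_init__(self) -> None:
        if not 0 < self.domain_hosts <= self.fleet_hosts:
            raise ValueError("domain_hosts must be in 1..fleet_hosts")

    @property
    def drains_per_cycle(self) -> int:
        # Ceiling division: drains needed to touch every host once.
        return -(-self.fleet_hosts // self.domain_hosts)


def cycle_cost(plan: DomainPlan, buffer_cost_per_host: float,
               cost_per_drain: float) -> float:
    """Toy linear cost of one fleet-wide cycle (assumed model)."""
    buffer_cost = plan.domain_hosts * buffer_cost_per_host
    interruption_cost = plan.drains_per_cycle * cost_per_drain
    return buffer_cost + interruption_cost


small = DomainPlan(fleet_hosts=10_000, domain_hosts=50)     # 0.5% of fleet
large = DomainPlan(fleet_hosts=10_000, domain_hosts=1_000)  # 10% of fleet

# Hypothetical weights: a drain is expensive for training, cheap for serving.
training = {"buffer_cost_per_host": 1.0, "cost_per_drain": 500.0}
serving = {"buffer_cost_per_host": 1.0, "cost_per_drain": 1.0}

for name, weights in (("training", training), ("serving", serving)):
    best = min((small, large), key=lambda p: cycle_cost(p, **weights))
    print(name, "->", best.domain_hosts, "hosts per domain")
```

Under these made-up weights the large domain is cheaper for training and the small domain is cheaper for serving. A real model would need measured buffer and interruption costs, which the source does not publish.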

Meta's source names this explicitly:

"Since interruption costs are high for AI jobs, optimizing this relationship allowed us to significantly reduce the maintenance overhead for AI capacity."

Load-bearing properties

  • Capacity predictability. "All capacity minus one maintenance domain is up and running 24/7." The domain is the explicit exception in a capacity contract.
  • Bounded blast radius. A bad upgrade affects only the domain under maintenance, which is already drained; the rest of the fleet is isolated from the failure.
  • Buffer sizing anchor. The planned-maintenance buffer is sized to cover at least one maintenance domain plus the failure buffer for unplanned failures on the rest of the fleet.
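The capacity contract and buffer sizing described above reduce to simple arithmetic. The sketch below assumes capacity is counted in whole hosts and that the domain being drained consists of currently healthy hosts. The function names and the example numbers are hypothetical.

```python
def planned_buffer(domain_hosts: int, failure_buffer_hosts: int) -> int:
    """Reserve: one domain in maintenance plus headroom for unplanned faults."""
    if domain_hosts < 0 or failure_buffer_hosts < 0:
        raise ValueError("host counts must be non-negative")
    return domain_hosts + failure_buffer_hosts


def promised_capacity(fleet_hosts: int, domain_hosts: int,
                      failure_buffer_hosts: int) -> int:
    """Hosts that can be committed to workloads around the clock."""
    reserve = planned_buffer(domain_hosts, failure_buffer_hosts)
    if reserve > fleet_hosts:
        raise ValueError("buffer exceeds fleet size")
    return fleet_hosts - reserve


def safe_to_drain(healthy_hosts: int, domain_hosts: int,
                  committed_hosts: int) -> bool:
    """True if draining one domain still leaves the committed capacity up."""
    return healthy_hosts - domain_hosts >= committed_hosts


# Hypothetical fleet: 10,000 hosts, 500-host domain, 200 hosts for failures.
committed = promised_capacity(10_000, 500, 200)
print(committed)                             # hosts promised to workloads
print(safe_to_drain(9_900, 500, committed))  # 100 hosts already failed
print(safe_to_drain(9_700, 500, committed))  # 300 failed: buffer exhausted
```

In the last call, unplanned failures have exceeded the failure buffer, so starting the next drain would break the contract.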

Scope variations

Meta states that maintenance domains are tuned per workload:

"For AI capacity, we have optimized domains that allow for different kinds of AI capacity, very strict SLOs, and a contract with services that allows them to avoid maintenance-train interruptions, if possible."

So a single fleet may have multiple domain definitions — one per workload class — each with its own SLO and avoid-interruption contract with the dependent service.

What's not disclosed

Meta does not publish the actual percentage sizes of its maintenance domains, the buffer reservation numbers, or the interruption rates achieved. The post names the axes but not the operating point.

Relationship to other fleet primitives

  • Blast radius — the maintenance domain is a specific, planned blast-radius sizing decision.
  • Fleet patching — the capability; domain is the unit the capability operates on.
  • Maintenance window — the when; domain is the what.
  • Overlapping rollouts — multiple rollouts can target different domains simultaneously (or the same domain sequentially, serialised by OpsPlanner).
  • Maintenance train — the operational pattern that cycles through maintenance domains.
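The relationship between a maintenance train and its domains can be illustrated with a minimal sketch: partition the fleet into domains, then process them one at a time. This is not a description of OpsPlanner or of Meta's tooling. The helper names, the host naming scheme, and the stop-on-first-failure policy are assumptions made for the example.

```python
from typing import Callable, Sequence


def partition_into_domains(hosts: Sequence[str],
                           domain_size: int) -> list[list[str]]:
    """Split the fleet into consecutive domains of at most domain_size hosts."""
    if domain_size <= 0:
        raise ValueError("domain_size must be positive")
    return [list(hosts[i:i + domain_size])
            for i in range(0, len(hosts), domain_size)]


def run_train(domains: Sequence[list[str]],
              upgrade: Callable[[list[str]], bool]) -> int:
    """Upgrade one domain at a time; halt at the first failed domain.

    Returns the number of domains upgraded successfully, so a bad
    upgrade never reaches beyond the domain where it was first seen.
    """
    completed = 0
    for domain in domains:
        if not upgrade(domain):
            break
        completed += 1
    return completed


hosts = [f"host{i:02d}" for i in range(10)]
domains = partition_into_domains(hosts, domain_size=4)


def flaky_upgrade(domain: list[str]) -> bool:
    # Hypothetical upgrade step that fails on any domain containing host05.
    return "host05" not in domain


done = run_train(domains, flaky_upgrade)
print(len(domains), "domains,", done, "upgraded before the train halted")
```

Halting on failure is one way to keep the blast radius at a single domain; a production scheduler would also need to handle retries and restoring the failed domain.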

Seen in
