Maintenance train¶
Problem¶
Once a fleet reaches sufficient size, no single maintenance window can drain the whole fleet — the capacity loss is too large, or the window required is too long, or both. But maintenance must happen: firmware updates, OS patches, driver upgrades, host verification, hardware refresh, bad-host drains.
The problem is a steady-state one, not a one-shot upgrade: the fleet accumulates new maintenance work continuously. You need a repeating process that (a) reaches every host within a bounded cycle time, (b) preserves capacity guarantees at every moment, and (c) picks up new maintenance work as it arrives.
The pattern¶
A maintenance train is the cyclic operational primitive:
- Pick a maintenance domain — a sized fraction of the fleet (the "train car").
- Drain that domain from production.
- Apply all pending maintenance operations for that domain: firmware, OS, driver, verification tasks.
- Verify before return — host must pass pre-return checks (OpsPlanner owns this gate at Meta).
- Return the domain to service.
- Advance to the next domain. Repeat indefinitely.
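The loop above can be sketched as follows (a minimal sketch: `Domain`, the stub ops, and the inline verification gate are hypothetical stand-ins for the orchestrator's real machinery):

```python
from dataclasses import dataclass

@dataclass
class Domain:
    name: str
    drained: bool = False
    applied: int = 0

def verify(d):
    # Pre-return gate (OpsPlanner's role at Meta): the domain must pass
    # checks before it goes back into service. Simplified to a stub here.
    return d.drained

def run_train(domains, pending_ops):
    """One full-visit cycle: drain each domain, apply every pending
    maintenance op, verify, return it to service, then advance."""
    order = []
    for d in domains:
        d.drained = True              # drain the domain from production
        for op in pending_ops:        # firmware, OS, driver, verification tasks
            op(d)
        assert verify(d)              # verify before return
        d.drained = False             # return the domain to service
        order.append(d.name)          # advance to the next domain
    return order

# Two hypothetical ops packed into the same visit:
def firmware(d): d.applied += 1
def os_patch(d): d.applied += 1

fleet = [Domain("car-1"), Domain("car-2"), Domain("car-3")]
print(run_train(fleet, [firmware, os_patch]))  # → ['car-1', 'car-2', 'car-3']
```

In steady state the loop repeats indefinitely, with `pending_ops` re-read on each cycle so new maintenance work is picked up as it arrives.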
The contract the train provides to the rest of the fleet:
"Trains provide the guarantee that all capacity minus one maintenance domain is up and running 24/7." (Source: sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta)
This is capacity predictability: the rest of the fleet can plan against a floor of "total minus one domain" without needing to know what the train is currently doing.
The cycle-time bound¶
Trains have an additional property: a full-visit cycle-time guarantee. Every host is visited by the train within a bounded interval.
"Maintenance trains pick up any new upgrade and guarantee a full-visit cycle in a guaranteed timeframe."
This gives upstream teams a rollout SLA: "your patch will reach every host within N days of landing" — so the team shipping a kernel CVE fix can plan its disclosure timeline against the train's visit cadence, without negotiating a bespoke window.
Longer-running upgrades — those requiring multiple train visits per host — can negotiate lower rollout guarantees:
"Longer-running upgrades can have lower rollout guarantees and may be scheduled to be applied in multiple cycles."
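Back-of-envelope arithmetic for the rollout SLA (illustrative numbers, not Meta's):

```python
import math

def full_visit_days(fleet_size, domain_size, days_per_visit):
    """Upper bound on when a newly landed patch reaches every host:
    the train must visit each domain once."""
    n_domains = math.ceil(fleet_size / domain_size)
    return n_domains * days_per_visit

# 10,000 hosts in 500-host domains, one day to drain/apply/verify/return:
print(full_visit_days(10_000, 500, 1))       # → 20 (days until full visit)
# A longer-running upgrade needing two visits per host doubles the bound:
print(2 * full_visit_days(10_000, 500, 1))   # → 40
```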
Per-workload train variants¶
One size doesn't fit all:
"For AI capacity, we have optimized domains that allow for different kinds of AI capacity, very strict SLOs, and a contract with services that allows them to avoid maintenance-train interruptions, if possible."
The pattern generalises: each workload class gets its own train definition — its own domain size, cycle time, SLO, and opt-out protocol. AI training, web serving, and storage might all have different trains on the same physical fleet.
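A per-workload train definition might look like the following (every field name and value here is hypothetical; the source only says that domain sizes, SLOs, and opt-out contracts differ per workload class):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainSpec:
    workload: str
    domain_fraction: float  # fraction of the workload's fleet per train car
    cycle_days: int         # full-visit cycle-time guarantee
    slo_floor: float        # minimum fraction of capacity kept in service
    opt_out: bool           # may services defer non-emergency visits?

TRAINS = [
    TrainSpec("ai-training", domain_fraction=0.02, cycle_days=30, slo_floor=0.97, opt_out=True),
    TrainSpec("web-serving", domain_fraction=0.05, cycle_days=14, slo_floor=0.94, opt_out=False),
    TrainSpec("storage",     domain_fraction=0.01, cycle_days=60, slo_floor=0.98, opt_out=False),
]

# Each train's advertised floor must fit under "total minus one domain":
for t in TRAINS:
    assert t.slo_floor <= 1 - t.domain_fraction
```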
Alignment and overlap¶
Multiple upgrades can be packed into the same train visit:
"If beneficial, upgrades can be aligned."
If two upgrades both require a reboot, doing them in the same drain saves an entire drain cycle.
At Meta scale, multiple trains are in flight simultaneously across the fleet — one overlapping rollout per workload class. The central serialiser (OpsPlanner) ensures two trains don't target the same host at the same time.
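A minimal sketch of that serialisation invariant (hypothetical; the source does not describe OpsPlanner's actual mechanism):

```python
class TrainSerialiser:
    """Grants a train an exclusive claim on the hosts it is about to
    drain; the claim is refused if any host is held by another train."""

    def __init__(self):
        self._held = {}  # host -> name of the train holding it

    def claim(self, train, hosts):
        if any(h in self._held for h in hosts):
            return False              # conflict: this train must wait
        for h in hosts:
            self._held[h] = train
        return True

    def release(self, train, hosts):
        for h in hosts:
            if self._held.get(h) == train:
                del self._held[h]

s = TrainSerialiser()
assert s.claim("ai-train", ["h1", "h2"])
assert not s.claim("web-train", ["h2", "h3"])  # h2 already claimed: refused
s.release("ai-train", ["h1", "h2"])
assert s.claim("web-train", ["h2", "h3"])      # succeeds after release
```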
Safety primitives¶
From the Meta post, the train primitive comes with:
- Autostop on buffer exhaustion. "Autostop of maintenance trains if maintenance or failure buffers are exhausted." The train halts rather than taking capacity below the combined (planned + failure) buffer floor.
- Automatic offboarding of failing upgrades. A patch class observed to fail on multiple hosts is automatically removed from the active rollout — the train doesn't keep inflicting a bad upgrade.
- Rollout phases. "Only well-tested changes reach global systems." New patches start on a small subset of trains, graduating to wider rollout as confidence grows.
- Emergency trains. Reserved capacity for urgent patches outside the normal cycle.
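The autostop rule in particular reduces to a simple pre-drain check (a sketch; modelling buffers as host counts is an assumption, since the source does not give units):

```python
def may_advance(total, in_service, next_domain, planned_buffer, failure_buffer):
    """Autostop check: the train advances only if draining the next domain
    keeps in-service capacity at or above the combined buffer floor."""
    floor = total - planned_buffer - failure_buffer
    return in_service - next_domain >= floor

# 1,000 hosts, 50-host planned buffer, 20-host failure buffer:
assert may_advance(1000, in_service=990, next_domain=50,
                   planned_buffer=50, failure_buffer=20)      # proceed
assert not may_advance(1000, in_service=950, next_domain=50,
                       planned_buffer=50, failure_buffer=20)  # autostop: halt
```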
Contrast with other rollout patterns¶
- Lock-step fleet upgrade — whole fleet drained, upgraded, returned. Works at small scale; fails once the fleet is too large to drain in a single window.
- Canary / phased deployment — one-time rollout from test → canary → global. Train is cyclic, continuous, covers all maintenance classes, not just one release.
- patterns/rapid-fleet-patching-via-managed-service — the MongoDB-Atlas variant; same capability shape, different audience: Atlas patches customer instances on customer-defined maintenance windows, whereas Meta's trains patch internal fleet on Meta-defined domains with internal-service opt-out.
- patterns/staged-rollout — the generic staging pattern. Maintenance train is a cyclic, capacity-preserving specialisation of staged rollout, where every visit is a stage and the domain is the stage unit.
Canonical instance¶
sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta:
- Used for all Meta capacity — compute and storage, not only AI.
- Supports >30 maintenance operation classes, >50 updated components.
- Feeds a million operations per day through OpsPlanner.
- Reduces AI maintenance overhead "significantly" through per-AI domain optimisation (magnitudes not disclosed).
Caveats¶
- Cycle time vs domain size trade-off is real. Smaller domains → a smaller capacity buffer but a slower full-visit cycle (more domains to visit); larger domains → a faster cycle but a larger fraction of the fleet drained at once. Train throughput = domain size × drain rate.
- Opt-out contract risk. Services that negotiate "avoid train interruption if possible" still need to accept emergency-train visits; otherwise you recreate the "silent override" failure mode that concepts/maintenance-window names.
- Verification cost. Pre-return verification must be fast enough not to bottleneck the train — Meta names three host-verification tasks; at higher counts, the verification time becomes the rate-limit on train throughput.
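The cycle-time and verification caveats combine into one piece of arithmetic (illustrative numbers): per-visit verification time is paid once per domain, so shrinking the domain multiplies it.

```python
import math

def cycle_days(fleet, domain, apply_hours, verify_hours):
    """Full-visit cycle length: every domain pays apply time plus
    pre-return verification time once per visit."""
    n_domains = math.ceil(fleet / domain)
    return n_domains * (apply_hours + verify_hours) / 24

# 10,000 hosts, 6h of upgrades + 2h of verification per visit:
print(cycle_days(10_000, 1000, 6, 2))  # 10 domains: ~3.3 days, 10% drained at once
print(cycle_days(10_000, 100, 6, 2))   # 100 domains: ~33 days, 1% drained at once
```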
Seen in¶
- sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — the originating source; Meta's compute+storage fleet, "all capacity minus one maintenance domain" contract, per-AI-workload train variants. Canonical wiki instance.
Related¶
- concepts/maintenance-domain — the sizing unit each train visit operates on.
- concepts/fleet-patching — the capability class.
- concepts/maintenance-window — the customer-facing scheduling contract; internal-service variant in Meta's fleet.
- concepts/overlapping-rollouts — multiple trains run concurrently in the same fleet.
- concepts/host-consistency-sliding-upgrade — the discipline that makes each train's lower-level upgrade safe for synchronised AI jobs.
- concepts/blast-radius — the train's per-visit drain is a planned blast-radius decision.
- patterns/staged-rollout — the generic parent pattern.
- patterns/gradual-rollout-layered-by-stack-depth — the two-layer rollout shape each train runs.
- patterns/rapid-fleet-patching-via-managed-service — the customer-facing sibling variant (MongoDB Atlas).
- systems/opsplanner — Meta's orchestrator that owns train execution, buffers, and pre-return verification.