CONCEPT Cited by 1 source
Overlapping rollouts¶
Definition¶
Overlapping rollouts is the architectural acceptance that in a sufficiently large fleet, multiple concurrent upgrade campaigns — touching different components, different subsets of hosts, with different rollout speeds — cannot be serialised into non-overlapping time windows. The response is not to serialise, but to engineer per-component compatibility so concurrent rollouts don't interfere.
At any given moment at Meta:
- Multiple firmware classes are rolling out.
- Multiple driver versions are rolling out.
- OS patches are rolling out.
- Verification tasks are running.
- Bad-host auto-drains are firing.
- Emergency trains may be running.
The fleet is never in a single coherent version state. Meta accepts this rather than fighting it:
"Given the variety of upgrades, we have a large amount of overlapping inflight changes at any given time, including some that are consistently being applied, such as verification tasks." (Source: sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta)
The architectural flip¶
Small-scale operations pursue version coherence: one cluster, one known version state, one upgrade window. Meta explicitly names this as infeasible:
"In smaller environments it is often possible to keep clusters in a consistent state and upgrade the whole cluster and all of its firmware and software components in the same maintenance window. Doing this in a large, diverse environment like Meta, however, would introduce big risks and be operationally infeasible. Instead, we ensure components are compatible with each other and roll component upgrades up in a sliding fashion."
The architectural flip:
| Small-scale | Hyperscale |
|---|---|
| Enforce single version state | Enforce pairwise component compatibility |
| Serialise rollouts across time | Run rollouts in parallel |
| Maintenance window per upgrade | Continuous rolling maintenance |
| One consistent cluster | One set of compatible-pair constraints |
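To make "compatible-pair constraints" concrete, the sketch below treats a host as valid when every pair of installed component versions appears on an allow-list of validated pairs. The component names, version strings, and the `VALIDATED_PAIRS` set are hypothetical and chosen only for illustration; the source does not describe how Meta stores or checks compatibility.

```python
from itertools import combinations

# Hypothetical allow-list: unordered pairs of (component, version) that
# have been validated to coexist. Real data would come from test results.
VALIDATED_PAIRS = {
    frozenset({("firmware", "2.1"), ("driver", "535")}),
    frozenset({("firmware", "2.1"), ("kernel", "6.4")}),
    frozenset({("driver", "535"), ("kernel", "6.4")}),
    frozenset({("firmware", "2.2"), ("driver", "535")}),
    frozenset({("firmware", "2.2"), ("kernel", "6.4")}),
}


def incompatible_pairs(host_versions: dict[str, str]) -> list[tuple]:
    """Return every pair of installed component versions not on the allow-list."""
    installed = sorted(host_versions.items())
    return [
        (a, b)
        for a, b in combinations(installed, 2)
        if frozenset({a, b}) not in VALIDATED_PAIRS
    ]


def can_apply(host_versions: dict[str, str], component: str, new_version: str) -> bool:
    """An upgrade is admissible if the resulting host state has no unvalidated pair."""
    candidate = {**host_versions, component: new_version}
    return not incompatible_pairs(candidate)


host = {"firmware": "2.1", "driver": "535", "kernel": "6.4"}
print(can_apply(host, "firmware", "2.2"))  # True: all resulting pairs are validated
print(can_apply(host, "driver", "550"))    # False: driver 550 has no validated pairs
```

Nothing in the check refers to a fleet-wide version: two hosts can sit on different firmware versions at the same time as long as each host's own pairs are validated.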
What you need to make this work¶
To run overlapping rollouts without chaos:
- Pairwise compatibility guarantees. For every pair of component versions that can coexist on the same host / cluster / job, tests must have validated coexistence.
- Central serialiser for host-level conflicts. Two operations must never drain and upgrade the same host at the same time, so one system has to arbitrate host ownership. At Meta this is OpsPlanner.
- Sliding-window discipline for heavy operations. Lower-level upgrades, which can take hours to install, need host-consistency sliding so that different hosts can run different lower-layer versions simultaneously without breaking jobs.
- Capacity-preserving concurrency. Overlapping rollouts are only safe if the combined drained capacity stays within the planned maintenance buffer. This is why maintenance domain sizing matters — concurrent operations share the buffer.
- Alignment when beneficial. When two operations would benefit from coinciding (e.g. two upgrades both needing reboot), the orchestrator can align them:
"So you can have many overlapping upgrades, and, if beneficial, upgrades can be aligned."
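Two of the requirements above, host-level serialisation and a shared capacity buffer, can be sketched as a single admission check. This is a toy model under stated assumptions: the `Planner` class, its method names, and the host and rollout identifiers are hypothetical and do not reflect OpsPlanner's real interface. Alignment of operations is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Planner:
    """Toy stand-in for a central planner (hypothetical, not OpsPlanner's API)."""

    buffer_hosts: int  # max hosts drained at once, summed across all rollouts
    owner: dict[str, str] = field(default_factory=dict)  # host -> rollout holding it

    def try_drain(self, rollout: str, hosts: list[str]) -> bool:
        """Grant the whole batch or nothing, so no rollout holds a partial claim."""
        if any(h in self.owner for h in hosts):
            return False  # host-level conflict: another rollout already owns a host
        if len(self.owner) + len(hosts) > self.buffer_hosts:
            return False  # combined drained capacity would exceed the buffer
        for h in hosts:
            self.owner[h] = rollout
        return True

    def release(self, rollout: str) -> None:
        """Return all of a rollout's hosts to service."""
        self.owner = {h: r for h, r in self.owner.items() if r != rollout}


planner = Planner(buffer_hosts=3)
print(planner.try_drain("firmware-2.2", ["h1", "h2"]))  # True
print(planner.try_drain("driver-550", ["h2", "h3"]))    # False: h2 is already held
print(planner.try_drain("driver-550", ["h3", "h4"]))    # False: 2 + 2 exceeds the buffer of 3
print(planner.try_drain("driver-550", ["h3"]))          # True: 2 + 1 fits
planner.release("firmware-2.2")
print(planner.try_drain("kernel-6.5", ["h1", "h2"]))    # True: capacity was returned
```

Granting a batch atomically is a design choice in this sketch: a rollout never sits on half of its hosts while waiting for the rest, which removes one common way for two rollouts to block each other.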
Why this is different from "staged rollout"¶
A staged rollout is a single rollout progressing through phases. Overlapping rollouts is the property of the fleet running many independent staged rollouts concurrently — each at a different phase, each touching different components, with no global synchronisation gate between them.
Staged rollout is a pattern applied within one operation; overlapping rollouts is the property of the fleet across operations.
Operational risks this creates¶
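- Combinatorial test explosion. With C component classes and V versions of each, the number of cross-component version pairs to validate grows roughly as C² × V². Teams typically compromise: validate only pairs drawn from the current and previous release of each component (N and N-1), not every combination.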
- Diagnostic ambiguity. When a host misbehaves, "which of the 5 in-flight upgrades caused it?" is harder than with serialised rollouts. Mitigated by per-component rollback and bad-host detection.
- Deadlock risk. Two operations each waiting on buffer capacity the other holds. The source names this as a load-bearing concern OpsPlanner handles: "a built-in handover flow that ensures correct escalation behavior and avoids overlaps and deadlocks."
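A rough sense of the test explosion comes from counting pairs. The sketch below assumes every component class has the same number of live versions and that only pairs across different classes need validation, since two versions of one component do not coexist on a host. The function name and the example numbers are illustrative, not figures from the source.

```python
from math import comb


def cross_component_pairs(classes: int, versions_per_class: int) -> int:
    """Count version pairs across distinct component classes.

    There are comb(classes, 2) ways to pick two classes, and each choice
    contributes versions_per_class ** 2 version combinations.
    """
    return comb(classes, 2) * versions_per_class ** 2


# Full matrix: 6 component classes with 5 live versions each.
print(cross_component_pairs(6, 5))  # 375

# Restricted to the current and previous release of each component.
print(cross_component_pairs(6, 2))  # 60
```

Under these assumptions, limiting support to two releases per component cuts the validation set by roughly a factor of six in this example, at the cost of forcing hosts on older versions to upgrade before newer components can land.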
Seen in¶
- sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — the originating source; Meta names the principle and the mitigations. Canonical wiki instance.
Related¶
- concepts/fleet-patching — the capability class; overlapping rollouts is the concurrency property of a mature fleet-patching substrate.
- concepts/host-consistency-sliding-upgrade — the specific discipline that makes overlapping rollouts safe for synchronised workloads.
- concepts/maintenance-domain — the unit whose sizing must account for overlapping rollouts, since concurrent operations share its buffer.
- concepts/maintenance-window — the customer-facing contract; overlapping rollouts make the "which window" question multi-dimensional.
- patterns/maintenance-train — the rollout pattern; multiple trains can be in flight simultaneously.
- patterns/gradual-rollout-layered-by-stack-depth — the per-operation shape each train runs.
- patterns/staged-rollout — the per-rollout pattern; this concept is the fleet-level property of running many of them concurrently.
- systems/opsplanner — Meta's serialiser for host-level conflicts between overlapping rollouts.