CONCEPT Cited by 1 source

Overlapping rollouts¶

Definition¶

Overlapping rollouts is the architectural acceptance that in a sufficiently large fleet, multiple concurrent upgrade campaigns — touching different components, different subsets of hosts, with different rollout speeds — cannot be serialised into non-overlapping time windows. The response is not to serialise, but to engineer per-component compatibility so concurrent rollouts don't interfere.

At any given moment at Meta:

Multiple firmware classes are rolling out.
Multiple driver versions are rolling out.
OS patches are rolling out.
Verification tasks are running.
Bad-host auto-drains are firing.
Emergency trains may be running.

The fleet is never in a single coherent version state. Meta accepts this rather than fighting it:

"Given the variety of upgrades, we have a large amount of overlapping inflight changes at any given time, including some that are consistently being applied, such as verification tasks." (Source: sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta)

The architectural flip¶

Small-scale operations pursue version coherence: one cluster, one known version state, one upgrade window. Meta explicitly names this as infeasible:

"In smaller environments it is often possible to keep clusters in a consistent state and upgrade the whole cluster and all of its firmware and software components in the same maintenance window. Doing this in a large, diverse environment like Meta, however, would introduce big risks and be operationally infeasible. Instead, we ensure components are compatible with each other and roll component upgrades up in a sliding fashion."

Side by side, the two approaches differ as follows:

Small-scale	Hyperscale
Enforce single version state	Enforce pairwise component compatibility
Serialise rollouts across time	Run rollouts in parallel
Maintenance window per upgrade	Continuous rolling maintenance
One consistent cluster	One set of compatible-pair constraints
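
The right-hand column can be sketched in code. This is a minimal illustration, not Meta's implementation: the allow-list, component names, and version strings are all hypothetical. It assumes that "compatible" means every pair of component versions installed on a host appears on a list of pairs that tests have validated.

```python
from itertools import combinations

# Hypothetical allow-list: each entry is a pair of (component, version)
# tuples that testing has validated as able to coexist on one host.
VALIDATED_PAIRS = {
    frozenset({("firmware", "2.1"), ("driver", "535")}),
    frozenset({("firmware", "2.1"), ("os", "9.3")}),
    frozenset({("driver", "535"), ("os", "9.3")}),
    frozenset({("firmware", "2.2"), ("driver", "535")}),
    frozenset({("firmware", "2.2"), ("os", "9.3")}),
}


def incompatible_pairs(host_state):
    """Return every installed pair that is missing from the allow-list."""
    installed = sorted(host_state.items())
    return [
        (a, b)
        for a, b in combinations(installed, 2)
        if frozenset({a, b}) not in VALIDATED_PAIRS
    ]


def is_compatible(host_state):
    """A host is acceptable when no installed pair is unvalidated."""
    return not incompatible_pairs(host_state)


# Two hosts on different firmware versions are both acceptable: there is
# no single fleet-wide version state, only pairwise constraints.
host_a = {"firmware": "2.1", "driver": "535", "os": "9.3"}
host_b = {"firmware": "2.2", "driver": "535", "os": "9.3"}
# A driver version that has not been validated against anything is rejected.
host_c = {"firmware": "2.2", "driver": "550", "os": "9.3"}
```

In this sketch host_a and host_b pass even though they differ, while host_c fails because its driver version has no validated pairing with the installed firmware or OS.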

What you need to make this work¶

To run overlapping rollouts without chaos:

Pairwise compatibility guarantees. For every pair of component versions that can coexist on the same host / cluster / job, tests must have validated coexistence.
Central serialiser for host-level conflicts. Two operations cannot drain and upgrade the same host at the same time, so something must arbitrate between them. OpsPlanner is Meta's serialiser.
Sliding-window discipline for heavy operations. Lower-level upgrades (which can take hours to install) need host-consistency sliding so different hosts can be on different lower-layer versions simultaneously without breaking jobs.
Capacity-preserving concurrency. Overlapping rollouts are only safe if the combined drained capacity stays within the planned maintenance buffer. This is why maintenance domain sizing matters — concurrent operations share the buffer.
Alignment when beneficial. When two operations would benefit from coinciding (e.g. two upgrades both needing reboot), the orchestrator can align them:

"So you can have many overlapping upgrades, and, if beneficial, upgrades can be aligned."

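Two of the requirements above, host-level serialisation and capacity-preserving concurrency, can be sketched as a single admission check. This is a toy model under stated assumptions, not a description of how OpsPlanner works: the `Admission` class, its method names, and the idea of counting the buffer in whole hosts are all hypothetical, and alignment of operations is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Admission:
    """Toy admission control: one operation per host, one shared buffer."""

    buffer_hosts: int  # assumed: max hosts that may be drained at once
    owner: dict = field(default_factory=dict)  # host -> operation holding it

    def try_drain(self, operation: str, host: str) -> bool:
        """Grant the drain only if the host is free and the buffer has room."""
        if host in self.owner:
            return False  # host-level conflict: another operation holds it
        if len(self.owner) >= self.buffer_hosts:
            return False  # combined drained capacity would exceed the buffer
        self.owner[host] = operation
        return True

    def release(self, operation: str, host: str) -> None:
        """Return the host to service; only the holding operation may do so."""
        if self.owner.get(host) == operation:
            del self.owner[host]


ctl = Admission(buffer_hosts=2)
first = ctl.try_drain("firmware-rollout", "host-1")   # granted
clash = ctl.try_drain("driver-rollout", "host-1")     # refused: host is held
second = ctl.try_drain("driver-rollout", "host-2")    # granted
over = ctl.try_drain("os-patch", "host-3")            # refused: buffer is full
ctl.release("firmware-rollout", "host-1")
retry = ctl.try_drain("os-patch", "host-3")           # granted after release
```

The point of the sketch is that the buffer is shared: the third rollout is refused not because of anything it does, but because two other rollouts already hold the drained capacity.
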
Why this is different from "staged rollout"¶

A staged rollout is a single rollout progressing through phases. Overlapping rollouts is the property of the fleet running many independent staged rollouts concurrently — each at a different phase, each touching different components, with no global synchronisation gate between them.

Staged rollout is a pattern applied within one operation; overlapping rollouts is the property of the fleet across operations.

Operational risks this creates¶

Combinatorial test explosion. With K component classes at M versions each, the compatibility matrix to validate grows quickly. Teams typically compromise: test only the supported pairs drawn from the current version (N) and its predecessor (N-1), not every combination.
Diagnostic ambiguity. When a host misbehaves, "which of the 5 in-flight upgrades caused it?" is harder than with serialised rollouts. Mitigated by per-component rollback and bad-host detection.
Deadlock risk. Two operations each waiting on buffer capacity the other holds. The source names this as a load-bearing concern OpsPlanner handles: "a built-in handover flow that ensures correct escalation behavior and avoids overlaps and deadlocks."
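
The size of the test matrix in the first risk can be made concrete with a little arithmetic. The sketch below assumes every component class has the same number of live versions and that a pairwise test covers one version of one class against one version of another; the figures of 6 classes and 5 versions are illustrative, not taken from the source.

```python
from math import comb


def pairwise_tests(classes: int, versions_per_class: int) -> int:
    """Cross-component version pairs: choose 2 classes, then a version of each."""
    return comb(classes, 2) * versions_per_class ** 2


def full_configurations(classes: int, versions_per_class: int) -> int:
    """Distinct whole-host version states if every combination were allowed."""
    return versions_per_class ** classes


all_pairs = pairwise_tests(6, 5)          # 15 class pairs * 25 version pairs = 375
n_and_n_minus_1 = pairwise_tests(6, 2)    # 15 class pairs * 4 version pairs = 60
whole_host = full_configurations(6, 5)    # 5 ** 6 = 15625
```

Under these assumptions, restricting support to N and N-1 cuts the pairwise matrix from 375 tests to 60, and either figure is far smaller than the 15625 whole-host states that a "validate every configuration" approach would face.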

Seen in¶

sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — the originating source; Meta names the principle and the mitigations. Canonical wiki instance.

concepts/fleet-patching — the capability class; overlapping rollouts is the concurrency property of a mature fleet-patching substrate.
concepts/host-consistency-sliding-upgrade — the specific discipline that makes overlapping rollouts safe for synchronised workloads.
concepts/maintenance-domain — the unit domain-sizing must account for when rollouts overlap.
concepts/maintenance-window — the customer-facing contract; overlapping rollouts make the "which window" question multi-dimensional.
patterns/maintenance-train — the rollout pattern; multiple trains can be in flight simultaneously.
patterns/gradual-rollout-layered-by-stack-depth — the per-operation shape each train runs.
patterns/staged-rollout — the per-rollout pattern; this concept is the fleet-level property of running many of them concurrently.
systems/opsplanner — Meta's serialiser for host-level conflicts between overlapping rollouts.