
PATTERN Cited by 1 source

Gradual rollout layered by stack depth

Problem

A single operational primitive must cover both ends of a cost-asymmetric spectrum:

  • Host-level components (firmware, kernel, drivers, OS) take "hours to install and configure or require rebooting the host." A whole-cluster lock-step upgrade is operationally infeasible at scale.
  • Job-container components (CUDA library, framework, model code) need to stay consistent across the cluster — synchronised AI training jobs cannot tolerate per-host version divergence.

A naive single-tier rollout is stuck: a pace slow enough to be safe on the lower layer is too slow to maintain consistency on the upper layer.

The pattern

Partition the host software stack by rollout cost and pin the layers with opposite policies:

  1. Identify the split point. For Meta, it's the job container boundary:
       • Above: job container + CUDA library + framework + model — the job-facing layer.
       • Below: driver + firmware + kernel + OS — the host-facing layer.
  2. Pin the upper layer across the cluster. One version at a time, cluster-wide. Restarts are cheap (container restart), so consistency is cheap to maintain.
  3. Slide the lower layer gradually. Different hosts can be on different lower-layer versions during a rollout window. Install is expensive, so this is the only economical path.
  4. Engineer the compatibility matrix. Every combination of (pinned-upper × in-flight-lower) must have been tested compatible.
  5. Gate pre-return verification. A host finishing a lower-layer upgrade cannot return to service until its new lower-layer version is verified compatible with the currently pinned upper layer.
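Steps 4 and 5 above can be sketched as a single membership check. This is an illustrative sketch, not Meta's actual tooling; all names (`COMPAT_MATRIX`, `PINNED_UPPER`, the version strings) are hypothetical:

```python
# Tested-compatible combinations of (pinned upper layer, lower layer).
# In practice this matrix would be published by the lower-layer teams.
COMPAT_MATRIX: set[tuple[str, str]] = {
    ("cuda-12.2", "driver-535"),
    ("cuda-12.2", "driver-550"),  # new lower version, already qualified
    ("cuda-12.4", "driver-550"),
}

PINNED_UPPER = "cuda-12.2"  # one version at a time, cluster-wide


def may_return_to_service(host_lower_version: str) -> bool:
    """Pre-return gate: a host finishing a lower-layer upgrade re-enters
    the pool only if its new lower-layer version has been tested against
    the currently pinned upper layer."""
    return (PINNED_UPPER, host_lower_version) in COMPAT_MATRIX


print(may_return_to_service("driver-550"))  # True  — qualified combination
print(may_return_to_service("driver-560"))  # False — untested, host stays drained
```

The gate is deliberately conservative: an untested combination keeps the host out of service rather than risking a synchronised job on an unknown pairing.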

See concepts/host-consistency-sliding-upgrade for the underlying concept.

Why layer the rollout by stack depth

The pattern works because upgrade cost increases with stack depth, while the need for consistency decreases with it:

Stack layer               Upgrade cost    Restart mechanism       Consistency needed?
Model weights / code      Cheap           Container restart       Yes — synchronised jobs
CUDA library / framework  Moderate        Container rebuild       Yes — synchronised jobs
Kernel / drivers          Expensive       Host reboot             No — jobs indirect
Firmware                  Very expensive  Host reboot + re-flash  No — jobs indirect

The split is pragmatic: find the layer below which the application doesn't observe the version (directly), and above which it does.

Canonical instance

From sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta:

"At Meta, we've ensured that jobs have a consistent stack but upgrade lower-level components in a gradual fashion. In contrast to this, the AI job itself, which includes the CUDA library, is always consistent. This distinction is necessary because lower-level components often require hours to install and configure or require rebooting the host, while higher-level components in the job container itself can be restarted fluidly."

Pinning the CUDA library while sliding firmware and drivers is the specific split point Meta chose. Other domains would pin at different points:

  • Web serving: pin the application container; slide kernel below. (Most serving fleets already do this.)
  • Database fleet: pin the database binary + on-disk format; slide OS below.
  • Kubernetes fleet: pin the pod image; slide node OS below.
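The same split can be expressed generically: each layer carries a rollout policy, and the split point is simply the first layer allowed to drift. A minimal sketch, with the web-serving example above; the `Rollout` and `Layer` names are illustrative, not from the source:

```python
from dataclasses import dataclass
from enum import Enum


class Rollout(Enum):
    PINNED = "one version, cluster-wide"
    SLIDING = "mixed versions during a rollout window"


@dataclass
class Layer:
    name: str
    policy: Rollout


# A web-serving fleet in the pattern's terms: pin the app, slide the rest.
stack = [  # ordered top (job-facing) to bottom (host-facing)
    Layer("application container", Rollout.PINNED),
    Layer("kernel", Rollout.SLIDING),
    Layer("firmware", Rollout.SLIDING),
]

# The split point is the first layer, walking down, allowed to drift.
split_point = next(l.name for l in stack if l.policy is Rollout.SLIDING)
print(split_point)  # kernel
```

Swapping the layer list reproduces the database and Kubernetes variants: the pattern is the policy assignment, not the specific layers.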

Mechanism requirements

  1. Strong pinning of the upper layer. When the cluster decides to switch CUDA versions, the switch is coordinated fleet-wide within a short window — no drift allowed. This implies container-level versioning, not per-host.
  2. Pre-return verification hook. Every lower-layer upgrade that returns a host to service must verify the pinned upper layer is still compatible. Meta's OpsPlanner owns this.
  3. Tested compatibility matrix. Published by the lower-layer teams, consumed by the pre-return gate. Every combination that can coexist in production must have been integration-tested.
  4. Tooling for rare upper-layer changes. When the pinned layer must itself change (CUDA major version bump), the pattern collapses — you need a coordinated upper-layer rollout, typically paired with a sliding-window lower-layer co-rollout. The Meta post acknowledges this:

    "We also added tooling for rare compatibility-breaking upgrades." The tooling is not described.

  5. Bad-host detection (concepts/bad-host-detection) must be tuned to distinguish "host misbehaving because of a bad lower-layer version" from "host misbehaving for any other reason" — otherwise your auto-drain incorrectly attributes blame.
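Requirement 5 amounts to attributing failures to a version rather than a host when hosts on one lower-layer version fail at a rate far above baseline. A hypothetical sketch; the thresholds and function name are illustrative assumptions, not from the source:

```python
from collections import defaultdict


def blame_lower_layer(failures: list[tuple[str, str]],
                      hosts_per_version: dict[str, int],
                      baseline_rate: float = 0.01,
                      factor: float = 3.0) -> set[str]:
    """Flag lower-layer versions whose per-host failure rate exceeds
    `factor` times the fleet baseline. Failing hosts on unflagged
    versions are treated as individually bad instead.
    (Thresholds here are illustrative, not Meta's.)"""
    count: dict[str, int] = defaultdict(int)
    for _host, version in failures:          # (host_id, lower_layer_version)
        count[version] += 1
    return {
        version for version, n in count.items()
        if n / hosts_per_version[version] > factor * baseline_rate
    }


# Two of twenty hosts on driver-560 fail; one of a thousand on driver-550.
failures = [("h1", "driver-560"), ("h2", "driver-560"), ("h9", "driver-550")]
fleet = {"driver-560": 20, "driver-550": 1000}
print(blame_lower_layer(failures, fleet))  # {'driver-560'}
```

Here driver-560's 10% failure rate is attributed to the version, so the rollout is paused rather than the hosts drained one by one, while the lone driver-550 failure is left to ordinary bad-host handling.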

Compared to alternatives

  • Single-tier sliding rollout. One version at a time across everything. Works on small scale; operationally infeasible at Meta scale per the source.
  • Per-job host selection. Job scheduler picks only hosts with compatible versions. Adds scheduling complexity; reduces effective fleet utilisation; still requires a compatibility matrix but doesn't solve the upgrade economics.
  • Lock-step whole-cluster upgrade. Drain the whole cluster, upgrade everything, return. Requires a window long enough to drain the whole cluster — Meta explicitly declares this infeasible.

Caveats

  • Finding the pin point is design work. The split is not obvious for systems without Meta's hard CUDA-library synchronisation requirement. Get it wrong and either (a) the pinned layer changes too often (loses the economic benefit) or (b) the drifting layer includes something the job observes (breaks synchronised workloads).
  • The compatibility matrix is combinatorial. N lower-layer versions × M upper-layer versions. Most teams compromise to "last two" × "current two" rather than full coverage.
  • Not applicable for stateless serving. Stateless workloads tolerate version divergence at any layer — they just retry on another host. The pattern's economic benefit shrinks in proportion to how much consistency the workload genuinely needs.
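The "last two × current two" compromise above can be sketched as simple matrix pruning. The function name and version strings are hypothetical; the point is only how quickly the test surface grows and how pruning bounds it:

```python
from itertools import product


def supported_combinations(lower_versions: list[str],
                           upper_versions: list[str],
                           lower_keep: int = 2,
                           upper_keep: int = 2) -> set[tuple[str, str]]:
    """Prune the N x M compatibility matrix to the newest few versions
    per layer — the 'last two x current two' compromise. Only these
    combinations are integration-tested and allowed to coexist."""
    return set(product(upper_versions[-upper_keep:],
                       lower_versions[-lower_keep:]))


lower = ["driver-535", "driver-550", "driver-560"]  # oldest to newest
upper = ["cuda-12.2", "cuda-12.4"]

combos = supported_combinations(lower, upper)
print(len(combos))  # 4 combinations to test, not 6; driver-535 is retired
```

Anything outside the pruned set fails the pre-return gate, which is what forces stragglers off retired lower-layer versions.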

