
Host-consistency sliding upgrade

Definition

Host-consistency sliding upgrade is the rollout discipline of partitioning the host software stack into two layers and treating them with opposite policies:

  • Upper layer — job-facing. Kept consistent across the whole cluster at any given instant. For AI training at Meta, this is the "AI job itself, which includes the CUDA library." Restart cost is cheap (container restart, not host reboot), so consistency is cheap to maintain.
  • Lower layer — host-level. Allowed to drift during rollout. Firmware, kernel, drivers, OS packages. Install cost is expensive (hours to configure, may require reboot), so sliding-window rollouts are the only economical approach.

The rollout is gradual on the lower layer, instantaneous on the upper layer — and the compatibility matrix between them is engineered so any lower-layer version compatible with the pinned upper-layer version is acceptable during the drift window.
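The drift rule above can be sketched as a lookup against a tested compatibility matrix. This is a minimal illustration, not Meta's implementation; all names and version strings are invented:

```python
# Drift rule sketch: during rollout, a lower-layer version is acceptable iff
# the tested compatibility matrix approves it against the pinned upper layer.
# PINNED_UPPER, COMPAT_MATRIX, and the version strings are illustrative.

PINNED_UPPER = "cuda-12.4"  # job container + CUDA library, one version cluster-wide

# Tested combinations: pinned upper version -> acceptable lower-layer versions.
COMPAT_MATRIX = {
    "cuda-12.4": {"driver-535.104", "driver-535.129", "driver-550.54"},
}

def may_drift_to(lower_version: str, pinned_upper: str = PINNED_UPPER) -> bool:
    """True iff this lower-layer version may coexist with the pinned upper layer."""
    return lower_version in COMPAT_MATRIX.get(pinned_upper, set())

print(may_drift_to("driver-550.54"))   # tested combination: acceptable mid-rollout
print(may_drift_to("driver-560.28"))   # untested combination: blocked
```

The key property is that the matrix is keyed by the pinned upper-layer version: when the pin changes, the set of acceptable lower-layer versions changes with it.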

Why the two-layer split

Meta's rationale from sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta:

"In contrast to this, the AI job itself, which includes the CUDA library, is always consistent. This distinction is necessary because lower-level components often require hours to install and configure or require rebooting the host, while higher-level components in the job container itself can be restarted fluidly."

Restart cost asymmetry is the load-bearing property:

  • A CUDA library version mismatch between hosts running a synchronised training job = the whole job fails. Keep this layer consistent.
  • A firmware version mismatch, with drivers and kernels on compatible versions = the job is unaffected. Allow this layer to drift.

Why consistency matters at all for synchronised workloads

Training jobs run synchronised collective communication across hosts — all-reduce, all-gather, reduce-scatter. If one host's CUDA library behaves differently, the whole collective goes wrong: a hang, a wrong result, or a crash, with every other participant blocked on the misbehaving host. Unlike stateless serving, you cannot just retry the request on a different host — the job is already committed to the full host set.
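The consistency requirement amounts to a pre-launch gate: every host must report the identical upper-layer version before the collective starts. A hedged sketch with hypothetical host and version names:

```python
def check_upper_layer_consistent(host_versions: dict[str, str]) -> None:
    """Refuse to launch a synchronised job unless every host reports the same
    job-facing (upper-layer) version; one divergent host dooms the collective."""
    distinct = set(host_versions.values())
    if len(distinct) > 1:
        offenders = sorted(f"{h}={v}" for h, v in host_versions.items())
        raise RuntimeError(f"upper-layer mismatch across hosts: {offenders}")

# All hosts pinned to the same version: launch proceeds.
check_upper_layer_consistent({"h0": "cuda-12.4", "h1": "cuda-12.4"})

# One host drifted on the job-facing layer: whole-job refusal, not a per-host retry.
try:
    check_upper_layer_consistent({"h0": "cuda-12.4", "h1": "cuda-12.3"})
except RuntimeError as e:
    print(e)
```

Note the gate is all-or-nothing — the failure mode of a mismatch is whole-job, so the check must be too.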

This is structurally parallel to "bad hosts are very bad" (bad-host detection): a slow or subtly-different host imposes whole-job cost, not proportional cost.

What this pattern is not

  • Not canary deployment — canary is "one host first, then many". Sliding upgrade is "all hosts, but heterogeneous, at a controlled rate."
  • Not feature flag — feature flags control behaviour from config; sliding upgrade controls binary/firmware from the host layer.
  • Not rolling restart — rolling restart keeps binaries consistent; sliding upgrade explicitly allows them to diverge at the lower layer.

Mechanism requirements

To operate this discipline safely you need:

  1. Explicit two-layer partition. Pin the job-container + CUDA layer. Everything below it is the drift-allowed layer.
  2. Compatibility matrix. Every combination of upper-layer pinned version × lower-layer rolled version must have been tested.
  3. Pre-return verification. A host finishing a lower-layer upgrade must verify compatibility with the currently-pinned upper layer before returning to service — Meta's OpsPlanner owns this gate.
  4. Rollout tooling for rare compat-breaking upgrades. The Meta post acknowledges this case exists; when the pinned layer must also change, you need a different, coordinated rollout — not the sliding pattern.
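Requirements 2 and 3 combine into a gate a host must pass before re-entering service. The sketch below is an assumed shape, not Meta's code — the source names OpsPlanner as the gate owner but describes no API, and the state names and version strings are invented:

```python
# Sketch of the pre-return verification gate (requirements 2 and 3 above).
# HostState, TESTED, and the version strings are illustrative assumptions.
from enum import Enum, auto

class HostState(Enum):
    IN_SERVICE = auto()
    UPGRADED_PENDING_VERIFY = auto()  # lower-layer upgrade done, not yet verified

# Tested (pinned upper, lower) combinations -- the compatibility matrix.
TESTED = {("cuda-12.4", "driver-535"), ("cuda-12.4", "driver-550")}

def pre_return_gate(pinned_upper: str, new_lower: str, state: HostState) -> HostState:
    """A host finishing a lower-layer upgrade may re-enter service only if its
    new lower stack is a tested combination with the currently pinned upper layer."""
    assert state is HostState.UPGRADED_PENDING_VERIFY
    if (pinned_upper, new_lower) in TESTED:
        return HostState.IN_SERVICE
    return HostState.UPGRADED_PENDING_VERIFY  # held out of service until tested
```

An untested combination leaves the host held out of service rather than silently admitted — the gate fails closed.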

Compared to lock-step upgrade

The pattern exists because lock-step upgrades don't scale:

"In smaller environments it is often possible to keep clusters in a consistent state and upgrade the whole cluster and all of its firmware and software components in the same maintenance window. Doing this in a large, diverse environment like Meta, however, would introduce big risks and be operationally infeasible."

Lock-step requires a cluster-wide maintenance window long enough to drain the whole cluster, which at Meta's scale is "operationally infeasible" — you'd effectively never have capacity.
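A back-of-envelope illustration of the capacity argument — the host counts below are invented for the example, not taken from the post:

```python
# Illustrative numbers only: capacity retained during an upgrade campaign
# under lock-step (whole cluster drained) vs sliding rollout (small window).
hosts = 10_000
sliding_window = 200  # hosts out of service at any instant in the sliding rollout

lockstep_available = 0                               # everything drained at once
sliding_available = (hosts - sliding_window) / hosts # rest stays in service

print(f"lock-step: {lockstep_available:.0%} of capacity during the window")
print(f"sliding:   {sliding_available:.0%} of capacity throughout the rollout")
```

The sliding rollout trades a longer total campaign for near-full capacity at every instant, which is the whole point of tolerating lower-layer drift.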
