CONCEPT

Hardware reliability at scale

Definition

Hardware reliability at scale is the property of a large hardware fleet (thousands or more GPUs / servers) that makes a single long-running job — a training run, a batch query, a streaming stage — feasible despite the fact that the probability of at least one hardware failure during the job window is a near-certainty.

The problem is not "make individual nodes more reliable" (a vendor problem). It is: given a non-negligible per-node failure rate, how do you arrange software, hardware, and operations so that the effective reliability experienced by the job is acceptable?
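To see why interruption is a near-certainty, note that independent per-node failures compound exponentially in fleet size. A minimal sketch, using a hypothetical per-node daily failure probability (the value is illustrative, not a figure from the source):

```python
def job_interruption_prob(num_nodes: int, p_node_daily: float, days: float) -> float:
    """Probability that at least one node fails during the job window,
    assuming independent failures (an optimistic assumption)."""
    p_one_node_survives = (1.0 - p_node_daily) ** days
    return 1.0 - p_one_node_survives ** num_nodes

# Even a very reliable node (0.01% daily failure probability) leaves a
# large cluster almost no chance of an uninterrupted 30-day run:
for n in (8, 1024, 16384):
    print(n, round(job_interruption_prob(n, 1e-4, 30), 3))
```

At 8 nodes the 30-day interruption probability is a few percent; at 16K nodes it is effectively 1.0, which is the regime the rest of this note is about.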

Meta's framing (2024-06)

From Meta's 2024-06-12 GenAI infrastructure post:

"As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. ... Ensuring that our hardware is reliable is important. We need to minimize the chances of a hardware failure interrupting a training job. This involves rigorous testing and quality control measures, and automation to quickly detect and remediate issues." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

Meta enumerates four simultaneously-stressed properties:

  1. Hardware reliability itself (minimise failure rate).
  2. Fast recovery on failure (reduce re-schedule overhead, fast training re-init).
  3. Efficient preservation of training state (checkpointing).
  4. Optimal connectivity between GPUs.

All four scale adversely with GPU count — they cannot be addressed independently.

The three operational axes Meta names

  • Rigorous testing + quality control — pre-deployment and in-service. Meta calls out monitoring and trend detection for repeat offenders.
  • Automation to quickly detect and remediate — you cannot have humans in the loop for failures at the rate a 24K-GPU cluster generates them.
  • Spare capacity to restart — "having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible." Unused fleet capacity is load-bearing, not waste.
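The second and third axes combine into a detect → drain → restart-on-spare loop. A minimal sketch; all names (`Fleet`, `remediate`, the node labels) are illustrative, not Meta's actual tooling:

```python
# Sketch of the automated remediation loop implied by the axes above:
# a failed node is drained to a repair queue and a spare is promoted so
# the job can restart at full size without waiting on a human.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Fleet:
    healthy: set = field(default_factory=set)
    spares: set = field(default_factory=set)
    drained: set = field(default_factory=set)

    def remediate(self, failed_node: str) -> Optional[str]:
        """Drain a failed node and promote a spare in its place."""
        if failed_node in self.healthy:
            self.healthy.remove(failed_node)
            self.drained.add(failed_node)    # hand off to the repair queue
        if self.spares:
            replacement = self.spares.pop()  # spare capacity is load-bearing
            self.healthy.add(replacement)
            return replacement
        return None  # no spares left: the job must shrink or wait

fleet = Fleet(healthy={"n0", "n1"}, spares={"s0"})
print(fleet.remediate("n1"))  # prints s0
```

The `return None` branch is the failure mode the "spare capacity" axis exists to avoid: with no spare, the restart blocks on physical repair.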

Early-life vs aged-fleet distribution

Meta names an empirical pattern:

"This failure mode is seen more in the early life and settles as the server ages."

Meta applies this specifically to GPUs falling off the PCIe bus and to hardware network-cable failures. A fleet is most failure-dense during its bring-up phase, which is precisely when you want to be training your frontier model on it. This creates a burn-in tension: deploy early and fail more, or wait out burn-in and fall behind the research frontier.

(See concepts/gpu-training-failure-modes for the specific modes.)
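The "settles as the server ages" pattern is a decreasing hazard rate, conventionally modeled with a Weibull distribution of shape < 1 (the front of the bathtub curve). A sketch; the shape and scale values are illustrative, not fitted to Meta's fleet:

```python
def weibull_hazard(t_days: float, shape: float = 0.7, scale_days: float = 365.0) -> float:
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam)^(k - 1).
    With shape k < 1 the hazard is highest early and declines with age,
    matching the early-life failure pattern Meta describes."""
    return (shape / scale_days) * (t_days / scale_days) ** (shape - 1)

# A week-old server vs. a year-old server under these toy parameters:
print(weibull_hazard(7) / weibull_hazard(365))  # early-life hazard is several x higher
```

This is why burn-in (running synthetic load before production) pays: it spends the high-hazard region of the curve on throwaway work instead of a frontier training run.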

Why software alone is insufficient

  • Silent-error modes exist — for example, a GPU that computes slightly wrong numbers will not stop a job but will ruin convergence. These cannot be detected by job-level liveness checks.
  • Fabric-level failures — e.g. a congested or flapping link between nodes — can degrade throughput without producing a failure event. Detection requires hardware telemetry.
  • Thermal throttling — a rack running hot will slow down; software sees a "random tail latency" that is actually a physics-of-cooling problem.
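Silent errors in particular require active screening rather than liveness checks: run a deterministic workload with a known-good "golden" answer and compare bit-exactly. A toy sketch of the idea; real screens use large GPU kernels (matmuls), and this stand-in kernel is entirely hypothetical:

```python
# Sketch of a silent-data-corruption (SDC) screen. A device that computes
# slightly wrong numbers passes every liveness check but fails this one.

def toy_kernel(seed: int, n: int = 10_000) -> float:
    """Deterministic arithmetic workload standing in for a GPU matmul."""
    x = float(seed)
    acc = 0.0
    for i in range(1, n + 1):
        x = (x * 1.000001 + i) % 97.0
        acc += x
    return acc

GOLDEN = toy_kernel(seed=42)  # computed once on a trusted device

def sdc_check(result: float, golden: float = GOLDEN) -> bool:
    """A bit-exact mismatch on a deterministic kernel flags a bad device."""
    return result == golden

assert sdc_check(toy_kernel(42))             # healthy device passes
assert not sdc_check(toy_kernel(42) + 1e-9)  # a tiny corruption is caught
```

The same "golden answer" pattern extends to the other two bullets: fabric and thermal problems need reference numbers (expected link bandwidth, expected step time) against which degradation can be measured, because nothing ever errors out.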

Meta names these implicitly: their 700 W air-cooled H100 build stresses cooling margins; their monitoring includes per-device DRAM/SRAM UCE rate thresholds.
