CONCEPT
Hardware reliability at scale¶
Definition¶
Hardware reliability at scale is the property of a large hardware fleet (thousands of GPUs / servers or more) that makes a single long-running job — a training run, a batch query, a streaming stage — feasible even though the probability of at least one hardware failure during the job window is near-certain.
The problem is not "make individual nodes more reliable" (a vendor problem) — it's: given a non-negligible per-node failure rate, how do you arrange software/hardware/operations so the effective reliability experienced by the job is acceptable?
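The "near-certainty" claim is just arithmetic. A minimal sketch, assuming independent node failures and an illustrative per-node failure probability (the numbers are mine, not Meta's):

```python
# Probability that a large job sees at least one hardware failure.
# Illustrative model: each node fails independently with probability
# p_node per day; the job spans n_nodes for `days` days.

def p_any_failure(p_node: float, n_nodes: int, days: float) -> float:
    """P(at least one failure) = 1 - (1 - p_node)^(n_nodes * days)."""
    return 1.0 - (1.0 - p_node) ** (n_nodes * days)

# 0.1%/day is harmless for one server...
single = p_any_failure(0.001, 1, 1)
# ...but effectively certain across a 3,000-node fleet (24K GPUs at
# 8 GPUs/node) over a two-week training run.
fleet = p_any_failure(0.001, 3000, 14)
print(f"single node, one day: {single:.4%}")
print(f"fleet, two weeks:     {fleet:.2%}")
```

The per-node rate barely matters past a point: the exponent `n_nodes * days` dominates, which is why the concept is framed around the job's effective reliability rather than the node's.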
Meta's framing (2024-06)¶
From Meta's 2024-06-12 GenAI infrastructure post:
"As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. ... Ensuring that our hardware is reliable is important. We need to minimize the chances of a hardware failure interrupting a training job. This involves rigorous testing and quality control measures, and automation to quickly detect and remediate issues." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Meta enumerates four properties that are stressed simultaneously at scale:
- Hardware reliability itself (minimise failure rate).
- Fast recovery on failure (reduce re-schedule overhead, fast training re-init).
- Efficient preservation of training state (checkpointing).
- Optimal connectivity between GPUs.
All four scale adversely with GPU count — they cannot be addressed independently.
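The checkpointing property has a classical quantitative handle: the Young/Daly approximation relates checkpoint cost and system MTBF to a near-optimal checkpoint interval. This is a standard result, not something from the Meta post; the numbers below are illustrative.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly first-order optimum: tau ~= sqrt(2 * C * MTBF),
    where C is the time to write one checkpoint and MTBF is the
    aggregate mean time between failures of the whole job."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative: a 5-minute checkpoint write on a fleet whose aggregate
# MTBF is 4 hours (one failure somewhere every 4 h) suggests
# checkpointing roughly every 49 minutes.
tau = young_daly_interval(300.0, 4 * 3600.0)
print(f"checkpoint every ~{tau / 60:.0f} min")
```

Note how the adverse scaling shows up: adding GPUs shrinks the aggregate MTBF, which shrinks the optimal interval, which raises the checkpointing overhead — the properties feed into each other.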
The three operational axes Meta names¶
- Rigorous testing + quality control — pre-deployment and in-service. Meta calls out monitoring and trend detection for repeat offenders.
- Automation to quickly detect and remediate — you cannot have humans in the loop for failures at the rate a 24K-GPU cluster generates them.
- Spare capacity to restart — "having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible." Unused fleet capacity is load-bearing, not waste.
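The "repeat offenders" trend detection from the first axis can be sketched as a simple windowed policy: count recent failures per host and drain (rather than reboot) a host that keeps failing. This is my illustration of the idea; the class name, window, and threshold are hypothetical, not Meta's.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class RepeatOffenderTracker:
    """Hypothetical sketch of repeat-offender detection.

    window_s and max_failures are illustrative policy knobs."""
    window_s: float = 7 * 24 * 3600.0   # look-back window: one week
    max_failures: int = 3               # failures before draining
    events: dict = field(default_factory=lambda: defaultdict(list))

    def record_failure(self, host: str, now_s: float) -> str:
        # Keep only failures inside the look-back window, then append.
        recent = [t for t in self.events[host] if now_s - t < self.window_s]
        recent.append(now_s)
        self.events[host] = recent
        # A repeat offender is pulled from the pool for diagnosis rather
        # than being rebooted back into the job again and again.
        return "drain" if len(recent) >= self.max_failures else "reboot"

tracker = RepeatOffenderTracker()
print(tracker.record_failure("host-42", 0.0))     # reboot
print(tracker.record_failure("host-42", 3600.0))  # reboot
print(tracker.record_failure("host-42", 7200.0))  # drain
```

The point of automating this is the rate argument above: at 24K-GPU scale the decision "reboot vs. drain" has to be made many times a day, with no human in the loop.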
Early-life vs aged-fleet distribution¶
Meta names an empirical pattern:
"This failure mode is seen more in the early life and settles as the server ages."
Applied specifically to GPUs falling off the PCIe bus and to hardware network-cable failures. A fleet is most failure-dense during its bring-up phase, which is precisely when you want to be training your frontier model on it. This creates a burn-in tension: deploy early (and fail more) vs. wait out burn-in (and fall behind the research frontier).
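This early-life pattern is the infant-mortality side of the classic bathtub curve. A standard way to model a decreasing failure rate is a Weibull hazard with shape k < 1; this is my illustration, not a model fitted to Meta's fleet, and the parameters are made up.

```python
def weibull_hazard(t_days: float, k: float = 0.7, lam: float = 365.0) -> float:
    """Weibull hazard h(t) = (k/lam) * (t/lam)**(k - 1).

    Shape k < 1 gives a monotonically decreasing hazard, i.e. infant
    mortality. k and lam here are illustrative, not fitted to any fleet."""
    return (k / lam) * (t_days / lam) ** (k - 1.0)

# Instantaneous failure rate in week 1 of bring-up vs. after a year
# in service: with these parameters, roughly 3x higher at bring-up.
ratio = weibull_hazard(7.0) / weibull_hazard(365.0)
print(f"bring-up vs. aged hazard ratio: {ratio:.2f}x")
```

Under this kind of model, "settles as the server ages" is exactly the hazard flattening out toward the bottom of the bathtub.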
(See concepts/gpu-training-failure-modes for the specific modes.)
Why software alone is insufficient¶
- Silent-error modes exist — for example, a GPU that computes slightly wrong numbers will not stop a job but will ruin convergence. These cannot be detected by job-level liveness checks.
- Fabric-level failures — e.g. a congested or flapping link between nodes — can degrade throughput without producing a failure event. Detection requires hardware telemetry.
- Thermal throttling — a rack running hot will slow down; software sees a "random tail latency" that is actually a physics-of-cooling problem.
Meta names these implicitly: their 700 W air-cooled H100 build stresses cooling margins, and their monitoring includes per-device DRAM/SRAM uncorrectable-error (UCE) rate thresholds.
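A per-device UCE rate threshold implies a sliding-window rate check in the telemetry pipeline. A minimal sketch of that shape, assuming a hypothetical threshold and window (neither is from the Meta post):

```python
from collections import deque

class UceRateMonitor:
    """Hypothetical sliding-window UCE rate check for one device.

    threshold_per_hour and window_s are illustrative policy knobs."""
    def __init__(self, threshold_per_hour: float = 2.0,
                 window_s: float = 3600.0):
        self.threshold = threshold_per_hour
        self.window_s = window_s
        self.events: deque = deque()

    def record_uce(self, now_s: float) -> bool:
        """Record one uncorrectable error; return True if the device
        now exceeds its rate threshold and should be flagged."""
        self.events.append(now_s)
        # Evict events that have aged out of the window.
        while self.events and now_s - self.events[0] > self.window_s:
            self.events.popleft()
        rate_per_hour = len(self.events) * 3600.0 / self.window_s
        return rate_per_hour > self.threshold

mon = UceRateMonitor()
print(mon.record_uce(0.0))    # False: 1 UCE/hour, under threshold
print(mon.record_uce(60.0))   # False: 2/hour, at but not over threshold
print(mon.record_uce(120.0))  # True: 3/hour exceeds threshold
```

This is the hardware-telemetry complement to job-level liveness checks: it catches devices that are degrading without ever crashing the job.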
Seen in¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — canonical wiki reference at 24K-GPU scale.
Related¶
- concepts/gpu-training-failure-modes — the specific failure modes this concept is the management-layer framing of.
- concepts/training-checkpoint — the state-preservation primitive hardware-reliability concerns motivate.
- concepts/blast-radius — related framing for the consequences of a failure; blast-radius focuses on containment, this concept on rate.
- concepts/tail-latency-at-scale — a sibling reliability-at-scale concept on the serving side.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband.