CONCEPT
Hardware reliability at scale¶
Definition¶
Hardware reliability at scale is the property of a large hardware fleet (thousands of GPUs / servers or more) that makes a single long-running job — a training run, a batch query, a streaming stage — feasible even though the probability of at least one hardware failure during the job window is near-certain.
The problem is not "make individual nodes more reliable" (a vendor problem) — it's: given a non-negligible per-node failure rate, how do you arrange software/hardware/operations so the effective reliability experienced by the job is acceptable?
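The "near-certainty" claim is just arithmetic. A minimal sketch, assuming independent node failures and an illustrative per-node failure probability (the numbers are mine, not Meta's):

```python
# Probability that a large job sees at least one hardware failure.
# Illustrative model: each node fails independently with probability
# p_node per day; the job spans n_nodes for `days` days.

def p_any_failure(p_node: float, n_nodes: int, days: float) -> float:
    """P(at least one failure) = 1 - (1 - p_node)^(n_nodes * days)."""
    return 1.0 - (1.0 - p_node) ** (n_nodes * days)

# 0.1%/day is harmless for one server...
single = p_any_failure(0.001, 1, 1)
# ...but effectively certain across a 3,000-node fleet (24K GPUs at
# 8 GPUs/node) over a two-week training run.
fleet = p_any_failure(0.001, 3000, 14)
print(f"single node, one day: {single:.4%}")
print(f"fleet, two weeks:     {fleet:.2%}")
```

The per-node rate barely matters past a point: the exponent `n_nodes * days` dominates, which is why the concept is framed around the job's effective reliability rather than the node's.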
Meta's framing (2024-06)¶
From Meta's 2024-06-12 GenAI infrastructure post:
"As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. ... Ensuring that our hardware is reliable is important. We need to minimize the chances of a hardware failure interrupting a training job. This involves rigorous testing and quality control measures, and automation to quickly detect and remediate issues." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Meta enumerates four properties that are stressed simultaneously at scale:
- Hardware reliability itself (minimise failure rate).
- Fast recovery on failure (reduce re-schedule overhead, fast training re-init).
- Efficient preservation of training state (checkpointing).
- Optimal connectivity between GPUs.
All four scale adversely with GPU count — they cannot be addressed independently.
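The checkpointing property has a classical quantitative handle: the Young/Daly approximation relates checkpoint cost and system MTBF to a near-optimal checkpoint interval. This is a standard result, not something from the Meta post; the numbers below are illustrative.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly first-order optimum: tau ~= sqrt(2 * C * MTBF),
    where C is the time to write one checkpoint and MTBF is the
    aggregate mean time between failures of the whole job."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative: a 5-minute checkpoint write on a fleet whose aggregate
# MTBF is 4 hours (one failure somewhere every 4 h) suggests
# checkpointing roughly every 49 minutes.
tau = young_daly_interval(300.0, 4 * 3600.0)
print(f"checkpoint every ~{tau / 60:.0f} min")
```

Note how the adverse scaling shows up: adding GPUs shrinks the aggregate MTBF, which shrinks the optimal interval, which raises the checkpointing overhead — the properties feed into each other.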
The three operational axes Meta names¶
- Rigorous testing + quality control — pre-deployment and in-service. Meta calls out monitoring and trend detection for repeat offenders.
- Automation to quickly detect and remediate — you cannot have humans in the loop for failures at the rate a 24K-GPU cluster generates them.
- Spare capacity to restart — "having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible." Unused fleet capacity is load-bearing, not waste.
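The "repeat offenders" trend detection from the first axis can be sketched as a simple windowed policy: count recent failures per host and drain (rather than reboot) a host that keeps failing. This is my illustration of the idea; the class name, window, and threshold are hypothetical, not Meta's.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class RepeatOffenderTracker:
    """Hypothetical sketch of repeat-offender detection.

    window_s and max_failures are illustrative policy knobs."""
    window_s: float = 7 * 24 * 3600.0   # look-back window: one week
    max_failures: int = 3               # failures before draining
    events: dict = field(default_factory=lambda: defaultdict(list))

    def record_failure(self, host: str, now_s: float) -> str:
        # Keep only failures inside the look-back window, then append.
        recent = [t for t in self.events[host] if now_s - t < self.window_s]
        recent.append(now_s)
        self.events[host] = recent
        # A repeat offender is pulled from the pool for diagnosis rather
        # than being rebooted back into the job again and again.
        return "drain" if len(recent) >= self.max_failures else "reboot"

tracker = RepeatOffenderTracker()
print(tracker.record_failure("host-42", 0.0))     # reboot
print(tracker.record_failure("host-42", 3600.0))  # reboot
print(tracker.record_failure("host-42", 7200.0))  # drain
```

The point of automating this is the rate argument above: at 24K-GPU scale the decision "reboot vs. drain" has to be made many times a day, with no human in the loop.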
Early-life vs aged-fleet distribution¶
Meta names an empirical pattern:
"This failure mode is seen more in the early life and settles as the server ages."
Applied specifically to GPUs falling off the PCIe bus and to hardware network-cable failures. A fleet is most failure-dense during its bring-up phase, which is precisely when you want to be training your frontier model on it. This creates a burn-in tension: deploy early (and fail more) vs. wait out burn-in (and fall behind the research frontier).
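This early-life pattern is the infant-mortality side of the classic bathtub curve. A standard way to model a decreasing failure rate is a Weibull hazard with shape k < 1; this is my illustration, not a model fitted to Meta's fleet, and the parameters are made up.

```python
def weibull_hazard(t_days: float, k: float = 0.7, lam: float = 365.0) -> float:
    """Weibull hazard h(t) = (k/lam) * (t/lam)**(k - 1).

    Shape k < 1 gives a monotonically decreasing hazard, i.e. infant
    mortality. k and lam here are illustrative, not fitted to any fleet."""
    return (k / lam) * (t_days / lam) ** (k - 1.0)

# Instantaneous failure rate in week 1 of bring-up vs. after a year
# in service: with these parameters, roughly 3x higher at bring-up.
ratio = weibull_hazard(7.0) / weibull_hazard(365.0)
print(f"bring-up vs. aged hazard ratio: {ratio:.2f}x")
```

Under this kind of model, "settles as the server ages" is exactly the hazard flattening out toward the bottom of the bathtub.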
(See concepts/gpu-training-failure-modes for the specific modes.)
Why software alone is insufficient¶
- Silent-error modes exist — for example, a GPU that computes slightly wrong numbers will not stop a job but will ruin convergence. These cannot be detected by job-level liveness checks.
- Fabric-level failures — e.g. a congested or flapping link between nodes — can degrade throughput without producing a failure event. Detection requires hardware telemetry.
- Thermal throttling — a rack running hot will slow down; software sees a "random tail latency" that is actually a physics-of-cooling problem.
Meta names these implicitly: their 700 W air-cooled H100 build stresses cooling margins, and their monitoring includes per-device DRAM/SRAM uncorrectable-error (UCE) rate thresholds.
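A per-device UCE rate threshold implies a sliding-window rate check in the telemetry pipeline. A minimal sketch of that shape, assuming a hypothetical threshold and window (neither is from the Meta post):

```python
from collections import deque

class UceRateMonitor:
    """Hypothetical sliding-window UCE rate check for one device.

    threshold_per_hour and window_s are illustrative policy knobs."""
    def __init__(self, threshold_per_hour: float = 2.0,
                 window_s: float = 3600.0):
        self.threshold = threshold_per_hour
        self.window_s = window_s
        self.events: deque = deque()

    def record_uce(self, now_s: float) -> bool:
        """Record one uncorrectable error; return True if the device
        now exceeds its rate threshold and should be flagged."""
        self.events.append(now_s)
        # Evict events that have aged out of the window.
        while self.events and now_s - self.events[0] > self.window_s:
            self.events.popleft()
        rate_per_hour = len(self.events) * 3600.0 / self.window_s
        return rate_per_hour > self.threshold

mon = UceRateMonitor()
print(mon.record_uce(0.0))    # False: 1 UCE/hour, under threshold
print(mon.record_uce(60.0))   # False: 2/hour, at but not over threshold
print(mon.record_uce(120.0))  # True: 3/hour exceeds threshold
```

This is the hardware-telemetry complement to job-level liveness checks: it catches devices that are degrading without ever crashing the job.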
Seen in¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — canonical wiki reference at 24K-GPU scale.
Related¶
- concepts/gpu-training-failure-modes — the specific failure modes this concept is the management-layer framing of.
- concepts/training-checkpoint — the state-preservation primitive hardware-reliability concerns motivate.
- concepts/blast-radius — related framing for the consequences of a failure; blast-radius focuses on containment, this concept on rate.
- concepts/tail-latency-at-scale — a sibling reliability-at-scale concept on the serving side.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband.