GPU training failure modes¶
Definition¶
The specific hardware-failure modes that dominate interruptions in large GPU training fleets. Meta's 2024-06-12 post enumerates the three most frequent modes observed at its 24K-GPU H100 scale.
The three most frequent modes (Meta 2024)¶
1. GPUs falling off PCIe¶
"GPUs are not detected by the host on PCIe. There are several reasons for this failure, but this failure mode is seen more in the early life and settles as the server ages."
The GPU literally disappears from the host's PCIe topology mid-run. Diagnostics: lspci and nvidia-smi no longer enumerate the device. Triggers: PCIe electrical issues, firmware, thermals, physical/mechanical seating. Early-life-biased, i.e. a burn-in problem.
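A minimal detection sketch in Python, assuming the node's expected GPU count is known. The lspci vendor-ID query (10de is NVIDIA's PCI vendor ID) is standard; the function names, expected count, and alert format are illustrative, not any real fleet tool:

```python
import subprocess

EXPECTED_GPUS_PER_NODE = 8  # illustrative; typical HGX H100 nodes carry 8 GPUs


def enumerate_nvidia_devices() -> list[str]:
    """List NVIDIA PCIe functions the host can still see (vendor ID 10de).

    Note: this matches every NVIDIA function, which on some boards
    includes non-GPU devices (e.g. audio controllers); a production
    check would filter by device class. Returns [] if lspci fails.
    """
    try:
        out = subprocess.run(
            ["lspci", "-d", "10de:"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line for line in out.splitlines() if line.strip()]


def pcie_dropout_alerts(visible: int,
                        expected: int = EXPECTED_GPUS_PER_NODE) -> list[str]:
    """Pure check: flag a node whose visible GPU count fell below expected."""
    if visible < expected:
        return [f"PCIe dropout suspected: {visible}/{expected} GPUs visible"]
    return []
```

Keeping the comparison pure (count in, alerts out) separates the flaky part (enumerating hardware) from the part worth unit-testing.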
2. DRAM and SRAM uncorrectable errors (UCE)¶
"Uncorrectable errors are common in memories, and we monitor and identify repeat offenders, track against thresholds, and initiate RMAs when error rates exceed vendor thresholds."
Correctable errors (CEs) are fixed in flight by ECC; uncorrectable errors propagate as bit flips into the computation and/or crash the process. Meta's operational answer: per-device tracking, repeat-offender detection, and RMA initiation above vendor-specified thresholds. This is a case where the software discipline is more interesting than the vendor spec: the right response to a UCE-prone device is not a retry but a proactive RMA before the error rate starts indicating silent-error risk.
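The tracking discipline can be sketched as follows. The class, method names, and threshold value are hypothetical, not Meta's actual tooling or a real vendor spec:

```python
from collections import defaultdict


class UceTracker:
    """Per-device uncorrectable-error tracking with threshold-based RMA.

    Sketch of the discipline described in the post: count UCEs per
    device, surface repeat offenders, and recommend an RMA once a
    (hypothetical) vendor threshold is exceeded.
    """

    def __init__(self, rma_threshold: int = 3):
        self.rma_threshold = rma_threshold
        self.uce_counts: dict[str, int] = defaultdict(int)

    def record_uce(self, device_id: str) -> None:
        self.uce_counts[device_id] += 1

    def repeat_offenders(self) -> list[str]:
        """Devices with more than one UCE: candidates for closer watching."""
        return sorted(d for d, n in self.uce_counts.items() if n > 1)

    def rma_candidates(self) -> list[str]:
        """Devices past threshold: initiate RMA rather than keep retrying."""
        return sorted(d for d, n in self.uce_counts.items()
                      if n >= self.rma_threshold)
```

The point of the structure is that the escalation decision is driven by accumulated per-device state, not by the most recent error in isolation.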
3. Hardware network cable failures¶
"In the general category of unreachable servers, these failures are also seen most often in the early life of the server."
Most commonly manifests as a server that becomes unreachable over the training fabric. Like GPUs falling off PCIe, this mode is early-life-biased: bad cable seating, wrong-gauge cables, connector damage.
The "early-life" pattern¶
Two of the three named modes are early-life-biased:
"This failure mode is seen more in the early life and settles as the server ages."
Operationally, this means fleet bring-up is the most failure-dense phase. At 24K-GPU scale, that is also the phase where you most want stability (to train frontier models on the new substrate). Consequences:
- Burn-in budget matters. Allocate calendar time for bring-up failures to surface before committing frontier research to the cluster.
- Early-life automation matters more than aged-fleet automation. If you only tune your remediation automation for aged-fleet patterns, you will pay the most during the highest-stakes phase.
- RMA relationships matter. Cable/PCIe/UCE issues are vendor-RMA paths, not software fixes.
Not enumerated (but also real)¶
Meta's post names three; other known GPU training failure modes:
- Silent data corruption (GPU computes wrong numbers without error) — the most insidious mode; only detected via redundant computation or downstream training-loss-spike signal.
- NVLink errors — intra-node link degradation impacting tensor-parallel throughput.
- InfiniBand / RoCE link flapping — inter-node fabric intermittents.
- Thermal throttling — not a hard failure, but a throughput cliff.
- Power-domain failures — a whole rack goes dark; all GPUs in the domain lose state simultaneously.
- Soft ECC scrubbing patterns — correctable-error trends that predict UCE probability.
None of these are enumerated in Meta's 2024-06-12 post; we flag them as known-elsewhere modes to revisit as more sources are ingested.
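Of the modes above, silent data corruption is the only one that produces no error signal at all, which is why redundant computation is the usual in-band detector: recompute a cheap reduction (e.g. a gradient-norm checksum) a second time and compare. A minimal sketch; the function name and tolerance are illustrative:

```python
def sdc_suspected(primary: float, shadow: float,
                  rel_tol: float = 1e-6) -> bool:
    """Flag possible silent data corruption via redundant computation.

    `primary` and `shadow` are the same cheap reduction computed twice,
    e.g. on two devices or two code paths. A relative comparison is
    used because the reductions involved span many orders of magnitude;
    the tolerance must exceed legitimate nondeterminism (e.g. float
    reduction-order differences) or the check will false-positive.
    """
    denom = max(abs(primary), abs(shadow), 1e-12)
    return abs(primary - shadow) / denom > rel_tol
```

The downstream training-loss-spike signal mentioned above is the same idea at coarser granularity: the "shadow" computation is the expectation that loss evolves smoothly.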
Operational response stack¶
Meta names a three-part operational response:
- Monitoring + threshold-based alerting (per-device UCE rate vs vendor thresholds).
- Automation for detection + remediation (human intervention cannot keep up with failure rates at 24K-GPU scale).
- Preventive action on early-warning signals ("we monitor failures and can sometimes take preventive measures to mitigate downtime").
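The three parts can be sketched as a single triage function over one early-warning signal, the correctable-error rate: alerting first, then a preventive drain so downtime is scheduled rather than suffered mid-job. Thresholds and names are illustrative, not Meta's or any vendor's:

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALERT = "alert"   # monitoring + threshold-based alerting
    DRAIN = "drain"   # preventive: remove the node before a hard failure


def triage(ce_rate_per_hour: float,
           alert_rate: float = 5.0,
           drain_rate: float = 20.0) -> Action:
    """Map a node's correctable-error rate to an operational action.

    A rising CE rate is treated as a leading indicator of UCE risk:
    it first trips an alert, then a preventive drain. Threshold
    values are hypothetical.
    """
    if ce_rate_per_hour >= drain_rate:
        return Action.DRAIN
    if ce_rate_per_hour >= alert_rate:
        return Action.ALERT
    return Action.OK
```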
Seen in¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — canonical wiki reference enumerating the three modes.
Related¶
- concepts/hardware-reliability-at-scale — the framing this concept concretises.
- concepts/training-checkpoint — the state-preservation primitive these failures motivate.
- systems/nvidia-h100 — the GPU substrate Meta's observed rates apply to.
- systems/grand-teton — the platform whose mechanical/thermal design interacts with failure rate.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the two 24K-GPU cluster builds, differing in fabric.