CONCEPT Cited by 1 source

GPU stall from storage

Definition

GPU stall from storage is the failure mode in distributed AI training where storage-fetch latency exceeds the time the GPU spends processing the current data batch, so the GPU idles while it waits for the next batch. Because modern training synchronizes state across hundreds of thousands of GPUs every N steps, a single stalled GPU holds up all of the others at the next synchronization point.
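The definition can be put into a few lines of arithmetic. This is a minimal sketch, not code from the cited source: the function names `stall_ms` and `synchronized_step_ms` and all latency values are hypothetical, and it assumes the simplest case, where each GPU prefetches exactly one batch and every step ends in a synchronization.

```python
def stall_ms(fetch_ms: float, compute_ms: float) -> float:
    """Idle time for one GPU in one step.

    Assumes the fetch of the next batch starts when compute on the
    current batch starts (one-batch prefetch).
    """
    return max(0.0, fetch_ms - compute_ms)


def synchronized_step_ms(fetch_ms_per_gpu: list[float], compute_ms: float) -> float:
    """Step time when every GPU waits for the slowest peer."""
    worst_stall = max(stall_ms(f, compute_ms) for f in fetch_ms_per_gpu)
    return compute_ms + worst_stall


# Illustrative numbers only: seven GPUs fetch in 80 ms, one takes 350 ms.
fetches = [80.0] * 7 + [350.0]
print(synchronized_step_ms(fetches, compute_ms=100.0))  # 350.0
```

With these made-up numbers, seven GPUs never stall on their own, yet the step takes 350 ms instead of 100 ms because of the one slow fetch.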

Mechanism

Dataloaders prefetch the next batch from storage while the GPU processes the current batch (compute/I/O overlap).
If the storage-fetch latency exceeds the GPU processing time, the prefetch buffer is empty when the GPU finishes, and the GPU stalls.
At each synchronized step, all GPUs wait for the slowest peer, so one stalled GPU cascades into a stall for the whole job.
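The first two steps above can be sketched as a small timeline simulation of one dataloader feeding one GPU. It is an illustrative model under stated assumptions, not the dataloader described in the source: `simulate_stalls` is a hypothetical helper, fetches are assumed to run one at a time in the background, the buffer is assumed to hold up to `depth` ready batches, and the initial warm-up fetch is not counted as a stall.

```python
def simulate_stalls(fetch_ms, compute_ms, depth=1):
    """Return total GPU idle time (ms) over a sequence of batch fetches.

    fetch_ms   -- per-batch storage-fetch latencies, in order
    compute_ms -- GPU processing time per batch (constant here)
    depth      -- number of ready batches the prefetch buffer can hold
    """
    ready = []      # time each batch lands in the prefetch buffer
    start = []      # time the GPU begins computing on each batch
    stalled = 0.0
    gpu_free = 0.0  # time the GPU finishes its current batch
    for i, latency in enumerate(fetch_ms):
        # A fetch starts once the previous fetch is done and a slot is free.
        prev_ready = ready[i - 1] if i > 0 else 0.0
        slot_free = start[i - depth] if i >= depth else 0.0
        ready.append(max(prev_ready, slot_free) + latency)
        # The GPU starts when it is free AND the batch is in the buffer.
        begin = max(ready[i], gpu_free)
        if i > 0:  # ignore the warm-up wait for the very first batch
            stalled += begin - gpu_free
        start.append(begin)
        gpu_free = begin + compute_ms
    return stalled


# Illustrative latencies: the third fetch is an outlier.
print(simulate_stalls([80, 80, 350, 80], compute_ms=100, depth=1))  # 250.0
print(simulate_stalls([80, 80, 80, 80], compute_ms=100, depth=1))   # 0.0
```

In this model a deeper buffer only helps to the extent that earlier fetches ran ahead of the GPU; a single fetch that far exceeds the compute time still drains the buffer.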

Why pMax Matters More Than p50

For AI storage, the critical SLO is bounded pMax (worst-case) latency, not average or median latency. A single outlier fetch, whether caused by a hot shard, a slow metadata lookup, or a laggard storage node, can stall the entire training job.

(Source: sources/2026-07-01-meta-ai-storage-blueprint-at-scale, "Why Latency Matters" section)
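A short calculation shows why the median hides the problem. This sketch is not from the cited source: `prob_step_sees_outlier` is a hypothetical helper, the latencies and the 0.1% outlier rate are made up, and fetches are assumed to be independent across GPUs, which real hot shards may violate.

```python
import statistics


def prob_step_sees_outlier(outlier_prob: float, num_gpus: int) -> float:
    """Chance that at least one of num_gpus independent fetches is an outlier."""
    return 1.0 - (1.0 - outlier_prob) ** num_gpus


# Illustrative fetch latencies: 999 fast fetches and one slow one.
fetch_ms = [80.0] * 999 + [350.0]
print(statistics.median(fetch_ms))  # 80.0, looks healthy
print(max(fetch_ms))                # 350.0, what a synchronized step waits for

# With a 0.1% outlier rate per fetch, most steps on 1024 GPUs hit an outlier.
print(prob_step_sees_outlier(0.001, 1024))  # about 0.64
```

Under these assumptions the per-fetch median is unchanged by the outlier, while roughly two out of three synchronized steps across 1024 GPUs would include at least one slow fetch.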

Seen in

sources/2026-07-01-meta-ai-storage-blueprint-at-scale — defining statement of the problem that motivated Meta's BLOB-storage re-architecture
sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — GPU-failure taxonomy including I/O stalls