How we keep GPUs reliable across Databricks AI
Summary
Databricks AI describes its approach to GPU fleet reliability at scale. At cluster sizes of 256–1024+ GPUs, hardware failure during a training run shifts from possible to expected: assuming a 1% annualized per-GPU failure event rate, a 30-day job has a ~19% probability of at least one failure at 256 GPUs and ~57% at 1024 GPUs. The post introduces gpu-monitor, a multi-stage health check and observability service that runs on every GPU node and covers the entire node lifecycle. The architecture comprises three complementary layers: active bootstrap checks at provisioning time, passive continuous checks during workload execution, and periodic multi-node active probes that validate inter-node fabric behaviour between workloads. The post also details how stress-testing with diverse cutting-edge workloads (RL training, agentic coding models, document intelligence) surfaces operational issues before they reach production.
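The compounding behind these percentages can be reproduced in a few lines. This is a minimal sketch, assuming GPU failures are independent and that a 1% annualized rate means each GPU survives a full year with probability 0.99; the function name `job_failure_probability` is illustrative and does not come from the post.

```python
def job_failure_probability(num_gpus: int, days: float, annual_rate: float = 0.01) -> float:
    """Probability of at least one GPU failure event during a job.

    Assumes independent failures and a constant per-GPU annualized rate.
    """
    gpu_years = num_gpus * days / 365.0
    # Each GPU-year is survived with probability (1 - annual_rate).
    return 1.0 - (1.0 - annual_rate) ** gpu_years


for n in (256, 1024):
    print(f"{n} GPUs x 30 days: {job_failure_probability(n, 30):.0%}")
```

Under these assumptions the results land at roughly 19% and 57%, matching the figures quoted above.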
Key Takeaways
- Failure probability compounds with scale. Assuming a conservative 1% annualized per-GPU failure event rate, a 256-GPU/30-day job has a ~19% probability of encountering at least one failure; at 1024 GPUs this climbs to ~57%. Failures are expected, not exceptional.
- Three categories of GPU failure. (a) Crashed jobs: these typically surface as an NCCL watchdog timeout that reveals nothing about the root cause; (b) Silent slowdowns: degraded throughput from thermal throttling, link downgrade, or memory bandwidth degradation, with no visible error signal; (c) Numerical corruption: ECC-uncorrectable faults propagate incorrect values, which show up as NaN losses or quality regressions discovered late.
- NCCL_IB_TIMEOUT is the real watchdog. The low-level InfiniBand transport timeout (~7 s effective, including retries) fires long before PyTorch's NCCL watchdog timeout (typically 10 minutes). A single port-down event longer than ~7 s tears down the NCCL connection irreversibly, so the run is committed to crashing before the watchdog can even react. Databricks tuned its defaults to be more resilient.
- Cumulative downtime, not flap count, is the meaningful signal. A single long InfiniBand port-down event can crash a multi-day run just as effectively as many short flaps. This insight drove a shift from flap-count thresholds to cumulative-downtime thresholds in health checks.
- Multi-stage health check architecture (gpu-monitor). Three layers: (a) Active bootstrap checks at provisioning and between workloads, which catch deterministic failures up front (GPU burn-in, peer connectivity, NCCL all-reduce correctness, ECC/HBM health, PCIe topology, DCGM L2 diagnostics); (b) Passive continuous checks on active nodes, which catch non-deterministic failures that emerge under load (NVLink lane status, clock throttling, RDMA port down, XID errors, PCIe AER errors, thermal gradients, NVSwitch errors); (c) Periodic multi-node active probes, which validate inter-node fabric behaviour that no single node can surface alone (NCCL collective bandwidth sweeps from 8 B to 2 GiB across node groups).
- Node quarantine pattern. Nodes failing any health check, bootstrap or continuous, are immediately removed from the fleet, quarantined, put through resets and thorough re-testing, then either returned to the fleet or permanently removed. Every workload starts on a node that just passed the full check suite.
- NCCL bandwidth pass criteria are payload-size-dependent. Small messages (KB range) exercise low-latency protocols (LL/LL128), with p95 latency as the pass criterion; medium messages (MB range) cross algorithm-switching thresholds (tree→ring); large messages (GB range) exercise chunking/pipelining, with BusBW as the criterion. Hardware issues often surface in only one code path.
- Stress testing with production-diverse workloads as a discovery mechanism. RL workloads (training + inference + reward in tight loops), agentic coding models (inference-heavy evaluations alongside training), and document intelligence (heavy image-based data loading) each stress the platform differently, surfacing fabric flakiness, thermal hotspots, and collective-communication edge cases.
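The ~7 s figure in the NCCL_IB_TIMEOUT takeaway is consistent with the way the InfiniBand transport timeout is usually computed: 4.096 µs × 2^timeout per attempt, repeated up to the retry count. The sketch below works through that arithmetic. It assumes `NCCL_IB_TIMEOUT=18` and `NCCL_IB_RETRY_CNT=7` as defaults (defaults have changed across NCCL releases, so check your version) and approximates the total window as per-attempt timeout times retry count. The value 22 is a hypothetical tuning, since the post does not publish Databricks' settings.

```python
IB_TIMEOUT_UNIT_US = 4.096  # InfiniBand timeout base unit, in microseconds


def ib_attempt_timeout_s(ib_timeout: int) -> float:
    """Per-attempt transport timeout in seconds: 4.096 us * 2**ib_timeout."""
    return IB_TIMEOUT_UNIT_US * (2 ** ib_timeout) / 1e6


def ib_effective_window_s(ib_timeout: int, retry_cnt: int) -> float:
    """Approximate time before the connection is torn down.

    Assumption: total window ~= per-attempt timeout * retry count.
    """
    return ib_attempt_timeout_s(ib_timeout) * retry_cnt


# Assumed defaults: roughly a 1 s attempt timeout and a window of about 7.5 s.
print(f"timeout=18, retries=7: {ib_effective_window_s(18, 7):.1f} s")

# Hypothetical tuning: each +1 on the exponent doubles the window.
print(f"timeout=22, retries=7: {ib_effective_window_s(22, 7):.1f} s")
```

Because the exponent doubles the window at each step, small changes to `NCCL_IB_TIMEOUT` have a large effect on how long a port can stay down before the run is lost.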
Operational Numbers
| Metric | Value |
|---|---|
| Annualized per-GPU failure event rate (conservative) | ~1% |
| Failure probability, 256 GPU × 30 days | ~19% |
| Failure probability, 1024 GPU × 30 days | ~57% |
| NCCL_IB_TIMEOUT effective default (with retries) | ~7 seconds |
| PyTorch NCCL watchdog timeout (typical) | 10 minutes |
| All-reduce BusBW at 2 GiB payload (representative) | 445 GB/s |
| All-reduce BusBW pass threshold at 2 GiB | ≥350 GB/s |
| All-reduce p95 latency at 1 KB payload | 120 µs |
| All-reduce p95 latency pass threshold at 1 KB | ≤250 µs |
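The two all-reduce thresholds in the table use different metrics and opposite comparison directions, which is what payload-size-dependent pass criteria look like in practice. Here is a minimal sketch of such a check. The names `Criterion`, `CRITERIA`, and `evaluate` are hypothetical, 1 KB is assumed to mean 1024 bytes, and only the two thresholds from the table are encoded; a real sweep would cover many more sizes.

```python
from dataclasses import dataclass

KIB = 1024
GIB = 1024 ** 3


@dataclass(frozen=True)
class Criterion:
    metric: str            # which measurement this payload size is judged on
    threshold: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Thresholds taken from the table; everything else here is illustrative.
CRITERIA = {
    1 * KIB: Criterion("p95_latency_us", 250.0, higher_is_better=False),
    2 * GIB: Criterion("busbw_gb_per_s", 350.0, higher_is_better=True),
}


def evaluate(measurements: dict[int, float]) -> dict[int, bool]:
    """Map payload size (bytes) to pass/fail for sizes that have a criterion."""
    return {
        size: CRITERIA[size].passes(value)
        for size, value in measurements.items()
        if size in CRITERIA
    }


# Representative values from the table pass both checks.
healthy = evaluate({1 * KIB: 120.0, 2 * GIB: 445.0})
# A made-up degraded link: latency looks fine, large-message bandwidth does not.
degraded = evaluate({1 * KIB: 120.0, 2 * GIB: 310.0})
print(healthy, degraded)
```

The degraded case illustrates why a single-size test is insufficient: a fault can leave small-message latency untouched while failing only the large-message path.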
Systems Mentioned
- gpu-monitor — Databricks' multi-stage health check and observability service running on every GPU node.
- NCCL — NVIDIA Collective Communications Library; the collective-operations layer underneath distributed training.
- DCGM — NVIDIA Data Center GPU Manager; provides throttle-reason telemetry and L2 diagnostics.
- NVLink / NVSwitch — NVIDIA intra-node GPU-to-GPU interconnect; lane status monitored continuously.
- InfiniBand — RDMA fabric for inter-node communication; port-down detection drives quarantine.
Concepts Extracted
- concepts/gpu-training-failure-modes — extended with Databricks' three-category taxonomy (crashed jobs, silent slowdowns, numerical corruption)
- concepts/hardware-reliability-at-scale — extended with Databricks' probabilistic framing (1% annualized, compound probability formula)
- concepts/nccl-ib-timeout — the low-level InfiniBand transport timeout that fires before the PyTorch watchdog
- concepts/cumulative-downtime-vs-flap-count — the operational insight that duration matters more than count for training-run impact
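The difference between the two signals in the cumulative-downtime concept can be shown with a small comparison. This is a sketch only: the event format (pairs of down and up timestamps in seconds), the function names, and both thresholds are assumptions for illustration, since the post does not publish its actual thresholds.

```python
from typing import Sequence, Tuple

PortDownEvent = Tuple[float, float]  # (down_at, up_at), in seconds


def flap_count_unhealthy(events: Sequence[PortDownEvent], max_flaps: int = 5) -> bool:
    """Older style of rule: flag the port only when it flaps too often."""
    return len(events) > max_flaps


def cumulative_downtime_unhealthy(
    events: Sequence[PortDownEvent], max_downtime_s: float = 5.0
) -> bool:
    """Flag the port when total time spent down exceeds a budget."""
    return sum(up - down for down, up in events) > max_downtime_s


# One 12 s outage: a single flap, but long enough to kill a run.
one_long_outage = [(100.0, 112.0)]
# Twelve 0.5 s flaps: 6 s of downtime in total.
many_short_flaps = [(float(t), t + 0.5) for t in range(0, 120, 10)]

print(flap_count_unhealthy(one_long_outage))           # False: the count rule misses it
print(cumulative_downtime_unhealthy(one_long_outage))  # True
print(cumulative_downtime_unhealthy(many_short_flaps)) # True
```

The cumulative rule flags both cases, while the count rule misses the single long outage entirely.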
Patterns Extracted
- patterns/multi-stage-health-check — bootstrap + continuous + periodic multi-node architecture
- patterns/node-quarantine-and-retest — immediate removal, quarantine, reset, re-test, return-or-remove
- patterns/stress-test-with-diverse-workloads — using production-diverse workloads as a failure-discovery mechanism
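The node-quarantine-and-retest pattern amounts to a small state machine. The sketch below is one possible shape for it, not Databricks' implementation: `NodeState`, `handle_check_failure`, the callback parameters, and the `max_attempts` limit are all hypothetical, and the post does not say how many reset-and-retest rounds a node gets before permanent removal.

```python
from enum import Enum
from typing import Callable


class NodeState(Enum):
    IN_FLEET = "in_fleet"
    QUARANTINED = "quarantined"
    REMOVED = "removed"


def handle_check_failure(
    reset_node: Callable[[], None],
    run_full_suite: Callable[[], bool],
    max_attempts: int = 2,  # assumed limit, not from the post
) -> NodeState:
    """Decide a node's fate after it fails a bootstrap or continuous check.

    The node is treated as quarantined for the duration of this call and is
    only returned to the fleet after passing the full check suite.
    """
    for _ in range(max_attempts):
        reset_node()
        if run_full_suite():
            return NodeState.IN_FLEET
    return NodeState.REMOVED


# Usage with stand-in callbacks: the suite fails once, then passes after a second reset.
results = iter([False, True])
final_state = handle_check_failure(lambda: None, lambda: next(results))
print(final_state)
```

Gating the return to the fleet on the full suite, rather than only the check that failed, is what gives the guarantee that every workload starts on a freshly validated node.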
Caveats
- This is part 1 of a series; deeper content on recovery mechanisms, RL-specific reliability, and larger-scale techniques is forthcoming.
- The 1% annualized failure rate is described as "conservative" — actual rates may be higher depending on GPU generation and workload intensity.
- The article focuses on training workloads; inference reliability is mentioned but not elaborated.
Source
- Original: https://www.databricks.com/blog/how-we-keep-gpus-reliable-across-databricks-ai
- Raw markdown: raw/databricks/2026-07-01-how-we-keep-gpus-reliable-across-databricks-ai-b76c700d.md
Related
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — Meta's earlier GPU reliability post (same failure taxonomy, different operational response)
- concepts/gpu-training-failure-modes
- concepts/hardware-reliability-at-scale
- patterns/multi-stage-health-check
- systems/gpu-monitor