DATABRICKS 2026-07-01

How we keep GPUs reliable across Databricks AI¶

Summary¶

Databricks AI describes their approach to GPU fleet reliability at scale. At cluster sizes of 256–1024+ GPUs, hardware failure during a training run transitions from possible to expected (~19% probability of at least one failure for a 256-GPU/30-day job; ~57% at 1024 GPUs). The post introduces gpu-monitor, a multi-stage health check and observability service that runs on every GPU node covering the entire node lifecycle. The architecture comprises three complementary layers: active bootstrap checks at provisioning time, passive continuous checks during workload execution, and periodic multi-node active probes validating inter-node fabric behaviour between workloads. The system also details how stress-testing with diverse cutting-edge workloads (RL training, agentic coding models, document intelligence) surfaces operational issues before they reach production.

Key Takeaways¶

Failure probability compounds with scale. Assuming a conservative 1% annualized per-GPU failure event rate, a 256-GPU/30-day job has ~19% probability of encountering at least one failure; at 1024 GPUs this climbs to ~57%. Failures are expected, not exceptional.
Three categories of GPU failure. (a) Crashed jobs — typically surfacing as NCCL watchdog timeout, opaque to root cause; (b) Silent slowdowns — degraded throughput from thermal throttling, link downgrade, or memory bandwidth degradation with no visible error signal; (c) Numerical corruption — ECC-uncorrectable faults propagating incorrect values, manifesting as NaN losses or quality regressions discovered late.
NCCL_IB_TIMEOUT is the real watchdog. The low-level InfiniBand transport timeout (~7s effective) fires long before PyTorch's NCCL watchdog timeout (typically 10min). A single port-down event exceeding ~7s tears the NCCL connection irreversibly — the run is committed to crash before the watchdog can even react. Databricks tuned their defaults to be more resilient.
Cumulative downtime, not flap count, is the meaningful signal. A single long InfiniBand port-down event can crash a multi-day run just as effectively as many short flaps. This insight drove a shift from flap-count thresholds to cumulative-downtime thresholds in health checks.
Multi-stage health check architecture (gpu-monitor). Three layers: (a) Active bootstrap checks at provisioning and between workloads — deterministic failures caught up front (GPU burn-in, peer connectivity, NCCL all-reduce correctness, ECC/HBM health, PCIe topology, DCGM L2 diagnostics); (b) Passive continuous checks on active nodes — non-deterministic failures that emerge under load (NVLink lane status, clock throttling, RDMA port down, XID errors, PCIe AER errors, thermal gradients, NVSwitch errors); (c) Periodic multi-node active probes — inter-node fabric validation that no single node can surface alone (NCCL collective bandwidth sweeps from 8B to 2GiB across node groups).
Node quarantine pattern. Nodes failing any health check — bootstrap or continuous — are immediately removed from the fleet, quarantined, put through resets and thorough re-testing, then either returned to the fleet or permanently removed. Every workload starts on a node that just passed the full check suite.
NCCL bandwidth pass criteria are payload-size-dependent. Small messages (KB range) exercise low-latency protocols (LL/LL128) with p95 latency as pass criterion; medium messages (MB range) cross algorithm-switching thresholds (tree→ring); large messages (GB range) exercise chunking/pipelining with BusBW as criterion. Hardware issues often surface in only one code path.
Stress testing with production-diverse workloads as discovery mechanism. RL workloads (training + inference + reward in tight loops), agentic coding models (inference-heavy evaluations alongside training), and document intelligence (heavy image-based data loading) each stress the platform differently, surfacing fabric flakiness, thermal hotspots, and collective-communication edge cases.

Operational Numbers¶

Metric	Value
Annualized per-GPU failure event rate (conservative)	~1%
Failure probability, 256 GPU × 30 days	~19%
Failure probability, 1024 GPU × 30 days	~57%
NCCL_IB_TIMEOUT effective default (with retries)	~7 seconds
PyTorch NCCL watchdog timeout (typical)	10 minutes
All-reduce BusBW at 2 GiB payload (representative)	445 GB/s
All-reduce BusBW pass threshold at 2 GiB	≥350 GB/s
All-reduce p95 latency at 1 KB payload	120 µs
All-reduce p95 latency pass threshold at 1 KB	≤250 µs

Systems Mentioned¶

gpu-monitor — Databricks' multi-stage health check and observability service running on every GPU node.
NCCL — NVIDIA Collective Communications Library; the collective-operations layer underneath distributed training.
DCGM — NVIDIA Data Center GPU Manager; provides throttle-reason telemetry and L2 diagnostics.
NVLink / NVSwitch — NVIDIA intra-node GPU-to-GPU interconnect; lane status monitored continuously.
InfiniBand — RDMA fabric for inter-node communication; port-down detection drives quarantine.

Concepts Extracted¶

concepts/gpu-training-failure-modes — extended with Databricks' three-category taxonomy (crashed jobs, silent slowdowns, numerical corruption)
concepts/hardware-reliability-at-scale — extended with Databricks' probabilistic framing (1% annualized, compound probability formula)
concepts/nccl-ib-timeout — the low-level InfiniBand transport timeout that fires before the PyTorch watchdog
concepts/cumulative-downtime-vs-flap-count — the operational insight that duration matters more than count for training-run impact

Patterns Extracted¶

patterns/multi-stage-health-check — bootstrap + continuous + periodic multi-node architecture
patterns/node-quarantine-and-retest — immediate removal, quarantine, reset, re-test, return-or-remove
patterns/stress-test-with-diverse-workloads — using production-diverse workloads as a failure-discovery mechanism

Caveats¶

This is part 1 of a series; deeper content on recovery mechanisms, RL-specific reliability, and larger-scale techniques is forthcoming.
The 1% annualized failure rate is described as "conservative" — actual rates may be higher depending on GPU generation and workload intensity.
The article focuses on training workloads; inference reliability is mentioned but not elaborated.

Source¶

sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — Meta's earlier GPU reliability post (same failure taxonomy, different operational response)
concepts/gpu-training-failure-modes
concepts/hardware-reliability-at-scale
patterns/multi-stage-health-check
systems/gpu-monitor