Skip to content

DATABRICKS 2026-07-01

Read original ↗

How we keep GPUs reliable across Databricks AI

Summary

Databricks AI describes their approach to GPU fleet reliability at scale. At cluster sizes of 256–1024+ GPUs, hardware failure during a training run transitions from possible to expected (~19% probability of at least one failure for a 256-GPU/30-day job; ~57% at 1024 GPUs). The post introduces gpu-monitor, a multi-stage health check and observability service that runs on every GPU node covering the entire node lifecycle. The architecture comprises three complementary layers: active bootstrap checks at provisioning time, passive continuous checks during workload execution, and periodic multi-node active probes validating inter-node fabric behaviour between workloads. The system also details how stress-testing with diverse cutting-edge workloads (RL training, agentic coding models, document intelligence) surfaces operational issues before they reach production.

Key Takeaways

  1. Failure probability compounds with scale. Assuming a conservative 1% annualized per-GPU failure event rate, a 256-GPU/30-day job has ~19% probability of encountering at least one failure; at 1024 GPUs this climbs to ~57%. Failures are expected, not exceptional.

  2. Three categories of GPU failure. (a) Crashed jobs — typically surfacing as NCCL watchdog timeout, opaque to root cause; (b) Silent slowdowns — degraded throughput from thermal throttling, link downgrade, or memory bandwidth degradation with no visible error signal; (c) Numerical corruption — ECC-uncorrectable faults propagating incorrect values, manifesting as NaN losses or quality regressions discovered late.

  3. NCCL_IB_TIMEOUT is the real watchdog. The low-level InfiniBand transport timeout (~7s effective) fires long before PyTorch's NCCL watchdog timeout (typically 10min). A single port-down event exceeding ~7s tears the NCCL connection irreversibly — the run is committed to crash before the watchdog can even react. Databricks tuned their defaults to be more resilient.

  4. Cumulative downtime, not flap count, is the meaningful signal. A single long InfiniBand port-down event can crash a multi-day run just as effectively as many short flaps. This insight drove a shift from flap-count thresholds to cumulative-downtime thresholds in health checks.

  5. Multi-stage health check architecture (gpu-monitor). Three layers: (a) Active bootstrap checks at provisioning and between workloads — deterministic failures caught up front (GPU burn-in, peer connectivity, NCCL all-reduce correctness, ECC/HBM health, PCIe topology, DCGM L2 diagnostics); (b) Passive continuous checks on active nodes — non-deterministic failures that emerge under load (NVLink lane status, clock throttling, RDMA port down, XID errors, PCIe AER errors, thermal gradients, NVSwitch errors); (c) Periodic multi-node active probes — inter-node fabric validation that no single node can surface alone (NCCL collective bandwidth sweeps from 8B to 2GiB across node groups).

  6. Node quarantine pattern. Nodes failing any health check — bootstrap or continuous — are immediately removed from the fleet, quarantined, put through resets and thorough re-testing, then either returned to the fleet or permanently removed. Every workload starts on a node that just passed the full check suite.

  7. NCCL bandwidth pass criteria are payload-size-dependent. Small messages (KB range) exercise low-latency protocols (LL/LL128) with p95 latency as pass criterion; medium messages (MB range) cross algorithm-switching thresholds (tree→ring); large messages (GB range) exercise chunking/pipelining with BusBW as criterion. Hardware issues often surface in only one code path.

  8. Stress testing with production-diverse workloads as discovery mechanism. RL workloads (training + inference + reward in tight loops), agentic coding models (inference-heavy evaluations alongside training), and document intelligence (heavy image-based data loading) each stress the platform differently, surfacing fabric flakiness, thermal hotspots, and collective-communication edge cases.

Operational Numbers

Metric Value
Annualized per-GPU failure event rate (conservative) ~1%
Failure probability, 256 GPU × 30 days ~19%
Failure probability, 1024 GPU × 30 days ~57%
NCCL_IB_TIMEOUT effective default (with retries) ~7 seconds
PyTorch NCCL watchdog timeout (typical) 10 minutes
All-reduce BusBW at 2 GiB payload (representative) 445 GB/s
All-reduce BusBW pass threshold at 2 GiB ≥350 GB/s
All-reduce p95 latency at 1 KB payload 120 µs
All-reduce p95 latency pass threshold at 1 KB ≤250 µs

Systems Mentioned

  • gpu-monitor — Databricks' multi-stage health check and observability service running on every GPU node.
  • NCCL — NVIDIA Collective Communications Library; the collective-operations layer underneath distributed training.
  • DCGM — NVIDIA Data Center GPU Manager; provides throttle-reason telemetry and L2 diagnostics.
  • NVLink / NVSwitch — NVIDIA intra-node GPU-to-GPU interconnect; lane status monitored continuously.
  • InfiniBand — RDMA fabric for inter-node communication; port-down detection drives quarantine.

Concepts Extracted

Patterns Extracted

Caveats

  • This is part 1 of a series; deeper content on recovery mechanisms, RL-specific reliability, and larger-scale techniques is forthcoming.
  • The 1% annualized failure rate is described as "conservative" — actual rates may be higher depending on GPU generation and workload intensity.
  • The article focuses on training workloads; inference reliability is mentioned but not elaborated.

Source

Last updated · 567 distilled / 1,685 read