gpu-monitor (Databricks)
Definition
gpu-monitor is Databricks AI's health check and observability service, which runs on every GPU node in the company's training fleet. It covers the entire node lifecycle, from first provisioning through active workload execution to idle periods between jobs, and implements the multi-stage health check pattern with three distinct verification layers.
Architecture
The system operates three layers of checks, each targeting different failure modes at different lifecycle stages:
1. Active bootstrap checks
Run at node provisioning and every time a node is cleaned between customer workloads. Catch deterministic failures that can be reliably surfaced by a targeted test:
- GPU compute speed and burn-in validation
- GPU-to-GPU peer connectivity across every pair (NVLink / NVSwitch health)
- Intra-node NCCL all-reduce correctness and bandwidth
- RDMA NIC bandwidth via host-local loopback
- ECC and HBM memory health (including row-remap headroom)
- PCIe topology and link integrity
- NVIDIA DCGM diagnostics at level 2
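
The gating logic implied by the list above can be sketched as follows. This is a minimal illustration, not gpu-monitor's actual code: the names `CheckResult`, `gpu_pairs`, `run_bootstrap_suite`, and `node_is_schedulable` are hypothetical, and each real check (burn-in, NCCL all-reduce, DCGM, and so on) is stood in for by a plain callable that returns True on success.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def gpu_pairs(num_gpus: int) -> List[Tuple[int, int]]:
    """Every unordered GPU pair on a node, e.g. for peer-connectivity tests."""
    return list(combinations(range(num_gpus), 2))


def run_bootstrap_suite(
    checks: List[Tuple[str, Callable[[], bool]]]
) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for name, check in checks:
        try:
            passed, detail = bool(check()), ""
        except Exception as exc:  # a crashing check must not pass silently
            passed, detail = False, repr(exc)
        results.append(CheckResult(name, passed, detail))
    return results


def node_is_schedulable(results: List[CheckResult]) -> bool:
    """A node is eligible only if at least one check ran and all passed."""
    return bool(results) and all(r.passed for r in results)
```

On an 8-GPU node, `gpu_pairs(8)` yields 28 pairs, which is why pairwise connectivity testing is practical at bootstrap time. Treating an empty result list as a failure is a deliberate choice in this sketch: a node on which no checks ran should not count as healthy.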
2. Passive continuous checks
Run on every active node during workload execution. Catch non-deterministic failure modes that only emerge under sustained workload pressure:
- NVLink lane status (any lane going down is flagged)
- GPU clock throttling reasons (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE)
- RDMA fabric port down detection (thresholded on cumulative downtime, not flap count)
- Critical XID errors from kernel logs
- PCIe AER uncorrectable errors
- Thermal gradient between GPU core and HBM
- NVSwitch error states
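
To illustrate the RDMA port rule above, the sketch below thresholds on cumulative downtime within an observation window instead of counting flaps. It is an assumption-laden example: the event format (timestamp, is_up), the function names, and the 30-second threshold are all hypothetical and not taken from the source.

```python
from typing import Iterable, Optional, Tuple


def cumulative_downtime(
    events: Iterable[Tuple[float, bool]],
    window_start: float,
    window_end: float,
) -> float:
    """Total seconds the port was down inside [window_start, window_end].

    `events` are (timestamp, is_up) transitions sorted by time. Timestamps
    outside the window are clamped to its edges.
    """
    down_since: Optional[float] = None
    total = 0.0
    for ts, is_up in events:
        ts = min(max(ts, window_start), window_end)
        if not is_up and down_since is None:
            down_since = ts
        elif is_up and down_since is not None:
            total += ts - down_since
            down_since = None
    if down_since is not None:  # still down when the window closes
        total += window_end - down_since
    return total


def port_is_unhealthy(
    events: Iterable[Tuple[float, bool]],
    window_start: float,
    window_end: float,
    max_downtime_s: float = 30.0,  # hypothetical threshold
) -> bool:
    """Flag the port when downtime, not the number of flaps, exceeds the limit."""
    return cumulative_downtime(events, window_start, window_end) > max_downtime_s
```

Under this rule, five one-second flaps (5 s of downtime in total) stay below the hypothetical threshold, while a single 40-second outage is flagged. A flap-count rule would rank those two cases the other way around.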
3. Periodic multi-node active checks
Run periodically on idle nodes between customer workloads. Validate inter-node fabric behaviour that no single node can surface on its own:
- NCCL collective bandwidth probes across node groups
- Sweeps payload sizes from 8 bytes to 2 GiB
- Different pass criteria per payload-size regime (latency for small, BusBW for large)
- Can be preempted when customer workloads need the nodes
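
The payload sweep and the per-regime pass criteria can be sketched as below. Several details are assumptions made for illustration only: the sweep steps by powers of two, the cutoff between the "small" and "large" regimes is 64 KiB, and the thresholds are supplied by the caller. The bus-bandwidth formula for all-reduce, algbw × 2(n−1)/n, follows the convention documented by the nccl-tests project.

```python
from typing import List


def sweep_sizes(min_bytes: int = 8, max_bytes: int = 2 * 1024**3) -> List[int]:
    """Payload sizes from min_bytes to max_bytes, doubling each step (assumed)."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def allreduce_bus_bw(size_bytes: int, seconds: float, ranks: int) -> float:
    """All-reduce bus bandwidth in bytes/s: algbw * 2 * (n - 1) / n."""
    alg_bw = size_bytes / seconds
    return alg_bw * 2 * (ranks - 1) / ranks


def probe_passes(
    size_bytes: int,
    seconds: float,
    ranks: int,
    *,
    max_latency_s: float,
    min_bus_bw: float,
    small_cutoff: int = 64 * 1024,  # hypothetical regime boundary
) -> bool:
    """Judge small payloads on latency and large payloads on bus bandwidth."""
    if size_bytes <= small_cutoff:
        return seconds <= max_latency_s
    return allreduce_bus_bw(size_bytes, seconds, ranks) >= min_bus_bw
```

With the default arguments, `sweep_sizes()` produces 29 sizes (2^3 through 2^31 bytes). Splitting the criteria matters because small messages are dominated by fixed latency, so their bandwidth figures say little about fabric health.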
Operational invariant
Every workload starts on a node that just passed the full active bootstrap check suite. Nodes failing any check layer are immediately removed from the fleet and enter the quarantine-and-retest cycle.
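
One way to picture this invariant is as a small node state machine in which the only path into a running workload goes through a passed bootstrap suite. The states, events, and transitions below are hypothetical names chosen for this sketch; the source does not describe how gpu-monitor or the scheduler encodes them.

```python
from enum import Enum


class NodeState(Enum):
    BOOTSTRAPPING = "bootstrapping"
    READY = "ready"
    RUNNING = "running"
    QUARANTINED = "quarantined"


class Event(Enum):
    BOOTSTRAP_PASSED = "bootstrap_passed"
    CHECK_FAILED = "check_failed"
    WORKLOAD_STARTED = "workload_started"
    WORKLOAD_FINISHED = "workload_finished"
    RETEST_REQUESTED = "retest_requested"


# Allowed transitions other than failure, which is handled separately.
_TRANSITIONS = {
    (NodeState.BOOTSTRAPPING, Event.BOOTSTRAP_PASSED): NodeState.READY,
    (NodeState.READY, Event.WORKLOAD_STARTED): NodeState.RUNNING,
    # After a workload, the node is cleaned and re-checked before reuse.
    (NodeState.RUNNING, Event.WORKLOAD_FINISHED): NodeState.BOOTSTRAPPING,
    (NodeState.QUARANTINED, Event.RETEST_REQUESTED): NodeState.BOOTSTRAPPING,
}


def next_state(state: NodeState, event: Event) -> NodeState:
    """Return the next state; a failed check in any layer quarantines the node."""
    if event is Event.CHECK_FAILED:
        return NodeState.QUARANTINED
    try:
        return _TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event.name}")
```

Because `RUNNING` is reachable only from `READY`, and `READY` only from a passed bootstrap, a quarantined node cannot pick up a workload without going through the full suite again.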
Seen in
- sources/2026-07-01-databricks-gpu-reliability — first public disclosure of the system architecture.
Related
- systems/dcgm — NVIDIA's GPU monitoring layer that gpu-monitor wraps and extends
- systems/nccl — the collective-communications library whose bandwidth gpu-monitor validates
- patterns/multi-stage-health-check — the architectural pattern gpu-monitor implements
- patterns/node-quarantine-and-retest — the operational response when checks fail
- concepts/gpu-training-failure-modes — the failure taxonomy gpu-monitor is designed to catch