gpu-monitor (Databricks)¶

Definition¶

gpu-monitor is Databricks AI's health check and observability service that runs on every GPU node in its training fleet. It covers the entire node lifecycle, from first provisioning through active workload execution to idle periods between jobs, and implements the multi-stage health check pattern with three distinct verification layers.

Architecture¶

The system operates three layers of checks, each targeting different failure modes at different lifecycle stages:

1. Active bootstrap checks¶

Run at node provisioning and every time a node is cleaned between customer workloads. Catch deterministic failures that can be reliably surfaced by a targeted test:

GPU compute speed and burn-in validation
GPU-to-GPU peer connectivity across every pair (NVLink / NVSwitch health)
Intra-node NCCL all-reduce correctness and bandwidth
RDMA NIC bandwidth via host-local loopback
ECC and HBM memory health (including row-remap headroom)
PCIe topology and link integrity
NVIDIA DCGM diagnostics at level 2
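
A minimal sketch of how a bootstrap suite like the one listed above could be driven. All names here (`CheckResult`, `gpu_pairs`, `run_bootstrap_suite`, `node_is_healthy`) are hypothetical and are not taken from gpu-monitor. The individual checks are stand-in callables. The sketch only illustrates the all-pairs enumeration for peer connectivity and the rule that a node passes only if every check passes.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def gpu_pairs(num_gpus: int) -> List[Tuple[int, int]]:
    """Every unordered GPU pair on the node, for peer-connectivity testing."""
    return list(combinations(range(num_gpus), 2))


def run_bootstrap_suite(
    checks: List[Tuple[str, Callable[[], bool]]]
) -> List[CheckResult]:
    """Run every check and record its outcome; an exception counts as a failure."""
    results = []
    for name, check in checks:
        try:
            results.append(CheckResult(name, bool(check())))
        except Exception as exc:  # a crashing check must not mask later checks
            results.append(CheckResult(name, False, repr(exc)))
    return results


def node_is_healthy(results: List[CheckResult]) -> bool:
    """The node passes only if every single check passed."""
    return all(r.passed for r in results)
```

On an 8-GPU node, `gpu_pairs(8)` yields 28 pairs, which is why an exhaustive peer-connectivity test is practical at bootstrap time.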

2. Passive continuous checks¶

Run on every active node during workload execution. Catch non-deterministic failure modes that only emerge under sustained workload pressure:

NVLink lane status (any lane going down is flagged)
GPU clock throttling reasons (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE)
RDMA fabric port down detection (thresholded on cumulative downtime, not flap count)
Critical XID errors from kernel logs
PCIe AER uncorrectable errors
Thermal gradient between GPU core and HBM
NVSwitch error states
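
The distinction between cumulative downtime and flap count in the port-down check can be illustrated with a short sketch. The event format, function names, and threshold value below are assumptions made for illustration and do not come from gpu-monitor. The idea is that many very short flaps may add up to little downtime, while a single long outage should trip the check.

```python
from typing import List, Tuple

# Each event is (timestamp_seconds, port_is_up), in chronological order.
PortEvent = Tuple[float, bool]


def cumulative_downtime(
    events: List[PortEvent], window_start: float, window_end: float
) -> float:
    """Total seconds the port was down within [window_start, window_end].

    The port is assumed to be up until the first event says otherwise.
    """
    down_since = None
    total = 0.0
    for ts, is_up in events:
        if ts >= window_end:
            break
        if not is_up and down_since is None:
            # Clamp outages that began before the window to the window start.
            down_since = max(ts, window_start)
        elif is_up and down_since is not None:
            if ts > window_start:
                total += ts - down_since
            down_since = None
    if down_since is not None:
        # Port is still down at the end of the window.
        total += window_end - down_since
    return total


def port_unhealthy(
    events: List[PortEvent],
    window_start: float,
    window_end: float,
    max_down_seconds: float,
) -> bool:
    """Flag the port when cumulative downtime exceeds the (hypothetical) threshold."""
    return cumulative_downtime(events, window_start, window_end) > max_down_seconds
```

With this rule, three flaps of 0.1 s each stay under a 5 s threshold, while one 10 s outage exceeds it, even though the flap count is lower.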

3. Periodic multi-node active checks¶

Run periodically on idle nodes between customer workloads. Validate inter-node fabric behaviour that no single node can surface on its own:

NCCL collective bandwidth probes across node groups
Probes sweep payload sizes from 8 bytes to 2 GiB
Pass criteria differ per payload-size regime (latency for small payloads, BusBW for large payloads)
Probes can be preempted when customer workloads need the nodes
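
The payload sweep and the per-regime pass criteria can be sketched as follows. The source gives only the sweep range (8 bytes to 2 GiB); the power-of-two step, the small/large cutoff, and all threshold parameters here are assumptions chosen for illustration. The bus-bandwidth calculation follows the convention used by the nccl-tests benchmarks for all-reduce, where BusBW is the algorithm bandwidth scaled by 2(n-1)/n for n ranks.

```python
from typing import List

GIB = 1 << 30


def sweep_sizes(min_bytes: int = 8, max_bytes: int = 2 * GIB) -> List[int]:
    """Payload sizes from min_bytes to max_bytes inclusive, doubling each step."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def allreduce_bus_bw(size_bytes: int, seconds: float, num_ranks: int) -> float:
    """All-reduce bus bandwidth in bytes/s: algbw * 2 * (n - 1) / n."""
    alg_bw = size_bytes / seconds
    return alg_bw * 2 * (num_ranks - 1) / num_ranks


def probe_passes(
    size_bytes: int,
    seconds: float,
    num_ranks: int,
    *,
    small_cutoff_bytes: int,   # hypothetical boundary between the two regimes
    max_latency_s: float,      # hypothetical latency limit for small payloads
    min_bus_bw: float,         # hypothetical BusBW floor (bytes/s) for large payloads
) -> bool:
    """Judge small payloads on latency and large payloads on bus bandwidth."""
    if size_bytes <= small_cutoff_bytes:
        return seconds <= max_latency_s
    return allreduce_bus_bw(size_bytes, seconds, num_ranks) >= min_bus_bw
```

Splitting the criteria this way reflects that small messages are dominated by fixed per-operation latency, so a bandwidth figure for an 8-byte payload says little about fabric health.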

Operational invariant¶

Every workload starts on a node that has just passed the full active bootstrap check suite. A node that fails a check in any layer is immediately removed from the fleet and enters the quarantine-and-retest cycle.
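
One way to picture this invariant is as a small state machine in which a workload can only start from a state reached by passing the bootstrap suite. The states, class, and method names below are hypothetical, and the sketch assumes that a retest means re-running the bootstrap suite; the source does not describe how the retest is performed.

```python
from enum import Enum


class NodeState(Enum):
    BOOTSTRAPPING = "bootstrapping"
    READY = "ready"
    RUNNING = "running"
    QUARANTINED = "quarantined"


class Node:
    """Hypothetical lifecycle model; not gpu-monitor's actual implementation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = NodeState.BOOTSTRAPPING

    def bootstrap_finished(self, passed: bool) -> None:
        # Valid both for fresh or cleaned nodes and for quarantined nodes under retest.
        if self.state not in (NodeState.BOOTSTRAPPING, NodeState.QUARANTINED):
            raise RuntimeError(f"{self.name}: no bootstrap in progress")
        self.state = NodeState.READY if passed else NodeState.QUARANTINED

    def start_workload(self) -> None:
        # The invariant: only a node that just passed bootstrap may take work.
        if self.state is not NodeState.READY:
            raise RuntimeError(f"{self.name}: not ready, state={self.state.value}")
        self.state = NodeState.RUNNING

    def workload_finished(self) -> None:
        # Cleaning between workloads re-runs the bootstrap checks.
        if self.state is not NodeState.RUNNING:
            raise RuntimeError(f"{self.name}: no workload running")
        self.state = NodeState.BOOTSTRAPPING

    def check_failed(self) -> None:
        # A failure in any layer removes the node from service.
        self.state = NodeState.QUARANTINED
```

Because `start_workload` is only reachable from `READY`, and `READY` is only reachable through a passing `bootstrap_finished`, the model cannot schedule work on a node that skipped or failed its bootstrap checks.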

Seen in¶

sources/2026-07-01-databricks-gpu-reliability — first public disclosure of the system architecture.

Related¶

systems/dcgm — NVIDIA's GPU monitoring layer that gpu-monitor wraps and extends
systems/nccl — the collective-communications library whose bandwidth gpu-monitor validates
patterns/multi-stage-health-check — the architectural pattern gpu-monitor implements
patterns/node-quarantine-and-retest — the operational response when checks fail
concepts/gpu-training-failure-modes — the failure taxonomy gpu-monitor is designed to catch