SYSTEM Cited by 1 source

gpu-monitor (Databricks)

Definition

gpu-monitor is Databricks AI's multi-stage health check and observability service. It runs on every GPU node in the company's training fleet and covers the entire node lifecycle: first provisioning, active workload execution, and idle periods between jobs. It implements the multi-stage health check pattern as three distinct verification layers.

Architecture

The system operates three layers of checks, each targeting different failure modes at different lifecycle stages:

1. Active bootstrap checks

Run at node provisioning and every time a node is cleaned between customer workloads. Catch deterministic failures that can be reliably surfaced by a targeted test:

  • GPU compute speed and burn-in validation
  • GPU-to-GPU peer connectivity across every pair (NVLink / NVSwitch health)
  • Intra-node NCCL all-reduce correctness and bandwidth
  • RDMA NIC bandwidth via host-local loopback
  • ECC and HBM memory health (including row-remap headroom)
  • PCIe topology and link integrity
  • NVIDIA DCGM diagnostics at level 2
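The all-or-nothing gating that a bootstrap suite implies can be sketched as follows. This is a minimal illustration, not gpu-monitor's actual code: `CheckResult`, `peer_connectivity_check`, `run_bootstrap_suite`, and `node_is_schedulable` are hypothetical names, and the `reachable` set stands in for the output of a real GPU peer-to-peer probe.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Set, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def peer_connectivity_check(
    gpu_count: int, reachable: Set[Tuple[int, int]]
) -> CheckResult:
    """Pass only if every unordered GPU pair is reported reachable.

    `reachable` is a hypothetical input: the set of (i, j) pairs with
    i < j that a real peer-to-peer probe found healthy.
    """
    missing = [p for p in combinations(range(gpu_count), 2) if p not in reachable]
    detail = f"unreachable pairs: {missing}" if missing else ""
    return CheckResult("gpu_peer_connectivity", not missing, detail)


def run_bootstrap_suite(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check without short-circuiting, so a single pass
    reports all deterministic failures found on the node."""
    return [check() for check in checks]


def node_is_schedulable(results: List[CheckResult]) -> bool:
    """A node may take workloads only if every bootstrap check passed."""
    return all(r.passed for r in results)


# Usage: a 4-GPU node, once fully connected and once with pair (1, 3) broken.
all_pairs = set(combinations(range(4), 2))
healthy = run_bootstrap_suite([lambda: peer_connectivity_check(4, all_pairs)])
degraded = run_bootstrap_suite(
    [lambda: peer_connectivity_check(4, all_pairs - {(1, 3)})]
)
```

Collecting every result before deciding, rather than stopping at the first failure, is a design choice of this sketch; it gives a complete failure report for the node in one run.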

2. Passive continuous checks

Run on every active node during workload execution. Catch non-deterministic failure modes that only emerge under sustained workload pressure:

  • NVLink lane status (any lane going down is flagged)
  • GPU clock throttling reasons (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE)
  • RDMA fabric port down detection (thresholded on cumulative downtime, not flap count)
  • Critical XID errors from kernel logs
  • PCIe AER uncorrectable errors
  • Thermal gradient between GPU core and HBM
  • NVSwitch error states
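To illustrate why thresholding on cumulative downtime differs from counting flaps, here is a small sketch. The event format, the function names, and the 10-second threshold are assumptions made for the example; the source does not state the actual threshold or data model.

```python
from typing import List, Optional, Tuple

# Hypothetical threshold: flag a port that was down for more than
# 10 seconds in total within the observation window.
DOWNTIME_THRESHOLD_S = 10.0


def cumulative_downtime(
    events: List[Tuple[float, str]], now: float, window_s: float
) -> float:
    """Total seconds the port was down inside [now - window_s, now].

    `events` is a time-ordered list of (timestamp, state) transitions,
    where state is "up" or "down" (an assumed format).
    """
    window_start = now - window_s
    down_since: Optional[float] = None
    total = 0.0
    for ts, state in events:
        if state == "down" and down_since is None:
            down_since = ts
        elif state == "up" and down_since is not None:
            # Clip the outage to the observation window before adding it.
            total += max(0.0, min(ts, now) - max(down_since, window_start))
            down_since = None
    if down_since is not None:
        # The port is still down: count the outage up to `now`.
        total += max(0.0, now - max(down_since, window_start))
    return total


def port_is_unhealthy(events: List[Tuple[float, str]], now: float, window_s: float) -> bool:
    return cumulative_downtime(events, now, window_s) > DOWNTIME_THRESHOLD_S


# Ten brief flaps of 0.5 s each (5 s down in total) ...
flapping = []
for i in range(10):
    flapping += [(10.0 * i, "down"), (10.0 * i + 0.5, "up")]

# ... versus a single 30 s outage.
long_outage = [(50.0, "down"), (80.0, "up")]
```

Under these assumptions the flapping port has ten down events but stays below the threshold, while the port with one long outage is flagged, which is the opposite of what a flap counter would report.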

3. Periodic multi-node active checks

Run periodically on idle nodes between customer workloads. Validate inter-node fabric behaviour that no single node can surface on its own:

  • NCCL collective bandwidth probes across node groups
  • Payload-size sweeps from 8 bytes to 2 GiB
  • Separate pass criteria per payload-size regime (latency for small payloads, bus bandwidth (BusBW) for large ones)
  • Preemptible whenever customer workloads need the nodes
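A payload sweep with regime-dependent pass criteria might look like the sketch below. The power-of-two step, the 64 KiB cutoff between regimes, and both thresholds are assumptions chosen for illustration. The bus-bandwidth correction factor 2(n-1)/n for all-reduce follows the convention used by the nccl-tests benchmarks.

```python
from typing import List

GIB = 1024 ** 3

# Hypothetical values, chosen only to make the example concrete.
SMALL_PAYLOAD_CUTOFF = 64 * 1024   # bytes; below this, judge by latency
MAX_SMALL_LATENCY_S = 50e-6        # latency ceiling for small payloads
MIN_LARGE_BUSBW_GBPS = 80.0        # bus-bandwidth floor for large payloads


def payload_sizes(min_bytes: int = 8, max_bytes: int = 2 * GIB) -> List[int]:
    """Power-of-two sweep from min_bytes to max_bytes inclusive."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def allreduce_busbw_gbps(size_bytes: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce.

    algbw = bytes / time; busbw = algbw * 2 * (n - 1) / n, which makes
    the figure comparable across different rank counts.
    """
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks


def probe_passes(size_bytes: int, seconds: float, n_ranks: int) -> bool:
    """Apply the latency criterion to small payloads and the
    bus-bandwidth criterion to large ones."""
    if size_bytes < SMALL_PAYLOAD_CUTOFF:
        return seconds <= MAX_SMALL_LATENCY_S
    return allreduce_busbw_gbps(size_bytes, seconds, n_ranks) >= MIN_LARGE_BUSBW_GBPS
```

For example, with 16 ranks a 1 GiB all-reduce that completes in 0.02 s yields roughly 100 GB/s of bus bandwidth and passes under these assumed thresholds, while the same payload taking 0.05 s (about 40 GB/s) fails.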

Operational invariant

Every workload starts on a node that has just passed the full active bootstrap check suite. A node that fails a check in any layer is immediately removed from the fleet and enters the quarantine-and-retest cycle.
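One way to picture this invariant is as a small state machine in which a node can only reach the running state by way of a passed bootstrap. The state and event names below are hypothetical; the source does not describe how the quarantine-and-retest cycle is implemented.

```python
from enum import Enum


class NodeState(Enum):
    BOOTSTRAPPING = "bootstrapping"   # active bootstrap suite in progress
    READY = "ready"                   # passed bootstrap, idle
    RUNNING = "running"               # executing a workload
    QUARANTINED = "quarantined"       # removed from the fleet


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (NodeState.BOOTSTRAPPING, "bootstrap_passed"): NodeState.READY,
    (NodeState.BOOTSTRAPPING, "bootstrap_failed"): NodeState.QUARANTINED,
    (NodeState.READY, "workload_assigned"): NodeState.RUNNING,
    (NodeState.READY, "check_failed"): NodeState.QUARANTINED,
    (NodeState.RUNNING, "check_failed"): NodeState.QUARANTINED,
    (NodeState.RUNNING, "workload_finished"): NodeState.BOOTSTRAPPING,
    (NodeState.QUARANTINED, "retest"): NodeState.BOOTSTRAPPING,
}


def next_state(state: NodeState, event: str) -> NodeState:
    """Return the next state, rejecting any transition not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

In this sketch RUNNING is reachable only from READY, and READY only from a passed bootstrap, so a quarantined node cannot receive a workload without being retested first.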
