SYSTEM Cited by 1 source
DCGM
Definition
DCGM (Data Center GPU Manager) is NVIDIA's monitoring and management framework for data-centre GPUs. It provides GPU health monitoring, diagnostics, and telemetry, including throttle-reason signals (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE) that indicate degraded GPU operation.
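As a rough illustration of how the throttle-reason signals can be consumed, the sketch below decodes an integer bitmask into the three hardware reasons named above. It assumes the reasons arrive as a bitmask and that the bit positions match NVML's clock throttle reason constants as commonly documented; the table `HW_THROTTLE_BITS` and the helpers `decode_hw_throttle` and `is_degraded` are hypothetical names, not part of DCGM.

```python
# Assumed bit values, taken from NVML's clock throttle reason constants.
# Verify them against the headers shipped with your driver / DCGM version.
HW_THROTTLE_BITS = {
    "HW_SLOWDOWN": 0x08,
    "HW_THERMAL_SLOWDOWN": 0x40,
    "HW_POWER_BRAKE": 0x80,
}


def decode_hw_throttle(mask: int) -> list[str]:
    """Return the hardware throttle reasons set in `mask`, in table order."""
    return [name for name, bit in HW_THROTTLE_BITS.items() if mask & bit]


def is_degraded(mask: int) -> bool:
    """True if any of the three hardware throttle reasons is active."""
    return bool(decode_hw_throttle(mask))
```

A mask with none of these bits set decodes to an empty list, so other throttle reasons (for example an idle GPU) are not treated as degradation in this sketch.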
Role in fleet health checking
Databricks' gpu-monitor runs DCGM diagnostics at level 2 as part of its active bootstrap check suite. DCGM throttle-reason telemetry also feeds the passive continuous monitoring layer, which detects silent slowdowns during workload execution.
(Source: sources/2026-07-01-databricks-gpu-reliability)
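The active bootstrap check can be pictured with a minimal sketch that shells out to the `dcgmi` command-line tool. This is not gpu-monitor's implementation, which the source does not describe. It assumes that "level 2" corresponds to `dcgmi diag -r 2`, that `-j` requests JSON output, that `dcgmi` is on `PATH` with a reachable host engine, and that a nonzero exit code means the check failed. The function names and the timeout are hypothetical.

```python
import subprocess


def build_diag_command(level: int = 2, as_json: bool = True) -> list[str]:
    """Build the argv for a DCGM diagnostic run at the given level."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unsupported diagnostic level: {level}")
    cmd = ["dcgmi", "diag", "-r", str(level)]
    if as_json:
        cmd.append("-j")  # assumed flag for JSON output
    return cmd


def run_bootstrap_diag(level: int = 2, timeout_s: int = 900) -> bool:
    """Run the diagnostic and report pass/fail.

    Assumption: exit code 0 means the diagnostic passed. A production
    check would also parse the output instead of trusting the exit code.
    """
    try:
        proc = subprocess.run(
            build_diag_command(level),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Missing binary or a hung diagnostic is treated as a failed check.
        return False
    return proc.returncode == 0
```

Separating command construction from execution keeps the first function testable on machines without a GPU or a DCGM installation.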
Seen in
- sources/2026-07-01-databricks-gpu-reliability — cited as part of active bootstrap checks (L2 diagnostics) and passive continuous checks (throttle reasons).
Related
- systems/gpu-monitor — the higher-level system that wraps DCGM
- concepts/gpu-training-failure-modes — the failure modes DCGM signals help detect (silent slowdowns)