DCGM

Definition

DCGM (NVIDIA Data Center GPU Manager) is NVIDIA's monitoring and management framework for data-centre GPUs. It provides GPU health monitoring, diagnostics, and telemetry. The telemetry includes clock throttle-reason signals (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE), each of which indicates that the hardware has reduced GPU clocks and the GPU is operating in a degraded state.
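
As an illustration of how these throttle-reason signals might be consumed, the following minimal sketch decodes a throttle-reason bitmask into the three hardware reasons named above. It assumes the reasons arrive as a single integer bitmask and that the bit values match NVML's clocks throttle reason constants (0x08, 0x40, 0x80); both assumptions should be checked against the installed DCGM and NVML headers. The names `HW_THROTTLE_BITS` and `hardware_throttle_reasons` are hypothetical and not part of any DCGM API.

```python
from typing import List

# Assumed bit values, mirroring NVML's clocks throttle reason constants.
# Verify against the headers shipped with your driver before relying on them.
HW_THROTTLE_BITS = {
    "HW_SLOWDOWN": 0x08,          # hardware slowdown engaged
    "HW_THERMAL_SLOWDOWN": 0x40,  # slowdown caused by temperature
    "HW_POWER_BRAKE": 0x80,       # slowdown caused by external power brake
}


def hardware_throttle_reasons(mask: int) -> List[str]:
    """Return the names of the hardware throttle reasons set in `mask`."""
    return [name for name, bit in HW_THROTTLE_BITS.items() if mask & bit]


def is_degraded(mask: int) -> bool:
    """True if any of the three hardware throttle bits is set."""
    return bool(hardware_throttle_reasons(mask))
```

Under these assumptions, a mask of `0x48` decodes to `HW_SLOWDOWN` and `HW_THERMAL_SLOWDOWN`, while a mask of `0x01` (idle) is not treated as degraded.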

Role in fleet health checking

Databricks' gpu-monitor incorporates DCGM diagnostics at level 2 as part of its active bootstrap check suite. DCGM throttle-reason telemetry also feeds into the passive continuous monitoring layer for detecting silent slowdowns during workload execution.
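
A bootstrap check of this kind could be sketched as a thin wrapper around the `dcgmi diag -r 2` command, which runs DCGM diagnostics at run level 2. This is not Databricks' implementation, which the source does not show. The function names, the timeout value, and the choice to treat a nonzero exit code as a failed check are all assumptions made for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def build_diag_command(run_level: int = 2) -> List[str]:
    """Build the dcgmi command line for a diagnostic run at `run_level`."""
    return ["dcgmi", "diag", "-r", str(run_level)]


def run_bootstrap_diag(run_level: int = 2, timeout_s: int = 900) -> Optional[bool]:
    """Run the diagnostic once during node bootstrap (hypothetical helper).

    Returns None if dcgmi is not installed, True if the command exits
    with status 0, and False otherwise. Treating a nonzero exit status
    as a failed check is an assumption, not documented behaviour.
    """
    if shutil.which("dcgmi") is None:
        return None
    result = subprocess.run(
        build_diag_command(run_level),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0
```

A caller might gate node admission on `run_bootstrap_diag()` returning `True`, leaving throttle-reason telemetry to cover slowdowns that only appear later under load.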

(Source: sources/2026-07-01-databricks-gpu-reliability)
