
PATTERN Cited by 1 source

Multi-stage health check

Definition

A multi-stage health check architecture deploys different categories of verification at different stages of a node's lifecycle, because each failure mode can only be caught under particular conditions. Instead of relying on a single check suite, it runs separate layers at provisioning, during active use, and during idle periods, each targeting failure classes the others cannot reach.

The three-layer instantiation (Databricks)

Databricks' gpu-monitor implements this pattern with three layers:

Layer 1: Active bootstrap checks (deterministic failures)

Run at node provisioning and between workloads. Target: failures that can be reliably reproduced by a targeted test up front. Examples: GPU burn-in, peer connectivity, NCCL correctness, ECC health, PCIe topology.

Invariant: every workload starts on a node that just passed the full active check suite.
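The invariant can be read as an admission gate: a node becomes schedulable only if every active check passes. The sketch below is illustrative only and is not gpu-monitor's code. The `CheckResult` type, the `run_bootstrap_suite` and `admit_node` helpers, and the check names are all hypothetical, and the checks themselves are stand-in callables rather than real burn-in or NCCL tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_bootstrap_suite(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for name, check in checks.items():
        try:
            results.append(CheckResult(name, bool(check())))
        except Exception as exc:  # a crashing check must never admit a node
            results.append(CheckResult(name, False, repr(exc)))
    return results


def admit_node(results: List[CheckResult]) -> bool:
    """Admit only if the suite ran and every check passed."""
    return bool(results) and all(r.passed for r in results)


# Hypothetical check names mirroring the examples above; bodies are stubs.
stub_checks = {
    "gpu_burn_in": lambda: True,
    "peer_connectivity": lambda: True,
    "nccl_correctness": lambda: True,
    "ecc_health": lambda: True,
    "pcie_topology": lambda: True,
}
node_ready = admit_node(run_bootstrap_suite(stub_checks))
```

Treating an empty result list and a raised exception as failures keeps the gate fail-closed, which is what the invariant requires: the absence of evidence never counts as a pass.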

Layer 2: Passive continuous checks (non-deterministic failures)

Run on active nodes during workload execution. Target: failures that only emerge under sustained thermal/electrical/mechanical stress. Examples: NVLink lane degradation, clock throttling, RDMA port-down events, XID errors, thermal gradients.

Key design decision: these checks must be lightweight enough not to compete with the workload for resources, but sensitive enough to detect degradation before it propagates to training outcomes.
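One way to keep a passive check cheap is to make it a pure comparison of already-collected telemetry against a per-node baseline, so the check itself does no GPU work. The following is a minimal sketch under that assumption. The `Sample` and `Baseline` fields, the 10% clock-drop threshold, and the finding names are hypothetical; how the telemetry is actually collected is outside the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Baseline:
    sm_clock_mhz: float
    nvlink_lanes: int
    max_clock_drop: float = 0.10  # assumed threshold, not from the source


@dataclass(frozen=True)
class Sample:
    sm_clock_mhz: float
    nvlink_active_lanes: int
    rdma_port_up: bool
    xid_events: int


def evaluate(sample: Sample, baseline: Baseline) -> List[str]:
    """Return the degradation findings for one telemetry sample."""
    findings = []
    if sample.sm_clock_mhz < baseline.sm_clock_mhz * (1 - baseline.max_clock_drop):
        findings.append("clock_throttling")
    if sample.nvlink_active_lanes < baseline.nvlink_lanes:
        findings.append("nvlink_lane_degradation")
    if not sample.rdma_port_up:
        findings.append("rdma_port_down")
    if sample.xid_events > 0:
        findings.append("xid_error")
    return findings


baseline = Baseline(sm_clock_mhz=1980.0, nvlink_lanes=18)
healthy = Sample(sm_clock_mhz=1965.0, nvlink_active_lanes=18, rdma_port_up=True, xid_events=0)
degraded = Sample(sm_clock_mhz=1500.0, nvlink_active_lanes=16, rdma_port_up=True, xid_events=1)
```

The sensitivity side of the trade-off lives entirely in the thresholds: tightening `max_clock_drop` catches throttling earlier at the cost of more false positives.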

Layer 3: Periodic multi-node active probes (fabric failures)

Run on idle nodes between workloads. Target: inter-node fabric problems that no single node can surface alone. Examples: NCCL collective bandwidth sweeps across node groups at multiple payload sizes.

Trade-off: these are more expensive than single-node checks but catch a class of failures (degraded inter-node links, asymmetric routing) invisible to node-local monitoring.
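To illustrate why the sweep covers multiple payload sizes, the sketch below compares each node group's measured bandwidth with the median across groups at the same payload size and flags outliers. It assumes the sweep results are already available as plain numbers; the `find_degraded_groups` helper, the 20% tolerance, the group names, and the bandwidth values are all made up for illustration.

```python
from statistics import median
from typing import Dict, List, Tuple


def find_degraded_groups(
    sweep: Dict[str, Dict[int, float]], tolerance: float = 0.2
) -> List[Tuple[str, int]]:
    """Flag (group, payload_bytes) pairs whose bandwidth falls more than
    `tolerance` below the cross-group median at that payload size."""
    payload_sizes = sorted({size for results in sweep.values() for size in results})
    flagged = []
    for size in payload_sizes:
        observed = {g: r[size] for g, r in sweep.items() if size in r}
        reference = median(observed.values())
        for group, bandwidth in sorted(observed.items()):
            if bandwidth < reference * (1 - tolerance):
                flagged.append((group, size))
    return flagged


# Hypothetical sweep: bandwidth in GB/s per node group and payload size.
sweep = {
    "group-a": {1 << 20: 40.0, 1 << 30: 180.0},
    "group-b": {1 << 20: 41.0, 1 << 30: 182.0},
    "group-c": {1 << 20: 40.5, 1 << 30: 95.0},
}
flagged = find_degraded_groups(sweep)
```

In this made-up data, group-c looks healthy at the small payload and only falls behind at the large one, which is the kind of degradation a single-size probe could miss.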

Why single-stage doesn't work

  • Bootstrap-only misses load-dependent failures (thermal, sustained-stress degradation).
  • Continuous-only catches deterministic but latent failures only after the workload has already triggered them (for example, bad ECC memory that corrupts on the first write to a specific region).
  • Periodic-only misses timing-sensitive transient failures and leaves nodes unvalidated before workload start.

The compound probability argument (at 1024 GPUs over 30 days, a ~57% chance of at least one failure) means that failures are the expected case at scale, and they can fall into any of the three classes, so any single-layer approach leaves blind spots.
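The arithmetic behind a figure like this can be sketched with the standard "at least one failure" formula, 1 - (1 - p)^N. The source gives only the fleet-level number, so the per-GPU probability below is back-solved from it under an assumption of independent, identically likely failures; it is not a measured rate, and both function names are hypothetical.

```python
def p_at_least_one(per_unit_p: float, units: int) -> float:
    """Probability that at least one of `units` independent units fails."""
    return 1.0 - (1.0 - per_unit_p) ** units


def implied_per_unit_p(fleet_p: float, units: int) -> float:
    """Invert the formula: the per-unit probability that yields `fleet_p`."""
    return 1.0 - (1.0 - fleet_p) ** (1.0 / units)


# Assumption: failures are independent across the 1024 GPUs.
p_gpu = implied_per_unit_p(0.57, 1024)   # per GPU, per 30 days
p_fleet = p_at_least_one(p_gpu, 1024)    # recovers the ~57% figure
p_small = p_at_least_one(p_gpu, 8)       # same per-GPU rate, one 8-GPU node
```

Under that assumption the implied per-GPU probability is on the order of 0.08% per 30 days: individually rare events that compound into a better-than-even chance of at least one failure across the whole job.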

Seen in
