Multi-stage health check¶

Definition¶

A multi-stage health check architecture deploys different categories of verification at different stages of a node's lifecycle, because different failure modes can be caught only under different conditions. Rather than a single check suite, separate layers operate at provisioning, during active use, and during idle periods, each targeting failure classes the others cannot reach.
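A minimal sketch of this structure, assuming a simple registry keyed by lifecycle stage. The names `Stage`, `StagedHealthChecker`, and the lambda checks are hypothetical and exist only for illustration; they are not taken from any real implementation.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    IDLE = "idle"


# A check takes a node id and returns True when the node looks healthy.
Check = Callable[[str], bool]


class StagedHealthChecker:
    """Hypothetical registry that keeps a separate check list per stage."""

    def __init__(self) -> None:
        self._checks: Dict[Stage, List[Check]] = {stage: [] for stage in Stage}

    def register(self, stage: Stage, check: Check) -> None:
        self._checks[stage].append(check)

    def run(self, stage: Stage, node_id: str) -> bool:
        # Run every check for the stage (no short-circuit), then combine.
        outcomes = [check(node_id) for check in self._checks[stage]]
        return all(outcomes)


checker = StagedHealthChecker()
# Stand-in checks: the node passes at provisioning but fails under load.
checker.register(Stage.PROVISIONING, lambda node: True)
checker.register(Stage.ACTIVE, lambda node: node != "node-7")

healthy_at_boot = checker.run(Stage.PROVISIONING, "node-7")
healthy_under_load = checker.run(Stage.ACTIVE, "node-7")
```

In this toy example the same node passes the provisioning stage and fails the active stage, which is the situation a single check suite would miss.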

The three-layer instantiation (Databricks)¶

Databricks' gpu-monitor implements this pattern with three layers:

Layer 1: Active bootstrap checks (deterministic failures)¶

Run at node provisioning and between workloads. Target: failures that can be reliably reproduced by a targeted test up front. Examples: GPU burn-in, peer connectivity, NCCL correctness, ECC health, PCIe topology.

Invariant: every workload starts on a node that just passed the full active check suite.
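The invariant can be sketched as an admission gate, assuming each active check is a callable that returns a boolean. The check names and the `run_bootstrap_suite` / `admit_workload` helpers below are hypothetical stand-ins, not gpu-monitor's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_bootstrap_suite(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every active check; a check that raises is recorded as a failure."""
    results: List[CheckResult] = []
    for name, check in checks.items():
        try:
            results.append(CheckResult(name, bool(check())))
        except Exception as exc:
            results.append(CheckResult(name, False, repr(exc)))
    return results


def admit_workload(results: List[CheckResult]) -> bool:
    """Admit only if the suite actually ran and every check passed."""
    return len(results) > 0 and all(r.passed for r in results)


# Hypothetical stand-ins for real checks (burn-in, connectivity, ECC).
checks = {
    "gpu_burn_in": lambda: True,
    "peer_connectivity": lambda: True,
    "ecc_health": lambda: False,
}
results = run_bootstrap_suite(checks)
failed = [r.name for r in results if not r.passed]
admitted = admit_workload(results)
```

Treating an empty result list as a failure is a deliberate choice in this sketch: a node on which no checks ran has not "just passed the full suite".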

Layer 2: Passive continuous checks (non-deterministic failures)¶

Run on active nodes during workload execution. Target: failures that only emerge under sustained thermal/electrical/mechanical stress. Examples: NVLink lane degradation, clock throttling, RDMA port-down events, XID errors, thermal gradients.

Key design decision: these checks must be lightweight enough not to compete with the workload for resources, but sensitive enough to detect degradation before it propagates to training outcomes.
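One way to balance low overhead against sensitivity is to evaluate already-collected telemetry against thresholds and flag only sustained breaches. The sketch below assumes samples arrive as plain dictionaries. The metric names, thresholds, and the `PassiveMonitor` class are illustrative assumptions, not values or APIs from the source.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class PassiveRule:
    metric: str        # hypothetical metric name
    threshold: float   # a sample breaches when its value exceeds this
    consecutive: int   # breaches in a row required before flagging


@dataclass
class PassiveMonitor:
    rules: List[PassiveRule]
    history: Dict[str, Deque[bool]] = field(default_factory=dict, init=False)

    def observe(self, sample: Dict[str, float]) -> List[str]:
        """Record one telemetry sample and return the metrics now flagged."""
        flagged: List[str] = []
        for rule in self.rules:
            if rule.metric not in sample:
                continue
            window = self.history.setdefault(
                rule.metric, deque(maxlen=rule.consecutive)
            )
            window.append(sample[rule.metric] > rule.threshold)
            # Flag only when the window is full and every entry is a breach.
            if len(window) == rule.consecutive and all(window):
                flagged.append(rule.metric)
        return flagged


monitor = PassiveMonitor([
    PassiveRule("gpu_temp_c", threshold=85.0, consecutive=3),
    PassiveRule("xid_error_count", threshold=0.0, consecutive=1),
])
samples = [
    {"gpu_temp_c": 80.0},
    {"gpu_temp_c": 86.0},
    {"gpu_temp_c": 87.0},
    {"gpu_temp_c": 88.0, "xid_error_count": 1.0},
]
alerts = [monitor.observe(s) for s in samples]
```

The evaluation itself does no GPU work; it only compares numbers that a telemetry agent would already be collecting, which keeps it out of the workload's way.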

Layer 3: Periodic multi-node active probes (fabric failures)¶

Run on idle nodes between workloads. Target: inter-node fabric problems that no single node can surface alone. Example probe: NCCL collective bandwidth sweeps across node groups at multiple payload sizes.

Trade-off: these are more expensive than single-node checks but catch a class of failures (degraded inter-node links, asymmetric routing) invisible to node-local monitoring.
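A sketch of how sweep results might be analyzed, assuming bandwidth has already been measured per link and per payload size. The data, the 80% cutoff, and the `find_degraded_links` helper are made up for illustration; the source does not specify how gpu-monitor scores its sweeps.

```python
from statistics import median
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def find_degraded_links(
    sweep: Dict[int, Dict[Link, float]], min_fraction: float = 0.8
) -> Dict[Link, List[int]]:
    """Map each slow link to the payload sizes at which it fell behind.

    A link is slow at a payload size when its bandwidth is below
    min_fraction of the median across all links at that size.
    """
    degraded: Dict[Link, List[int]] = {}
    for payload_bytes, per_link in sorted(sweep.items()):
        baseline = median(per_link.values())
        for link, gbps in per_link.items():
            if gbps < min_fraction * baseline:
                degraded.setdefault(link, []).append(payload_bytes)
    return degraded


# Hypothetical measurements (GB/s) for a four-node ring at two payload sizes.
sweep = {
    1 << 20: {("n0", "n1"): 40.0, ("n1", "n2"): 41.0,
              ("n2", "n3"): 39.0, ("n3", "n0"): 40.5},
    1 << 28: {("n0", "n1"): 180.0, ("n1", "n2"): 95.0,
              ("n2", "n3"): 178.0, ("n3", "n0"): 182.0},
}
degraded = find_degraded_links(sweep)
```

In this invented data set the `("n1", "n2")` link looks normal at 1 MiB and falls behind only at 256 MiB, which illustrates why a sweep might cover several payload sizes rather than one.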

Why single-stage doesn't work¶

Bootstrap-only misses load-dependent failures (thermal, sustained-stress degradation).
Continuous-only cannot catch, before the workload hits them, failures that are deterministic but not yet triggered (for example, bad ECC memory that corrupts data on the first write to a specific region).
Periodic-only misses timing-sensitive transient failures and leaves nodes unvalidated before workload start.

The compound probability argument (at 1024 GPUs over 30 days, roughly a 57% chance of at least one failure) means failures are expected rather than exceptional at this scale. Because a given failure can belong to any of the three classes, any single-layer approach leaves blind spots.
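As a rough illustration of the compound probability argument, the sketch below assumes each GPU fails independently with the same probability p over the window, so that P(at least one failure) = 1 - (1 - p)^N. The source states only the aggregate figure (~57% at 1024 GPUs over 30 days). The per-GPU rate computed here is backed out from that figure under the independence assumption and is not a number taken from the source. Function names are illustrative.

```python
def p_any_failure(p_per_gpu: float, n_gpus: int) -> float:
    """Probability that at least one of n independent GPUs fails."""
    return 1.0 - (1.0 - p_per_gpu) ** n_gpus


def implied_per_gpu_rate(p_any: float, n_gpus: int) -> float:
    """Invert the formula above: per-GPU probability implied by p_any."""
    return 1.0 - (1.0 - p_any) ** (1.0 / n_gpus)


# Back out the per-GPU, per-window probability implied by the quoted figure.
p = implied_per_gpu_rate(0.57, 1024)

# Under the same assumptions, see how the aggregate risk grows with fleet size.
p_at_4096 = p_any_failure(p, 4096)
```

Under these assumptions p comes out near 0.08% per GPU per 30 days, and the same per-GPU rate gives roughly a 97% chance of at least one failure at 4096 GPUs. Real failures are often correlated (shared power, cooling, or fabric), so these numbers are indicative only.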

Seen in¶

sources/2026-07-01-databricks-gpu-reliability — canonical description with Databricks' gpu-monitor as exemplar.

patterns/node-quarantine-and-retest — the operational response triggered by any layer
systems/gpu-monitor — Databricks' implementation of this pattern
concepts/hardware-reliability-at-scale — the scaling law that motivates multi-layer checking