PATTERN Cited by 1 source
Multi-stage health check¶
Definition¶
A multi-stage health check architecture runs different categories of verification at different stages of a node's lifecycle, because each failure mode can only be caught under particular conditions. Instead of a single check suite, separate layers operate at provisioning, during active use, and during idle periods, and each layer targets failure classes the others cannot reach.
The three-layer instantiation (Databricks)¶
Databricks' gpu-monitor implements this pattern with three layers:
Layer 1: Active bootstrap checks (deterministic failures)¶
Run at node provisioning and between workloads. Target: failures that can be reliably reproduced by a targeted test up front. Examples: GPU burn-in, peer connectivity, NCCL correctness, ECC health, PCIe topology.
Invariant: every workload starts on a node that just passed the full active check suite.
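The invariant above can be sketched as a small scheduling gate. This is a minimal illustration, not gpu-monitor's actual code: the `Node` class, `run_bootstrap_suite`, `start_workload`, and the check names are all hypothetical, and each check is assumed to be a callable that returns `True` on pass.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical check type: a zero-argument callable returning True on pass.
Check = Callable[[], bool]


@dataclass
class Node:
    name: str
    # True only between a full suite pass and the start of the next workload.
    validated: bool = False


def run_bootstrap_suite(node: Node, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every active check; mark the node validated only if all pass."""
    results = {check_name: bool(check()) for check_name, check in checks.items()}
    # An empty suite never validates a node.
    node.validated = bool(results) and all(results.values())
    return results


def start_workload(node: Node) -> None:
    """Enforce the invariant, then consume the validation."""
    if not node.validated:
        raise RuntimeError(f"{node.name} has not passed the active check suite")
    # The node must be re-validated before its next workload.
    node.validated = False
```

Clearing `validated` when a workload starts is what forces the suite to run again between workloads, rather than only once at provisioning.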
Layer 2: Passive continuous checks (non-deterministic failures)¶
Run on active nodes during workload execution. Target: failures that only emerge under sustained thermal/electrical/mechanical stress. Examples: NVLink lane degradation, clock throttling, RDMA port-down events, XID errors, thermal gradients.
Key design decision: these checks must be lightweight enough not to compete with the workload for resources, but sensitive enough to detect degradation before it propagates to training outcomes.
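One way to balance cost and sensitivity is to compare cheap telemetry samples against thresholds and alert only after several consecutive breaches. The sketch below assumes telemetry arrives as a plain dictionary; the metric names, limits, and the `PassiveMonitor` class are hypothetical, and the debounce rule is an illustrative choice rather than something the source describes.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: metric name -> (kind, limit). Real limits depend on
# the hardware and the workload.
RULES: Dict[str, Tuple[str, float]] = {
    "nvlink_active_lanes": ("min", 4),
    "sm_clock_mhz": ("min", 1200),
    "gpu_temp_c": ("max", 85),
}


def evaluate_sample(sample: Dict[str, float]) -> List[str]:
    """Return the metrics in this sample that breach their limit."""
    breaches = []
    for metric, (kind, limit) in RULES.items():
        value = sample.get(metric)
        if value is None:
            continue  # metric not reported in this sample
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(metric)
    return breaches


class PassiveMonitor:
    """Alert on a metric only after `consecutive` breaching samples in a row."""

    def __init__(self, consecutive: int = 3) -> None:
        self.consecutive = consecutive
        self.streaks: Dict[str, int] = {}

    def observe(self, sample: Dict[str, float]) -> List[str]:
        breached = set(evaluate_sample(sample))
        # Reset the streak of any metric that recovered in this sample.
        for metric in self.streaks:
            if metric not in breached:
                self.streaks[metric] = 0
        alerts = []
        for metric in breached:
            self.streaks[metric] = self.streaks.get(metric, 0) + 1
            if self.streaks[metric] >= self.consecutive:
                alerts.append(metric)
        return sorted(alerts)
```

Each observation is a handful of dictionary lookups, so the monitor itself adds negligible load; the real cost lies in how the telemetry is collected, which this sketch leaves out.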
Layer 3: Periodic multi-node active probes (fabric failures)¶
Run on idle nodes between workloads. Target: inter-node fabric problems that no single node can surface alone. Example probe: NCCL collective bandwidth sweeps across node groups at multiple payload sizes.
Trade-off: these are more expensive than single-node checks but catch a class of failures (degraded inter-node links, asymmetric routing) invisible to node-local monitoring.
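As a rough illustration of how sweep results might be evaluated, the sketch below flags node groups whose measured bandwidth falls well below the median of their peers at the same payload size. The payload sizes, the 0.8 tolerance, the group names, and the median-relative rule are all assumptions for illustration; the source does not say how gpu-monitor scores its sweeps, and collecting the measurements is out of scope here.

```python
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical payload sizes in bytes (1 MiB, 16 MiB, 256 MiB).
PAYLOAD_SIZES = [1 << 20, 1 << 24, 1 << 28]


def find_degraded_groups(
    results: Dict[str, Dict[int, float]],
    tolerance: float = 0.8,
) -> List[Tuple[str, int]]:
    """Flag (group, payload_size) pairs below `tolerance` x the peer median.

    `results` maps group name -> payload size -> measured bandwidth (GB/s).
    """
    flagged = []
    for size in PAYLOAD_SIZES:
        samples = {g: bw[size] for g, bw in results.items() if size in bw}
        if len(samples) < 2:
            continue  # no peers to compare against
        baseline = median(samples.values())
        for group, value in sorted(samples.items()):
            if value < tolerance * baseline:
                flagged.append((group, size))
    return flagged


if __name__ == "__main__":
    sweep = {
        "group-a": {1 << 20: 40.0, 1 << 24: 150.0, 1 << 28: 180.0},
        "group-b": {1 << 20: 41.0, 1 << 24: 152.0, 1 << 28: 100.0},
        "group-c": {1 << 20: 39.0, 1 << 24: 149.0, 1 << 28: 182.0},
    }
    print(find_degraded_groups(sweep))
```

Sweeping several payload sizes matters because a link can look healthy on small, latency-bound messages and only fall behind on large, bandwidth-bound ones, as `group-b` does in the sample data.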
Why single-stage doesn't work¶
- Bootstrap-only misses load-dependent failures (thermal, sustained-stress degradation).
- Continuous-only cannot catch deterministic failures before the workload itself triggers them (for example, an ECC fault that corrupts data on the first write to a specific memory region).
- Periodic-only misses timing-sensitive transient failures and leaves nodes unvalidated before workload start.
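The compound probability argument (at 1024 GPUs over 30 days, a ~57% chance of at least one failure) means failures are the expected case at scale rather than the exception. Over enough jobs, all three failure classes are likely to manifest, so any single-layer approach leaves blind spots.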
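The arithmetic behind the compound probability figure can be reproduced with a short calculation. It assumes that GPU failures are independent and equally likely, which is a simplification the source does not state; under that assumption, the ~57% fleet-level figure implies a per-GPU failure probability on the order of 0.08% over the same 30 days. The function names are illustrative.

```python
def p_any_failure(p_per_gpu: float, n_gpus: int) -> float:
    """Probability that at least one of n independent GPUs fails."""
    return 1.0 - (1.0 - p_per_gpu) ** n_gpus


def implied_per_gpu_rate(p_fleet: float, n_gpus: int) -> float:
    """Invert p_any_failure: per-GPU probability implied by a fleet figure."""
    return 1.0 - (1.0 - p_fleet) ** (1.0 / n_gpus)


if __name__ == "__main__":
    # Back out the per-GPU rate from the quoted figure (independence assumed).
    p = implied_per_gpu_rate(0.57, 1024)
    print(f"implied per-GPU 30-day failure probability: {p:.6f}")
    # The same per-GPU rate on larger or smaller fleets.
    for n in (256, 1024, 4096):
        print(n, round(p_any_failure(p, n), 3))
```

Because the fleet-level probability compounds with the number of GPUs, a per-GPU rate that looks negligible in isolation becomes a likely event at cluster scale.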
Seen in¶
- sources/2026-07-01-databricks-gpu-reliability — canonical description with Databricks' gpu-monitor as exemplar.
Related¶
- patterns/node-quarantine-and-retest — the operational response triggered by any layer
- systems/gpu-monitor — Databricks' implementation of this pattern
- concepts/hardware-reliability-at-scale — the scaling law that motivates multi-layer checking