PATTERN Cited by 1 source

Node quarantine and retest¶

Definition¶

Node quarantine and retest is a fleet-management pattern where nodes that fail any health check are immediately removed from the schedulable pool, isolated (quarantined), put through hardware resets and thorough re-testing, then either returned to the fleet if healthy or permanently removed. The pattern prevents degraded hardware from impacting workloads while preserving the option to reclaim transiently-failed nodes.

Lifecycle¶

Detection — any health check layer (bootstrap, continuous, or periodic) flags a failure.
Immediate removal — the node is cordoned and drained; no new workloads are scheduled.
Quarantine — the node enters an isolated state for diagnosis.
Reset + re-test — hardware resets are applied and the full active check suite is re-run.
Decision — if checks pass, the node re-enters the fleet; if not, it is permanently removed (and potentially RMA'd).

Design rationale¶

At GPU-fleet scale, transient failures are common (thermal events, single-bit upsets, port flaps). Permanently removing every node on first failure wastes expensive hardware. Conversely, immediately returning a failed node risks repeat failures on the next workload. The quarantine-and-retest cycle balances utilization against reliability by applying a time-gated re-validation step.

Seen in¶

sources/2026-07-01-databricks-gpu-reliability — Databricks' gpu-monitor implements this as the response to any check-layer failure.

patterns/multi-stage-health-check — the detection layers that feed into quarantine
systems/gpu-monitor — the system that orchestrates this lifecycle at Databricks
concepts/hardware-reliability-at-scale — the scaling context where quarantine becomes necessary

Node quarantine and retest¶

Definition¶

Lifecycle¶

Design rationale¶

Seen in¶

Related¶