PATTERN Cited by 1 source
Node quarantine and retest¶
Definition¶
Node quarantine and retest is a fleet-management pattern where nodes that fail any health check are immediately removed from the schedulable pool, isolated (quarantined), put through hardware resets and thorough re-testing, then either returned to the fleet if healthy or permanently removed. The pattern prevents degraded hardware from impacting workloads while preserving the option to reclaim transiently-failed nodes.
Lifecycle¶
- Detection — any health check layer (bootstrap, continuous, or periodic) flags a failure.
- Immediate removal — the node is cordoned and drained; no new workloads are scheduled.
- Quarantine — the node enters an isolated state for diagnosis.
- Reset + re-test — hardware resets are applied and the full active check suite is re-run.
- Decision — if checks pass, the node re-enters the fleet; if not, it is permanently removed (and potentially RMA'd).
Design rationale¶
At GPU-fleet scale, transient failures are common (thermal events, single-bit upsets, port flaps). Permanently removing every node on first failure wastes expensive hardware. Conversely, immediately returning a failed node risks repeat failures on the next workload. The quarantine-and-retest cycle balances utilization against reliability by applying a time-gated re-validation step.
Seen in¶
- sources/2026-07-01-databricks-gpu-reliability — Databricks' gpu-monitor implements this as the response to any check-layer failure.
Related¶
- patterns/multi-stage-health-check — the detection layers that feed into quarantine
- systems/gpu-monitor — the system that orchestrates this lifecycle at Databricks
- concepts/hardware-reliability-at-scale — the scaling context where quarantine becomes necessary