Skip to content

PATTERN Cited by 1 source

Node quarantine and retest

Definition

Node quarantine and retest is a fleet-management pattern where nodes that fail any health check are immediately removed from the schedulable pool, isolated (quarantined), put through hardware resets and thorough re-testing, then either returned to the fleet if healthy or permanently removed. The pattern prevents degraded hardware from impacting workloads while preserving the option to reclaim transiently-failed nodes.

Lifecycle

  1. Detection — any health check layer (bootstrap, continuous, or periodic) flags a failure.
  2. Immediate removal — the node is cordoned and drained; no new workloads are scheduled.
  3. Quarantine — the node enters an isolated state for diagnosis.
  4. Reset + re-test — hardware resets are applied and the full active check suite is re-run.
  5. Decision — if checks pass, the node re-enters the fleet; if not, it is permanently removed (and potentially RMA'd).

Design rationale

At GPU-fleet scale, transient failures are common (thermal events, single-bit upsets, port flaps). Permanently removing every node on first failure wastes expensive hardware. Conversely, immediately returning a failed node risks repeat failures on the next workload. The quarantine-and-retest cycle balances utilization against reliability by applying a time-gated re-validation step.

Seen in

Last updated · 567 distilled / 1,685 read