Sequential node cordoning

What it is

Sequential node cordoning is the node-replacement pattern where nodes are cordoned one at a time (not in parallel), and each cordon is followed by a verification checkpoint before the next cordon starts. If any checkpoint surfaces cluster-health issues, the campaign pauses or rolls back — it doesn't auto-proceed.

Contrast this with parallel cordoning, where many nodes are cordoned concurrently to maximize the throughput of a node-replacement campaign. Parallel is faster but concentrates risk; sequential is slower but bounds the blast radius to one node at a time.

Why sequential wins at scale

At fleet scale, parallel cordoning has three failure modes that aren't visible at small scale:

  1. Scheduler overload. Tens of parallel drains produce tens of thousands of simultaneous pod re-schedules. The scheduler's decision loop gets saturated; decisions fall behind.
  2. Transient capacity shortage. Each drained node temporarily removes its pods from service. If the remaining schedulable nodes can't absorb them all at once, pods sit in Pending and workloads degrade.
  3. Catastrophe concentration. If the underlying reason for cordoning is a bad update (wrong AMI, misconfigured EC2NodeClass), parallel cordoning applies the bug to many nodes before anyone notices.

Sequential cordoning bounds the worst case: you lose one node's worth of capacity at a time, and any propagating issue triggers a checkpoint before the next node is touched.
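Failure mode 2 is simple arithmetic. A back-of-the-envelope sketch, using entirely hypothetical numbers (pods per node, per-node headroom) to show how parallel draining overruns spare capacity while sequential draining never does:

```python
# Back-of-the-envelope model of transient capacity shortage during a
# drain campaign. All numbers are hypothetical illustrations.

def pending_pods(total_nodes, drained, pods_per_node, headroom_per_node):
    """Pods left Pending when `drained` nodes are drained at once.

    Each remaining node can absorb `headroom_per_node` extra pods;
    anything displaced beyond that total stays Pending until capacity
    returns.
    """
    displaced = drained * pods_per_node
    spare = (total_nodes - drained) * headroom_per_node
    return max(0, displaced - spare)

# Sequential: one node drained at a time on a 100-node cluster.
print(pending_pods(100, 1, 30, 5))   # 30 displaced vs 495 spare -> 0 Pending

# Parallel: 20 nodes drained at once on the same cluster.
print(pending_pods(100, 20, 30, 5))  # 600 displaced vs 400 spare -> 200 Pending
```

The asymmetry is the point: the sequential case leaves spare capacity untouched at every step, while the parallel case removes absorbers and adds displaced pods at the same time.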

Salesforce canonical instance

The 2026-01-12 Karpenter migration post describes the move from parallel to sequential after the initial parallel approach failed:

"The initial migration approach of cordoning Karpenter nodes in parallel led to unexpected cluster health issues. To address this, the team refined their strategy by implementing sequential node cordoning, adding manual verification checkpoints with rollback capabilities, and deploying enhanced monitoring for early detection of cluster instability. This experience reinforced that even with modern infrastructure tooling, careful orchestration of node maintenance remains crucial for system reliability." (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)

Three components named:

  1. Sequential (not parallel).
  2. Manual verification checkpoints with rollback capability.
  3. Enhanced monitoring for early instability detection.

Shape

  1. One cordon at a time. Pick a node; cordon it.
  2. Drain with PDB respect. Evict pods one at a time, respecting PDBs. Wait for scheduler to place replacements elsewhere.
  3. Cluster-health checkpoint. Verify, manually or via automated gates, that the cluster is healthy: no Pending pods beyond baseline, no PDB violations, no degraded services.
  4. Advance or halt. If healthy, cordon the next node. If not healthy, halt the campaign; trigger rollback if the issue is severe enough.
  5. Enhanced monitoring. Finer-grained signals than routine cluster monitoring: per-node and per-workload health metrics that can surface instability before normal alerting thresholds fire.
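The five steps above reduce to a small control loop. The following is a minimal sketch only: `cordon_and_drain` and `cluster_healthy` are hypothetical stand-ins for the real kubectl/API calls and health gates, which the source does not specify.

```python
# Sketch of a sequential cordoning campaign. The injected helpers
# `cordon_and_drain` and `cluster_healthy` are hypothetical stand-ins
# for real kubectl / API calls and cluster-health gates.

def run_campaign(nodes, cordon_and_drain, cluster_healthy):
    """Cordon nodes one at a time, checkpointing after each.

    Returns (completed_nodes, halted_on). `halted_on` is None when the
    campaign finishes cleanly; otherwise it names the node whose
    checkpoint failed, so an operator can decide whether to roll back.
    """
    completed = []
    for node in nodes:
        cordon_and_drain(node)      # steps 1-2: cordon, PDB-respecting drain
        if not cluster_healthy():   # step 3: cluster-health checkpoint
            return completed, node  # step 4: halt; no auto-proceed
        completed.append(node)      # step 4: healthy -> advance
    return completed, None

# Toy run: the checkpoint fails after the second node is drained.
health = iter([True, False])
done, halted = run_campaign(
    ["node-a", "node-b", "node-c"],
    cordon_and_drain=lambda n: None,
    cluster_healthy=lambda: next(health),
)
print(done, halted)  # ['node-a'] node-b
```

The design choice worth noting is that the loop never decides to roll back itself; it only halts and reports, matching the pattern's "manual verification checkpoints with rollback capabilities" rather than automated remediation.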

When parallel is still appropriate

  • Very small clusters where the total cordon time is already short.
  • Disposable workloads (batch / experimental) where disruption is acceptable.
  • Known-safe campaigns — e.g. post-migration, when the runbook has soaked over many sequential runs and parallelism is a proven safe optimization.

The pattern isn't "never cordon in parallel" — it's "sequential is the safe default at scale until you have evidence parallel is safe for your specific campaign."
