PATTERN
Sequential node cordoning¶
What it is¶
Sequential node cordoning is the node-replacement pattern where nodes are cordoned one at a time (not in parallel), and each cordon is followed by a verification checkpoint before the next cordon starts. If any checkpoint surfaces cluster-health issues, the campaign pauses or rolls back — it doesn't auto-proceed.
Contrast this with parallel cordoning, where many nodes are cordoned concurrently to maximize throughput of a node-replacement campaign. Parallel is faster but concentrates risk; sequential is slower but bounds the blast radius to one node at a time.
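The structural difference between the two strategies can be sketched as two loops. This is a minimal illustration, not the Kubernetes API: `cordon_and_drain` and `cluster_healthy` are hypothetical callables an operator would supply.

```python
def sequential_cordon(nodes, cordon_and_drain, cluster_healthy):
    """Cordon one node at a time; stop at the first unhealthy checkpoint."""
    done = []
    for node in nodes:
        cordon_and_drain(node)
        done.append(node)
        if not cluster_healthy():   # verification checkpoint after every node
            return done, "halted"   # pause/rollback instead of auto-proceeding
    return done, "complete"


def parallel_cordon(nodes, cordon_and_drain):
    """Cordon everything at once; faster, but no checkpoint bounds the blast radius."""
    for node in nodes:
        cordon_and_drain(node)      # fire-and-forget; health is checked only afterwards
    return nodes
```

The key asymmetry: in the sequential version, a health regression stops the campaign after one node, while the parallel version has already touched every node before any check runs.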
Why sequential wins at scale¶
At fleet scale, parallel cordoning has three failure modes that aren't visible at small scale:
- Scheduler overload. Tens of parallel drains produce tens of thousands of simultaneous pod re-schedules. The scheduler's decision loop gets saturated; decisions fall behind.
- Transient capacity shortage. Each drained node temporarily removes its pods from service. If the remaining healthy nodes can't absorb them all at once, pods enter Pending and workloads degrade.
- Catastrophe concentration. If the underlying reason for cordoning is a bad update (wrong AMI, misconfigured EC2NodeClass), parallel cordoning applies the bug to many nodes before anyone notices.
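A rough back-of-the-envelope calculation illustrates the capacity-shortage failure mode. The numbers here are illustrative assumptions, not figures from the source:

```python
PODS_PER_NODE = 30         # assumed average pods hosted per node
SPARE_POD_CAPACITY = 100   # assumed headroom across the remaining nodes

def pending_pods(nodes_drained_at_once):
    """Pods displaced at once, minus the headroom that can absorb them."""
    displaced = nodes_drained_at_once * PODS_PER_NODE
    return max(0, displaced - SPARE_POD_CAPACITY)

print(pending_pods(1))    # sequential: 30 displaced, headroom absorbs them -> 0 Pending
print(pending_pods(20))   # parallel: 600 displaced at once -> 500 stuck Pending
```

Sequential draining keeps the displaced-pod count within headroom at every step; parallel draining can exceed it in one burst, which is exactly the Pending pile-up described above.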
Sequential cordoning bounds the worst case: you lose one node's worth of capacity at a time, and any propagating issue triggers a checkpoint before the next node is touched.
Salesforce canonical instance¶
The 2026-01-12 Karpenter migration post describes the move from parallel to sequential after the initial parallel approach failed:
"The initial migration approach of cordoning Karpenter nodes in parallel led to unexpected cluster health issues. To address this, the team refined their strategy by implementing sequential node cordoning, adding manual verification checkpoints with rollback capabilities, and deploying enhanced monitoring for early detection of cluster instability. This experience reinforced that even with modern infrastructure tooling, careful orchestration of node maintenance remains crucial for system reliability." (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
Three components named:
- Sequential (not parallel).
- Manual verification checkpoints with rollback capability.
- Enhanced monitoring for early instability detection.
Shape¶
- One cordon at a time. Pick a node; cordon it.
- Drain with PDB respect. Evict pods one at a time, respecting PDBs. Wait for the scheduler to place replacements elsewhere.
- Cluster-health checkpoint. Verify — manually or via automated gates — that the cluster is in a healthy state: no pods pending beyond normal, no PDB violations, no degraded services.
- Advance or halt. If healthy, cordon the next node. If not healthy, halt the campaign; trigger rollback if the issue is severe enough.
- Enhanced monitoring. Finer-grained signals than regular cluster monitoring — per-node / per-workload health metrics that can surface instability earlier than standard alerting thresholds would.
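The steps above can be sketched as a single campaign loop. This is a sketch under assumptions: `cordon`, `drain`, `health_checkpoint`, and `rollback` are hypothetical hooks the operator supplies, e.g. wrapping `kubectl cordon`, a PDB-respecting `kubectl drain`, and the enhanced-monitoring queries.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Campaign:
    cordon: Callable[[str], None]           # step 1: cordon one node at a time
    drain: Callable[[str], None]            # step 2: PDB-respecting drain
    health_checkpoint: Callable[[], bool]   # step 3: cluster-health gate
    rollback: Callable[[str], None]         # invoked when a checkpoint fails
    replaced: List[str] = field(default_factory=list)

    def run(self, nodes):
        for node in nodes:
            self.cordon(node)
            self.drain(node)
            if not self.health_checkpoint():  # step 4: advance or halt
                self.rollback(node)
                return "halted"
            self.replaced.append(node)
        return "complete"
```

The design choice worth noting is that the checkpoint sits between every pair of nodes, so a failed gate halts the campaign with at most one node's worth of disruption outstanding.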
When parallel is still appropriate¶
- Very small clusters where the total cordon time is already short.
- Disposable workloads (batch / experimental) where disruption is acceptable.
- Known-safe campaigns — e.g. post-migration, when the runbook has soaked over many sequential runs and parallelism is a proven safe optimization.
The pattern isn't "never cordon in parallel" — it's "sequential is the safe default at scale until you have evidence parallel is safe for your specific campaign."
Related¶
- patterns/disruption-budget-guarded-upgrades — the upstream compound pattern; sequential cordoning is the how-to for the actual node replacement step.
- concepts/pod-disruption-budget — the primitive respected during each drain.
- systems/karpenter — the autoscaler whose node-replacement campaigns benefit most from this pattern.
- systems/aws-eks — the managed K8s service Salesforce ran the pattern on.
Seen in¶
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — Salesforce's pivot from parallel to sequential-with-checkpoints after the initial parallel approach caused cluster health issues. Canonical wiki instance of the anti-parallel lesson.