PATTERN
Disruption-Budget-Guarded Upgrades¶
What it is¶
Disruption-budget-guarded upgrades is the pattern of protecting workloads against platform-driven node churn using three composable primitives:
- Maintenance window — pin when the platform is allowed to terminate and replace nodes (typically off-peak hours).
- Pod Disruption Budgets (PDBs) — bound how many pod replicas can be evicted simultaneously during a drain.
- Node Disruption Budgets — bound how many nodes in the cluster can be replaced concurrently.
This is the canonical customer-retained safety contract when operating on a managed-data-plane K8s service (e.g. EKS Auto Mode) where the platform actively terminates nodes on its own upgrade cadence.
Why each primitive is load-bearing¶
1. Maintenance window — when the churn happens¶
Without it, the platform upgrades whenever it chooses. With it, node termination and replacement are confined to the customer-nominated window.
- Prevents business-hours churn — traffic volume, SRE on-call posture, and error budgets align better with late-night/weekend upgrades.
- Coordinates with other maintenance — DB maintenance windows, external-integration downtime schedules, etc.
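On EKS Auto Mode, node lifecycle is driven through Karpenter-style NodePools, and the window can be expressed as a scheduled disruption budget: a default cap, plus a `nodes: "0"` budget active during business hours that blocks all voluntary disruption outside the window. A sketch (NodePool name and cron values are illustrative, not from the source):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default           # illustrative pool name
spec:
  disruption:
    budgets:
      # Default: allow node replacement, capped below
      - nodes: "10%"
      # During business hours (09:00-21:00 UTC, Mon-Fri),
      # the most restrictive active budget wins: zero disruption
      - nodes: "0"
        schedule: "0 9 * * 1-5"
        duration: 12h
```

Because Karpenter applies the most restrictive budget currently in effect, the `nodes: "0"` entry pins platform-driven churn to the complement of the schedule, i.e. the off-peak window.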
2. Pod Disruption Budgets — service-level availability during drain¶
Without PDBs, Kubernetes's eviction API will drain a node by deleting its pods with no regard for replica topology — three replicas of the same service co-located on that node get deleted back-to-back, taking the service down until the Deployment controller reconciles replacements onto other nodes.
With a PDB like minAvailable: 2 on a 3-replica deployment, the eviction API rejects any eviction that would leave fewer than 2 replicas running, so the drain pauses until a replacement pod is Ready elsewhere.
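A minimal PDB of this shape (app name and label are hypothetical; the `policy/v1` API and `minAvailable` field are standard Kubernetes):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb        # hypothetical name
spec:
  minAvailable: 2           # evictions that would drop below 2 are denied
  selector:
    matchLabels:
      app: checkout         # matches a 3-replica Deployment's pods
```

With 3 replicas and minAvailable: 2, a drain may evict at most one pod of this service at a time.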
3. Node Disruption Budgets — cluster-level capacity during upgrade¶
Without NDBs, the platform's node-upgrade campaign can terminate many nodes concurrently, pushing cluster capacity below the sum of Deployment replica counts even if PDBs are respected. The result: PDBs satisfied locally, but scheduling starvation globally — pods are Evicted, can't schedule, and the cluster enters a degraded state.
NDBs throttle node replacement rate to stay above the combined replica-placement capacity.
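In Karpenter-style NodePools (the mechanism underneath EKS Auto Mode), this throttle is the `disruption.budgets` list; when multiple budgets are active, the most restrictive applies. A sketch with illustrative values:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      - nodes: "10%"   # at most 10% of this pool's nodes disrupted at once
      - nodes: "2"     # ...and never more than 2 nodes in absolute terms
```

The percentage cap scales with cluster size; the absolute cap keeps small clusters from losing a large fraction of capacity in one step.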
Why this pattern emerges under managed-data-plane K8s¶
The pattern is structurally a response to the shifted shared-responsibility line of services like EKS Auto Mode:
- Pre-Auto-Mode: customer controlled when nodes were upgraded (they scheduled it); disruption control was a correctness-only concern.
- Post-Auto-Mode: AWS controls the upgrade cadence (typically weekly AMI replacements). Customer no longer picks the moment — customer picks the constraints around how fast it can happen.
The three primitives are the three dimensions of "fast enough to keep up with platform cadence, slow enough to preserve availability."
Canonical instance¶
Generali Malaysia's EKS Auto Mode adoption:
"The team had to create disruption control configurations to prevent those disruptions from impacting workloads. For example, they specified a maintenance window during off-peak hours for those upgrades. They also specified Pod Disruption Budgets and Node Disruptions Budgets to make sure critical applications would not see all the pods of a micro-service being terminated at the same time." (Source: sources/2026-03-23-aws-generali-malaysia-eks-auto-mode)
The three primitives are explicitly named together as one compound safeguard.
Pairs with stateless-only discipline¶
Generali's additional operational discipline — "only allow stateless micro-services", "treat the underlying pods as immutable" — is what makes the pattern safe and cheap rather than safe and expensive. Stateful pods would force the customer to also manage StatefulSet replacement, volume reattachment, and leader-election handoff under managed churn. Stateless pods just need to come up on a new node; the three primitives above are sufficient.
Caveats¶
- Single-replica workloads can't be PDB-protected. A PDB requiring at least one available pod blocks every eviction of a 1-replica deployment, so drains hang. Such workloads must be scaled to ≥2 replicas or accept disruption.
- NDBs are platform-specific. Not all managed K8s services expose NDBs as a first-class primitive — the pattern's applicability depends on the platform surfacing the control.
- PDBs deadlock drains at maximal restrictiveness. minAvailable equal to the total replica count means no eviction is ever permitted, so drains never progress; leave headroom.
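The deadlock caveat in concrete form — a hypothetical 3-replica Deployment whose PDB demands all 3 stay available, so every eviction violates the budget and node drains block indefinitely:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: reports-pdb        # hypothetical name
spec:
  minAvailable: 3          # equals the Deployment's replica count:
  selector:                # no eviction can ever be permitted, so any
    matchLabels:           # node hosting a replica can never finish draining
      app: reports
```

Setting minAvailable: 2 (or maxUnavailable: 1) restores one pod of headroom and lets drains proceed.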
Seen in¶
- sources/2026-03-23-aws-generali-malaysia-eks-auto-mode — Generali's compound safety contract under EKS Auto Mode: off-peak maintenance window + PDBs + NDBs, paired with stateless-only and immutable-pod discipline. Canonical wiki reference for the managed-data-plane variant of the pattern.
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — Salesforce's 1,000-cluster Karpenter migration adds the customer-managed-autoscaler variant of the same compound safeguard: the autoscaler (Karpenter) — not the platform — drives the node-replacement cadence, but the safety contract is isomorphic. Salesforce extends the pattern with two additional primitives the Generali instance didn't surface: OPA-enforced PDB admission validation (PDB hygiene as a governance primitive, not an app-team knob), and explicit singleton-workload protection via guaranteed-pod-lifetime + workload-aware disruption policies (because PDBs structurally can't protect 1-replica pods). Operational-lesson extension: parallel cordoning destabilized clusters; the working form is sequential cordoning with verification checkpoints.
Related¶
- concepts/pod-disruption-budget — the primitive.
- concepts/shared-responsibility-model — why this pattern becomes load-bearing under managed K8s services.
- concepts/managed-data-plane — the platform-level property that forces the pattern in the Generali variant.
- concepts/singleton-workload — the workload class PDBs can't protect; Salesforce's extension uses guaranteed-pod-lifetime + workload-aware policies.
- systems/eks-auto-mode — canonical managed-data-plane platform.
- systems/karpenter — canonical customer-managed-autoscaler whose consolidation campaigns this pattern guards.
- systems/open-policy-agent — admission-layer enforcement for PDB correctness (Salesforce extension).
- systems/kubernetes — the underlying primitives.
- patterns/sequential-node-cordoning — the execution-level extension Salesforce learned the hard way.