CONCEPT Cited by 1 source
StatefulSet highest-ordinal scale-in
Definition
A Kubernetes StatefulSet scales in by removing the pod with the highest ordinal: if the StatefulSet has pods statefulset-0, statefulset-1, …, statefulset-N (N + 1 pods in total), scaling down to N pods removes statefulset-N. The semantic is deterministic and documented. It is the reverse of the scale-up order: pods are created 0, 1, …, N and removed N, …, 1, 0.
"The pods in a StatefulSet are numbered and the one with the highest number is always chosen for removal when scaling in."
(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)
Why the semantic matters
Operators and scheduling plans can rely on knowing which pod will be removed next. An operator orchestrating a graceful drain knows the target pod by arithmetic, not by inspection, so it can pre-drain that specific pod instead of waiting to discover which pod the controller picked.
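The "by arithmetic" point can be made concrete with a minimal sketch. The function name `scale_in_targets` and its parameters are hypothetical, and the sketch assumes ordinals start at 0 and are contiguous; it is an illustration of the rule, not code from any operator.

```python
def scale_in_targets(name: str, current_replicas: int, target_replicas: int) -> list[str]:
    """Return the pods a scale-in would remove, highest ordinal first.

    Assumption: pods are named name-0 .. name-(current_replicas - 1),
    i.e. ordinals start at 0 and have no gaps.
    """
    if not 0 <= target_replicas <= current_replicas:
        raise ValueError("target_replicas must be between 0 and current_replicas")
    # Walk ordinals downward from the highest one to the first one that is kept.
    return [
        f"{name}-{ordinal}"
        for ordinal in range(current_replicas - 1, target_replicas - 1, -1)
    ]
```

For example, `scale_in_targets("web", 5, 3)` yields `web-4` and then `web-3`, so a drain routine could start on `web-4` before the replica count is changed.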
The coupling problem
The semantic becomes a liability when the highest-ordinal pod is not necessarily the "best" pod to remove. StatefulSet's removal order is a function of ordinal, not of the pod's current role, load, zone, or readiness. Three concrete couplings:
- Zone-aware workloads: the highest-ordinal pod may be the only pod in one AZ. Removing it violates zone-spread; drains get stuck. Zalando Lounge 2024-06-20 is the canonical wiki instance.
- Leader-follower workloads: if `statefulset-0` is the current leader and `statefulset-N` is the most recent follower, removing the follower is correct and the semantic serves you. If a leadership change has flipped `statefulset-N` to leader (unusual but possible), removing the leader is destructive and the semantic hurts.
- Load-hottest-last: if workload distribution has made `statefulset-N` the currently hottest pod, removing it triggers the largest redistribution. Not wrong per se, just possibly not the optimal eviction.
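The zone-aware coupling can be checked before scaling in. The sketch below is a hedged illustration: the function name and the `pod_zones` mapping (pod name to zone) are hypothetical inputs that a caller would have to build from its own cluster state, and pod names are assumed to end in `-<ordinal>`.

```python
from collections import Counter


def highest_ordinal_is_zone_singleton(pod_zones: dict[str, str]) -> bool:
    """Return True if the next scale-in target is the only pod in its zone.

    pod_zones maps pod name -> zone (hypothetical input). Raises ValueError
    if the mapping is empty.
    """
    def ordinal(pod_name: str) -> int:
        # Assumption: the ordinal is the integer after the last hyphen.
        return int(pod_name.rsplit("-", 1)[1])

    target = max(pod_zones, key=ordinal)        # the pod scale-in removes next
    pods_per_zone = Counter(pod_zones.values())
    return pods_per_zone[pod_zones[target]] == 1
```

A caller might refuse to scale in, or raise an alert, when this returns True, since removing that pod would leave its zone empty.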
Alternatives
- Pod-level anti-affinity + per-zone floors at the scheduling layer, so any ordinal is equally safe to remove.
- Custom operator that selects eviction candidates by criteria other than ordinal — PlanetScale's Vitess Operator runs plain pods instead of StatefulSets specifically so that the operator can pick candidates by Vitess-aware logic; the StatefulSet ordinal semantic is given up intentionally because it was the wrong abstraction.
- StatefulSet `podManagementPolicy: Parallel` changes startup/shutdown ordering but not the scale-in-picks-highest-ordinal rule.
- StatefulSet `updateStrategy` is for rolling updates, not scale-in; it is orthogonal to this semantic.
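To illustrate what "criteria other than ordinal" could look like in a custom operator, here is a minimal sketch. It is not the Vitess Operator's logic: the `Pod` type, the `pick_eviction_candidate` function, the `min_per_zone` floor, and the choice of "non-leader, zone floor respected, lowest load" as the policy are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pod:
    # Hypothetical view of a pod as an operator might track it.
    name: str
    zone: str
    is_leader: bool
    load: float


def pick_eviction_candidate(pods: list[Pod], min_per_zone: int = 1) -> Optional[Pod]:
    """Pick a pod to remove by role, zone floor, and load instead of ordinal.

    Returns None when no pod can be removed without evicting a leader or
    dropping a zone below min_per_zone.
    """
    pods_per_zone = Counter(p.zone for p in pods)
    eligible = [
        p for p in pods
        if not p.is_leader and pods_per_zone[p.zone] > min_per_zone
    ]
    if not eligible:
        return None
    # Prefer the least-loaded pod; break ties by name for determinism.
    return min(eligible, key=lambda p: (p.load, p.name))
```

With four pods where the highest-ordinal pod is the only one in its zone, this policy skips that pod and picks the least-loaded follower in a zone that can spare one, which is exactly the choice the ordinal rule cannot express.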
Interaction with ephemeral storage
The highest-ordinal semantic works well when ordinals are stably bound to something useful (e.g. an EBS volume pinned to an AZ). Under ephemeral storage with cross-zone drift, ordinals carry no such semantic guarantee, and the deterministic "highest first" rule picks an arbitrary pod that may happen to be structurally wrong to remove.
Seen in
- sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes: canonical wiki instance. `es-data-production-v2-6`, the highest-ordinal pod, was the only pod in `eu-central-1a`; StatefulSet scale-in targeted it; shard-allocation awareness refused to move its shards; the drain got stuck.