
CONCEPT Cited by 1 source

StatefulSet highest-ordinal scale-in

Definition

A Kubernetes StatefulSet scales in by removing the pod with the highest ordinal first. If the StatefulSet has pods statefulset-0, statefulset-1, …, statefulset-N (that is, N+1 replicas), scaling down to N replicas removes statefulset-N. The semantic is deterministic and documented: removal is the reverse of the scale-up order (pods are created 0, 1, …, N and removed N, …, 1, 0).

"The pods in a StatefulSet are numbered and the one with the highest number is always chosen for removal when scaling in."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Why the semantic matters

Operators and scheduling plans can rely on knowing which pod will be removed next. An operator orchestrating a graceful drain knows the target pod by arithmetic, not by inspection: with R replicas, the next pod to go is the one with ordinal R-1. It can pre-drain that specific pod instead of reacting to whichever pod the controller happens to pick.
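The arithmetic can be sketched in a few lines. This is a minimal illustration, not part of any Kubernetes client library: the function name scale_in_targets is hypothetical, and it assumes the default numbering where ordinals start at 0.

```python
def scale_in_targets(name: str, current: int, desired: int) -> list[str]:
    """Pods removed when scaling from `current` to `desired` replicas,
    listed in removal order (highest ordinal first).

    Assumes ordinals start at 0, so the live pods are name-0 .. name-(current-1).
    """
    if not 0 <= desired <= current:
        raise ValueError("desired must be between 0 and current")
    # Walk ordinals downward from the highest live one to the first survivor + 1.
    return [f"{name}-{i}" for i in range(current - 1, desired - 1, -1)]


# A drain orchestrator could pre-drain these pods, in this order,
# before lowering the replica count.
print(scale_in_targets("es-data", 5, 3))
```

The pod name es-data is made up for the example. The point is only that the drain list depends on the two replica counts and nothing else.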

The coupling problem

The semantic becomes a liability when the highest-ordinal pod is not necessarily the "best" pod to remove. StatefulSet's removal order is a function of ordinal, not of the pod's current role, load, zone, or readiness. Three concrete couplings:

  • Zone-aware workloads: the highest-ordinal pod may be the only pod in one AZ. Removing it violates zone-spread; drains get stuck. Zalando Lounge 2024-06-20 is the canonical wiki instance.
  • Leader-follower workloads: if statefulset-0 is the current leader and statefulset-N is the most recent follower, removing the follower is correct and the semantic serves you. If a leadership change has flipped statefulset-N to leader (unusual but possible), removing the leader is destructive and the semantic hurts.
  • Load-hottest-last: if workload distribution has made statefulset-N the currently hottest pod, removing it triggers the largest redistribution. Not wrong per se, just possibly not the optimal eviction.
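The zone-aware coupling can be checked before a scale-in is attempted. The sketch below is an assumption-laden illustration: the ordinal-to-zone mapping would have to come from inspecting the running pods, the zone names are invented, and zones_emptied_by_scale_in is a hypothetical helper rather than an existing API.

```python
def zones_emptied_by_scale_in(pod_zones: dict[int, str], desired: int) -> set[str]:
    """Zones that would be left with no pods after scaling in to `desired` replicas.

    pod_zones maps each pod ordinal to the zone it currently runs in.
    Scale-in keeps ordinals 0..desired-1 and removes everything above.
    """
    zones_before = set(pod_zones.values())
    zones_after = {zone for ordinal, zone in pod_zones.items() if ordinal < desired}
    return zones_before - zones_after


# Hypothetical placement: the highest-ordinal pod is the only one in zone "c".
placement = {0: "zone-a", 1: "zone-a", 2: "zone-b", 3: "zone-c"}
print(zones_emptied_by_scale_in(placement, desired=3))
```

A non-empty result means the ordinal rule is about to remove the last pod in a zone, which is the situation where zone-spread constraints and the scale-in order conflict.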

Alternatives

  • Pod-level anti-affinity + per-zone floors at the scheduling layer, so any ordinal is equally safe to remove.
  • Custom operator that selects eviction candidates by criteria other than ordinal. PlanetScale's Vitess Operator runs plain pods instead of StatefulSets specifically so that the operator can pick candidates by Vitess-aware logic; the StatefulSet ordinal semantic is given up intentionally because it was the wrong abstraction.
  • StatefulSet podManagementPolicy: Parallel lets the controller launch or terminate pods in parallel instead of waiting for each one in turn, but it does not change the scale-in-picks-highest-ordinal rule.
  • StatefulSet updateStrategy is for rolling updates, not scale-in — it's orthogonal to this semantic.
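To make the custom-operator alternative concrete, here is a minimal sketch of candidate selection by criteria other than ordinal. It is not the Vitess Operator's logic or any real operator's API: the PodInfo type, the pick_eviction_candidate function, the zone_floor parameter, and the choice of "lowest load among safe pods" as the policy are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical view of a pod as a custom operator might track it."""
    name: str
    zone: str
    is_leader: bool
    load: float


def pick_eviction_candidate(pods: list[PodInfo], zone_floor: int = 1) -> Optional[PodInfo]:
    """Pick the least-loaded pod whose removal keeps every zone at or above
    zone_floor pods and does not remove a leader. Returns None if no pod is safe."""
    per_zone = Counter(p.zone for p in pods)
    eligible = [
        p for p in pods
        if not p.is_leader and per_zone[p.zone] - 1 >= zone_floor
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p.load)


pods = [
    PodInfo("db-0", "zone-a", is_leader=True, load=0.2),
    PodInfo("db-1", "zone-a", is_leader=False, load=0.5),
    PodInfo("db-2", "zone-a", is_leader=False, load=0.3),
    PodInfo("db-3", "zone-b", is_leader=False, load=0.1),
]
candidate = pick_eviction_candidate(pods)
print(candidate.name if candidate else None)
```

In this made-up placement, the ordinal rule would remove db-3, the only pod in zone-b. The role- and zone-aware policy skips it and picks a pod from the over-represented zone instead.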

Interaction with ephemeral storage

The highest-ordinal semantic works well when ordinals are stably bound to something useful (e.g. an EBS volume pinned to an AZ). Under ephemeral storage with cross-zone drift, ordinals carry no such semantic guarantee, and the deterministic "highest first" rule picks an arbitrary pod that may happen to be structurally wrong to remove.

Seen in
