CONCEPT

StatefulSet highest-ordinal scale-in

Definition

A Kubernetes StatefulSet scales in by removing the pod with the highest ordinal: if the StatefulSet has pods statefulset-0, statefulset-1, …, statefulset-N (N+1 pods in total), a scale-down to N pods removes statefulset-N. The semantic is deterministic and documented, and it is the reverse of the scale-up order (pods are created 0, 1, …, N and removed N, …, 1, 0).

"The pods in a StatefulSet are numbered and the one with the highest number is always chosen for removal when scaling in."

(Source: )
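
The ordering rule is small enough to write down as arithmetic. The following is a minimal sketch, not Kubernetes code: the function name and the `web` StatefulSet are made up for illustration, and it assumes ordinals start at 0 (the default).

```python
def scale_in_removal_order(name: str, current: int, desired: int) -> list[str]:
    """Return the pods a scale-in removes, in removal order (highest ordinal first)."""
    if not 0 <= desired <= current:
        raise ValueError("desired must be between 0 and the current replica count")
    # Ordinals run 0..current-1; scale-in walks down from the top.
    return [f"{name}-{ordinal}" for ordinal in range(current - 1, desired - 1, -1)]


# Four pods (web-0..web-3) scaled in to two: web-3 goes first, then web-2.
print(scale_in_removal_order("web", current=4, desired=2))  # ['web-3', 'web-2']
```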

Why the semantic matters

Operators and scheduling plans can rely on which pod will be removed next. An operator orchestrating a graceful drain knows the target pod by arithmetic, not by inspection; it can pre-drain that specific pod instead of waiting to discover which pod the controller picked and racing its termination.
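
A drain-then-shrink step built on that guarantee might look like the sketch below. The `drain` and `set_replicas` callbacks are hypothetical stand-ins for whatever the operator actually does (for example, relocating data and patching the replica count); only the target-name arithmetic comes from the StatefulSet semantic.

```python
from typing import Callable


def graceful_scale_in(
    name: str,
    replicas: int,
    drain: Callable[[str], None],
    set_replicas: Callable[[int], None],
) -> str:
    """Drain the pod that scale-in will remove, then shrink the StatefulSet by one."""
    if replicas < 1:
        raise ValueError("nothing to scale in")
    target = f"{name}-{replicas - 1}"  # known by arithmetic, no lookup needed
    drain(target)                      # hypothetical: move data/traffic off the pod first
    set_replicas(replicas - 1)         # hypothetical: lower the replica count
    return target


# Usage with stand-in callbacks that only record what would happen.
log: list[str] = []
graceful_scale_in(
    "es-data",
    7,
    drain=lambda pod: log.append(f"drain {pod}"),
    set_replicas=lambda n: log.append(f"replicas={n}"),
)
print(log)  # ['drain es-data-6', 'replicas=6']
```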

The coupling problem

The semantic becomes a liability when the highest-ordinal pod is not necessarily the "best" pod to remove. StatefulSet's removal order is a function of ordinal, not of the pod's current role, load, zone, or readiness. Three concrete couplings:

  • Zone-aware workloads: the highest-ordinal pod may be the only pod in one AZ. Removing it violates zone-spread; drains get stuck. Zalando Lounge 2024-06-20 is the canonical wiki instance.
  • Leader-follower workloads: if statefulset-0 is the current leader and statefulset-N is the most recent follower, removing the follower is correct and the semantic serves you. If a leadership change has flipped statefulset-N to leader (unusual but possible), removing the leader is destructive and the semantic hurts.
  • Load-hottest-last: if workload distribution has made statefulset-N the currently hottest pod, removing it triggers the largest redistribution. Not wrong per se, just possibly not the optimal eviction.
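
The zone-aware case can be checked before scaling in. This is a sketch under stated assumptions: the zone layout is invented for illustration (only "highest-ordinal pod is alone in its zone" mirrors the incident above), and `min_per_zone` is a hypothetical floor rather than a Kubernetes setting.

```python
from collections import Counter


def scale_in_breaks_zone_floor(pod_zones: list[str], min_per_zone: int = 1) -> bool:
    """pod_zones[i] is the zone of the pod with ordinal i.

    Returns True if removing the highest-ordinal pod would leave its zone
    with fewer than min_per_zone pods.
    """
    if not pod_zones:
        raise ValueError("StatefulSet has no pods")
    victim_zone = pod_zones[-1]          # scale-in always targets the last ordinal
    remaining = Counter(pod_zones[:-1])  # pods left after the removal
    return remaining[victim_zone] < min_per_zone


# Hypothetical layout: ordinal 6 is the only pod in eu-central-1a.
zones = ["eu-central-1b", "eu-central-1c"] * 3 + ["eu-central-1a"]
print(scale_in_breaks_zone_floor(zones))       # True: 1a would be left empty
print(scale_in_breaks_zone_floor(zones[:-1]))  # False: 1c keeps two pods
```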

Alternatives

  • Pod-level anti-affinity + per-zone floors at the scheduling layer, so any ordinal is equally safe to remove.
  • Custom operator that selects eviction candidates by criteria other than ordinal. PlanetScale's Vitess Operator runs plain pods instead of StatefulSets specifically so that the operator can pick candidates by Vitess-aware logic; the StatefulSet ordinal semantic is given up intentionally because it was the wrong abstraction.
  • StatefulSet podManagementPolicy: Parallel changes startup/shutdown ordering but not the scale-in-picks-highest-ordinal rule.
  • StatefulSet updateStrategy is for rolling updates, not scale-in — it's orthogonal to this semantic.
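
To make the custom-operator alternative concrete, here is a minimal sketch of candidate selection by role, zone floor, and load instead of ordinal. It is not the Vitess Operator's actual logic: the `Pod` fields, the `load` metric, and the selection policy are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pod:
    name: str
    zone: str
    is_leader: bool
    load: float  # hypothetical load metric, higher means hotter


def pick_eviction_candidate(pods: list[Pod], min_per_zone: int = 1) -> Optional[Pod]:
    """Pick the coolest non-leader pod whose removal keeps its zone at the floor."""
    per_zone = Counter(p.zone for p in pods)
    safe = [
        p for p in pods
        if not p.is_leader and per_zone[p.zone] - 1 >= min_per_zone
    ]
    # None means no pod is safe to remove; the caller should refuse to scale in.
    return min(safe, key=lambda p: p.load, default=None)


pods = [
    Pod("db-0", "zone-a", is_leader=False, load=0.2),
    Pod("db-1", "zone-a", is_leader=False, load=0.5),
    Pod("db-2", "zone-b", is_leader=True, load=0.9),
    Pod("db-3", "zone-b", is_leader=False, load=0.4),
    Pod("db-4", "zone-c", is_leader=False, load=0.1),  # last pod, alone in zone-c
]
print(pick_eviction_candidate(pods).name)  # db-0
```

An ordinal-based rule would have removed db-4 and emptied zone-c; the sketch skips it, skips the leader, and takes the least-loaded pod among the rest.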

Interaction with ephemeral storage

The highest-ordinal semantic works well when ordinals are stably bound to something useful (e.g. an EBS volume pinned to an AZ). Under ephemeral storage with cross-zone drift, ordinals carry no such semantic guarantee, and the deterministic "highest first" rule picks an arbitrary pod that may happen to be structurally wrong to remove.

Seen in

  • — canonical wiki instance. es-data-production-v2-6, the highest-ordinal pod, was the only pod in eu-central-1a; StatefulSet scale-in targeted it; shard-allocation awareness refused to move its shards; the drain got stuck.