
CONCEPT

Zone-aware shard-allocation stuck drain

Definition

The zone-aware shard-allocation stuck drain is the failure mode in which an Elasticsearch node drain makes no forward progress because the node being drained is the only node carrying a given zone-awareness attribute value (typically: the only node in an availability zone), and shard-allocation awareness refuses to relocate its shards, because doing so would violate the per-zone spread invariant.

The drain enters a livelock: the orchestrator (es-operator, in the canonical wiki instance) repeatedly marks the node excluded via cluster.routing.allocation.exclude._ip and polls for shard movement; Elasticsearch repeatedly acknowledges the exclusion but refuses to move shards; the poll times out; the orchestrator retries — up to a configured retry ceiling (999 retries in es-operator). No amount of retries unsticks the situation because the constraint violation is structural, not transient.
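The livelock shape can be made concrete with a minimal sketch. This is an illustrative simplification, not es-operator's actual code: the `drainClient` interface and `stuckCluster` stub are hypothetical, standing in for the exclude-then-poll calls described above. When the allocation deciders structurally refuse relocation, every iteration observes the same shard count, so no retry ceiling helps:

```go
package main

import (
	"errors"
	"fmt"
)

// drainClient abstracts the two calls the loop needs.
// Hypothetical interface for illustration, not es-operator's real API.
type drainClient interface {
	ExcludeIP(ip string) error     // set cluster.routing.allocation.exclude._ip
	ShardsRemaining(ip string) int // _cat/shards filtered to the node
}

// drainNode retries until the node is empty or the retry ceiling is hit.
// A structural constraint violation means the shard count never changes.
func drainNode(c drainClient, ip string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		if err := c.ExcludeIP(ip); err != nil {
			return err
		}
		if c.ShardsRemaining(ip) == 0 {
			return nil // drained
		}
		// poll timed out with shards still on the node; retry
	}
	return errors.New("retry budget exhausted; shards never moved")
}

// stuckCluster simulates zone awareness refusing relocation: the
// exclusion is always acknowledged, but shards never move.
type stuckCluster struct{ shards int }

func (s *stuckCluster) ExcludeIP(string) error     { return nil }
func (s *stuckCluster) ShardsRemaining(string) int { return s.shards }

func main() {
	err := drainNode(&stuckCluster{shards: 12}, "10.0.1.7", 999)
	fmt.Println(err)
}
```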

Canonical wiki instance

Zalando Lounge, 2024-06-20. Pod-to-zone distribution on the night of the first anomaly:

es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a   ← alone in zone, next to scale in

StatefulSet scale-in semantics picked v2-6 as the next pod to remove. It was the only pod in eu-central-1a. Elasticsearch refused to move its shards out (doing so would put two copies of shards into a single remaining zone). The drain retry loop ran all night without forward progress.
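The dangerous pods are detectable before any drain starts. A minimal sketch, assuming a pod-name-to-zone mapping is available (the `lonelyPods` helper is hypothetical, not part of es-operator): any pod that is the sole member of its zone will trip zone-aware allocation if drained, exactly as v2-6 did.

```go
package main

import "fmt"

// lonelyPods returns the pods that are the only member of their zone.
// podZones maps pod name -> availability zone.
func lonelyPods(podZones map[string]string) []string {
	perZone := map[string]int{}
	for _, z := range podZones {
		perZone[z]++
	}
	var lonely []string
	for pod, z := range podZones {
		if perZone[z] == 1 {
			lonely = append(lonely, pod)
		}
	}
	return lonely
}

func main() {
	// The 2024-06-20 distribution from above.
	pods := map[string]string{
		"es-data-production-v2-0": "eu-central-1b",
		"es-data-production-v2-1": "eu-central-1c",
		"es-data-production-v2-2": "eu-central-1b",
		"es-data-production-v2-3": "eu-central-1c",
		"es-data-production-v2-4": "eu-central-1c",
		"es-data-production-v2-5": "eu-central-1c",
		"es-data-production-v2-6": "eu-central-1a",
	}
	fmt.Println(lonelyPods(pods)) // only v2-6 is alone in its zone
}
```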

"Here though, the node to be drained is the only one located in eu-central-1a. Due to our zone awareness configuration, Elasticsearch refused to relocate the shards in it. Es-operator has quite simple logic here: It requests for shards to be relocated, check whether it happened and keep retrying for 999 times before giving up."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Detection

Primary signals:

  • Drain duration far exceeding the typical shard-relocation time, on a specific pod, persistently.
  • cluster.routing.allocation.exclude._ip contains the pod's IP; yet _cat/shards still reports shards on that pod.
  • _cluster/allocation/explain on one of the stuck shards returns a zone-awareness-related "allocation decider" reason ("there are too many copies of the shard allocated to nodes with attribute zone=X").
  • Zone distribution of remaining pods: exactly one pod in the draining pod's zone.
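The third signal can be checked directly against the allocation-explain API. A sketch of the request and an abbreviated, paraphrased response, assuming a stuck replica shard 0 of an index named `products` (index name and shard number are illustrative; only the relevant decider is shown):

```json
GET _cluster/allocation/explain
{
  "index": "products",
  "shard": 0,
  "primary": false
}
```

```json
{
  "can_remain_on_current_node": "yes",
  "can_move_to_other_node": "no",
  "node_allocation_decisions": [
    {
      "deciders": [
        {
          "decider": "awareness",
          "decision": "NO",
          "explanation": "there are too many copies of the shard allocated to nodes with attribute [zone] ..."
        }
      ]
    }
  ]
}
```

A `decider: awareness` entry with `decision: NO` on every candidate node is the structural-refusal fingerprint; a transient refusal (disk watermarks, throttling) names a different decider.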

Mitigations

Plan-level (preferred):

  • Keep a per-zone floor on the scaled-in shape — never let scale-in reduce a zone below 1 pod (or more conservatively, below 2 pods). For a 3-AZ cluster with 6-pod nightly floor, this means enforcing 2-2-2 distribution by construction (either via the scaling plan or via affinity/anti-affinity rules), not hoping for it.
  • Zalando Lounge's "first fix" was a special case of this: bump the nightly floor by one so the next pod to drain is no longer alone in its zone. See patterns/scheduled-cron-based-scaling for the broader pattern; this incident is the canonical wiki instance of its per-zone-floor constraint.
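The per-zone floor is a one-line invariant a planner can enforce before it ever picks a victim. A hedged sketch (the `canScaleIn` helper is hypothetical, not es-operator code), assuming the planner knows the current pod count per zone:

```go
package main

import "fmt"

// canScaleIn reports whether removing one pod from victimZone keeps
// every zone at or above the per-zone floor. With floor >= 1 the
// planner never drains the last pod in a zone, so the stuck drain
// cannot arise by construction.
func canScaleIn(zonePods map[string]int, victimZone string, floor int) bool {
	return zonePods[victimZone]-1 >= floor
}

func main() {
	even := map[string]int{"eu-central-1a": 2, "eu-central-1b": 2, "eu-central-1c": 2}
	fmt.Println(canScaleIn(even, "eu-central-1a", 1)) // true: 2 -> 1 stays at floor

	skewed := map[string]int{"eu-central-1a": 1, "eu-central-1b": 2, "eu-central-1c": 4}
	fmt.Println(canScaleIn(skewed, "eu-central-1a", 1)) // false: would empty the zone
}
```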

Config-level:

  • If a brief zone-spread violation is acceptable, relax shard-allocation awareness temporarily (cluster.routing.allocation.awareness.* settings). Not recommended for production because it defeats the purpose of enabling awareness in the first place.
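If awareness is managed as a dynamic cluster setting, the temporary relaxation looks roughly like the following sketch (clearing the attribute list disables the awareness decider; the exact key and restore value depend on how awareness was originally configured, assumed here to be `zone`):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": ""
  }
}
```

Restore it to `"zone"` the moment the drain completes; every minute it stays cleared, new allocations may concentrate shard copies in a single zone.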

Structural:

  • Use zone-pinned storage (EBS) so StatefulSet ordinals are bound to AZs deterministically. The planner then knows a priori which ordinal is alone in which zone and can refuse to scale in below the floor. Ephemeral storage loses this predictability — see concepts/ephemeral-storage-cross-zone-drift.

Interaction with ctx-cancellation bugs

The stuck drain per se is a plan-level bug (the scale-in floor is too aggressive) that would normally resolve itself when the morning scale-out adds pods back. The 2024-06-20 incident was worse because it composed with a ctx-cancellation bug in the retry loop: a new EDS update (the morning scale-out) arrived while the loop was still retrying the stuck drain, but the loop did not observe ctx.Done, so the operator could not reconcile toward the new shape until the retry budget was finally exhausted or an iteration happened to succeed.

And once the drain was interrupted, zombie exclusion-list state carried the damage into the next day's scale-in cycle.

Seen in
