ZALANDO 2024-06-20 Tier 2

Failing to Auto Scale Elasticsearch in Kubernetes

Summary

Zalando's Lounge team runs an Elasticsearch cluster on Kubernetes to serve user-facing article descriptions under a ~3× morning traffic spike. They use cron-based scheduled scaling (separate cronjobs for scale-out and scale-in) that manipulates an ElasticsearchDataSet (EDS) custom resource managed by their open-source es-operator. Over three consecutive mornings they hit the same alert — "too few running Elasticsearch nodes" — tripped by the same class of bug: the nightly scale-in got stuck trying to drain the last remaining pod in one AWS availability zone, blocked by shard-allocation awareness, which refused to violate the zone-spread invariant. Two distinct es-operator bugs were uncovered: (1) the drain retry loop ignored context cancellation, so when the scale-out EDS update arrived in the morning, es-operator was supposed to abort the in-flight scale-in and reconcile toward the new desired state, but instead it kept retrying the drain; (2) when a drain attempt is interrupted before the cleanup phase, the pod's IP remains in Elasticsearch's cluster.routing.allocation.exclude._ip list forever — a zombie exclusion-list entry — causing the next morning's drain to misidentify which pod is the sole usable one in the zone. A Kubernetes 1.28 upgrade the day before the first incident changed pod scheduling across zones, turning a latent bug into a real incident. Two fixes: a merged retry-loop fix for the context-cancellation bug, and a WIP PR for the cleanup-on-interruption bug. Embarrassing coda: a third morning alert came from a separate, experimental scale-down cronjob that had been missed in the "quick fix" sweep — an organizational-scope bug rather than a code bug. Closing lesson: "Read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough."

Key takeaways

  1. Scheduled-cron scaling + zone-aware shards can deadlock at the boundary. Lounge runs 6 pods during the night, scales out to 7+ in the morning. The StatefulSet removes the highest-ordinal pod on scale-in. When that pod happens to be the only one in an AZ, Elasticsearch's zone-aware shard-allocation refuses to relocate its shards (doing so would violate the per-zone spread invariant), and the drain hangs. "Es-operator has quite simple logic here: It requests for shards to be relocated, check whether it happened and keep retrying for 999 times before giving up." (Source: this page)
  2. Context cancellation is a correctness property, not a nice-to-have. The intended es-operator contract: "If, during that process, EDS gets changed one more time, es-operator should abort the process and start modifying to cluster to match the new desired state." The actual bug: "in this one specific retry loop, context cancellations are not reacted on." Scale-out for the morning could not preempt a stuck scale-in from the previous night — because one retry loop was deaf to ctx.Done. Fix: es-operator PR #405. (Source: this page)
  3. A Kubernetes control-plane upgrade can change pod-to-zone distribution. "on Monday, the day before the first anomaly, our Kubernetes cluster was upgraded to version 1.28. This process likely has affected the pod scheduling across nodes in a different availability zone" — the upgrade's reschedules produced the uneven zone distribution that put a single pod alone in eu-central-1a. The authors note they did not do a full deep-dive to confirm the mechanism. (Source: this page)
  4. Ephemeral storage + zone-spread invariant is a non-guarantee. "If that StatefulSet was using an EBS backed volume, Kubernetes would guarantee to not move them between zones. We, however, don't store unrecoverable data in our Elasticsearch, thus we can afford to run it on top of ephemeral storage. Nothing is strictly guaranteed for us then." Trading durability for cost buys pod re-balance freedom, which becomes a liability when drains assume stable zone membership. (Source: this page)
  5. Interrupted multi-step state-mutation leaves zombie state. The es-operator drain protocol: (a) mark pod excluded via cluster.routing.allocation.exclude._ip, (b) wait for shards to relocate out, (c) remove the pod, (d) clean up the exclusion list. If the process is interrupted after (a) but before (d), the IP stays excluded forever; the next drain cycle misreads cluster state — a pod that looks eligible to hold shards is actually ignored by Elasticsearch. "es-data-production-v2-6, which failed to scale in the day before, was still marked as excluded and Elasticsearch was unwilling to store any data in it. In effect, es-data-production-v2-7 was the only usable node in eu-central-1a." (Source: this page)
  6. Adding an if-clause is not a fix for partial-failure bugs. "Just adding a special if clause for cleaning up in case of cancellation would solve the simple instance of this problem. But we are potentially dealing with partial failure here. Any amount of if clauses wouldn't solve the problem when the es-operator crashes in the middle of the draining process." The correct fix pattern is a cleanup phase that runs on reconcile, not on interruption signal — the operator pattern's own idempotent-convergence primitive. A WIP PR was open at publish time. (Source: this page)
  7. The org-level blast radius bug is often the last bug. The "quick fix" (increase nightly floor by 1 pod) touched the main scale-down cronjob but missed an experimental project's scale-down cronjob. Morning 3 alert fired from the forgotten cronjob. Lesson: an enumerated list of schedule-based scaling triggers is an organizational artifact, not a code artifact — bugs there are caught by ops hygiene, not by compilers. (Source: this page)
  8. "Read the code" as the load-bearing lesson. Closing sentence: "Read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough. The details matter, and the code is the final documentation for those. It also mercilessly reveals any bugs that lurk around." The intended es-operator reconcile semantics (abort-on-spec-change, cleanup-on-completion) were correct in principle; the bugs were hidden in one retry loop's cancellation handling and one drain path's missing idempotent cleanup. No amount of abstract reasoning about the intended model would have surfaced either bug. (Source: this page)

Architecture

  • Workload: Lounge Elasticsearch cluster, user-facing article descriptions, 6 nodes at night, 7+ in the morning, 3× morning traffic spike.
  • Deployment: es-operator manages a custom resource ElasticsearchDataSet (EDS); operator materialises EDS as a Kubernetes StatefulSet; pods spread across AWS AZs (eu-central-1a, eu-central-1b, eu-central-1c).
  • Storage: ephemeral (no EBS), trading durability guarantees for zone-rebalance freedom.
  • Scaling: schedule-based, via a "fairly complex set of cronjobs that change the number of nodes by manipulating the EDS for our cluster. There's separate cronjobs for scaling up at various times of day and scaling down at other times of day."
  • Shard placement: Elasticsearch's zone-aware shard-allocation enabled — shards spread across AZs, allocation refuses to violate that invariant.
  • StatefulSet scale-in semantics: highest-ordinal pod removed first (es-data-production-v2-6, then v2-5, …).

Pod-to-zone distribution at the first anomaly:

es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a   ← alone in zone, next to scale in
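The hang condition in that table can be checked mechanically. A small Go sketch (a hypothetical helper, not part of es-operator) that flags when the highest-ordinal pod, the one StatefulSet scale-in removes first, is alone in its zone:

```go
package main

import "fmt"

// soleInZone reports whether the pod a StatefulSet scale-in would remove
// first (the highest ordinal) is the only pod in its availability zone,
// the condition under which zone-aware shard allocation refuses to
// relocate its shards and the drain hangs. podZones is indexed by ordinal.
func soleInZone(podZones []string) bool {
	last := podZones[len(podZones)-1]
	n := 0
	for _, z := range podZones {
		if z == last {
			n++
		}
	}
	return n == 1
}

func main() {
	// The distribution at the first anomaly: v2-6 alone in eu-central-1a.
	zones := []string{"eu-central-1b", "eu-central-1c", "eu-central-1b",
		"eu-central-1c", "eu-central-1c", "eu-central-1c", "eu-central-1a"}
	fmt.Println(soleInZone(zones)) // true
}
```

A check like this would only catch the simple case; as the article notes, stale exclusion-list entries can make a pod that appears to share a zone effectively unusable anyway.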

Es-operator drain protocol (as uncovered)

  1. Mark pod excluded: set cluster.routing.allocation.exclude._ip to include the pod's IP.
  2. Poll Elasticsearch: are shards still on this pod?
  3. If yes, go to (2) (up to 999 retries).
  4. Remove the pod from the StatefulSet.
  5. Clean up: remove the pod's IP from cluster.routing.allocation.exclude._ip.

Bug 1 (retry loop at step 2/3): ctx.Done not observed, so a new EDS update arriving during the loop does not abort the drain.

Bug 2 (cleanup phase at step 5): the protocol assumes steps (1) → (5) run to completion. If the process is interrupted after step 1 but before step 5 — by cancellation, crash, or es-operator restart — the exclusion list is never cleaned up. On the next drain cycle, the stale exclusion causes Elasticsearch to treat an apparently-healthy pod as unavailable.
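The reconcile-style fix the authors advocate in takeaway 6 can be sketched in Go: recompute the exclusion list from observed state on every reconcile pass, so an interruption anywhere just leaves work for the next pass. The function shape is illustrative, not the WIP PR's actual code:

```go
package main

import "fmt"

// reconcileExclusions recomputes the exclude._ip list from current state on
// every reconcile pass, instead of relying on each drain to reach its own
// cleanup step. It is idempotent: a crash or cancellation at any point
// leaves a stale entry that the next pass simply drops.
func reconcileExclusions(current []string, draining map[string]bool) []string {
	keep := make([]string, 0, len(current))
	for _, ip := range current {
		if draining[ip] {
			keep = append(keep, ip) // a drain for this pod is genuinely in flight
		}
		// otherwise it is a zombie entry from an interrupted drain; drop it
	}
	return keep
}

func main() {
	// v2-6's IP is still excluded from yesterday, but nothing is draining it.
	fresh := reconcileExclusions([]string{"10.0.1.6", "10.0.1.7"},
		map[string]bool{"10.0.1.7": true})
	fmt.Println(fresh) // [10.0.1.7]
}
```

This is the operator pattern's own idempotent-convergence primitive applied to the exclusion list: correctness comes from re-deriving desired state, not from every code path finishing its cleanup.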

Operational numbers / scale

  • Cluster size: 6 nodes at night, 7+ in the morning.
  • Traffic: ~3× normal load during morning busy hour.
  • Retry loop: "keep retrying for 999 times before giving up" — on drain.
  • Timing of accidental recovery: es-operator retries continued through the night; the morning scale-out EDS update freed the drain two minutes after the on-call alert, by coincidence.
  • K8s upgrade: 1.28, day before the first anomaly.
  • Incident morning count: 3 consecutive mornings, same alert.

Caveats / gaps

  • K8s 1.28 upgrade root cause not confirmed. Authors state: "we have not done a full deep dive into the upgrade process to confirm this." So the link between the upgrade and the zone-distribution change is plausible-but-unverified.
  • No numbers on data volume, shard count, or query QPS — the incident shape is the subject, not capacity planning.
  • Bug 2 (cleanup-on-interruption) unresolved at publish time. PR #423 is marked "in progress"; Zalando currently accepts manual remediation for the zombie-exclusion-list case.
  • No discussion of alternative remediations — e.g. running on EBS to pin pods to zones, or adding a reconcile-on-startup pass that cleans up stale exclusion entries. Zalando's chosen path is to fix es-operator code plus add a floor node per AZ.
  • Third-morning cronjob missed by quick fix — the authors call this out explicitly as "embarrassing" and a "trivial mistake, but enough to cause a bit of organisational hassle." No post-mortem of why the experimental cronjob was forgotten (ownership drift, discovery tooling, documentation).
