es-operator

es-operator is Zalando's open-source Kubernetes operator for running Elasticsearch. It defines an ElasticsearchDataSet (EDS) Custom Resource, watches it for changes, and reconciles the cluster state by managing a Kubernetes StatefulSet (pods + PersistentVolumes) underneath.

Source: https://github.com/zalando-incubator/es-operator.

Shape

Kubernetes Operator pattern on top of a StatefulSet:

  • CRD: ElasticsearchDataSet — describes desired Elasticsearch cluster shape (replicas, resource requests, config).
  • Reconciler: watches EDS; materialises a StatefulSet; handles scale-in by draining nodes (relocate shards → remove pod → clean up Elasticsearch's exclusion list) and scale-out by extending the StatefulSet.
  • Intended contract: "if, during that process, EDS gets changed one more time, es-operator should abort the process and start modifying to cluster to match the new desired state." (Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Contrast with PlanetScale's Vitess Operator (plain pods + direct PVC, no StatefulSet): es-operator uses the StatefulSet abstraction rather than replacing it. The StatefulSet provides stable pod names and ordered scale-in (highest ordinal removed first); es-operator layers Elasticsearch-aware drain orchestration on top.

Drain protocol (reconstructed from the 2024-06-20 incident)

  1. Mark pod excluded: set cluster.routing.allocation.exclude._ip to include the pod's IP.
  2. Poll Elasticsearch: are shards still on this pod?
  3. If yes, go to (2) — up to 999 retries.
  4. Remove the pod from the StatefulSet.
  5. Clean up: remove the pod's IP from cluster.routing.allocation.exclude._ip.
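The five steps above can be sketched as a single Go function. This is a hedged reconstruction, not es-operator's actual code: the `esClient` method names, the `fakeES` stub, and the short `pollInterval` are all illustrative; only the 999-retry cap and the step ordering come from the source.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxRetries   = 999                   // retry cap from step 3
	pollInterval = 10 * time.Millisecond // shortened here; the real operator polls far slower
)

// esClient abstracts the three Elasticsearch calls the drain needs.
// Method names are illustrative, not es-operator's actual API.
type esClient interface {
	ExcludePodIP(ip string) error       // step 1: add IP to cluster.routing.allocation.exclude._ip
	ShardsOnPod(ip string) (int, error) // step 2: shards still on the pod
	RemoveExcludedIP(ip string) error   // step 5: clean the exclusion list
}

func drainPod(es esClient, podIP string, removePod func() error) error {
	if err := es.ExcludePodIP(podIP); err != nil { // step 1
		return err
	}
	drained := false
	for i := 0; i < maxRetries && !drained; i++ { // steps 2–3
		n, err := es.ShardsOnPod(podIP)
		if err != nil {
			return err
		}
		if n == 0 {
			drained = true
		} else {
			time.Sleep(pollInterval)
		}
	}
	if !drained {
		return fmt.Errorf("pod %s still holds shards after %d retries", podIP, maxRetries)
	}
	if err := removePod(); err != nil { // step 4: shrink the StatefulSet
		return err
	}
	return es.RemoveExcludedIP(podIP) // step 5
}

// fakeES simulates Elasticsearch relocating one shard per poll once the pod is excluded.
type fakeES struct {
	excluded bool
	shards   int
}

func (f *fakeES) ExcludePodIP(ip string) error { f.excluded = true; return nil }
func (f *fakeES) ShardsOnPod(ip string) (int, error) {
	if f.excluded && f.shards > 0 {
		f.shards--
	}
	return f.shards, nil
}
func (f *fakeES) RemoveExcludedIP(ip string) error { f.excluded = false; return nil }

func main() {
	es := &fakeES{shards: 3}
	err := drainPod(es, "10.0.0.1", func() error { return nil })
	fmt.Println("drained:", err == nil, "exclusion list clean:", !es.excluded)
}
```

Note how the sketch already makes the incident's failure modes visible: steps 1 and 5 mutate external Elasticsearch state, so anything that interrupts the function between them leaves the exclusion list dirty.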

Two incident bugs (2024-06)

The sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes|2024-06-20 Zalando Lounge post disclosed two bugs that three consecutive morning alerts uncovered:

  • Bug 1 — ctx-cancellation ignored in the retry loop at step 2/3. When a new EDS update arrives (e.g. the morning scale-out while the nightly scale-in is still retrying), the operator should abort and reconcile toward the new desired state. Instead it kept retrying the drain. See concepts/context-cancellation-ignored-in-retry-loop. Fixed in PR #405.
  • Bug 2 — cleanup phase is not idempotent under interruption. If the drain is interrupted between step 1 and step 5 (by cancellation or es-operator crash), the pod's IP stays in Elasticsearch's exclusion list forever — a zombie exclusion-list entry. The next drain misreads cluster state and picks the wrong "last pod in zone." WIP fix in PR #423; at publish time Zalando accepts manual remediation.

The bugs compose: Bug 1 causes an interrupted drain, Bug 2 leaves the exclusion list dirty, and the next night's drain fails in a different shape. The post is explicit that a point-fix ("just adding a special if clause for cleaning up in case of cancellation") would not suffice because "we are potentially dealing with partial failure here. Any amount of if clauses wouldn't solve the problem when the es-operator crashes in the middle of the draining process." The correct fix is an idempotent cleanup on reconcile rather than a cancel-branch handler — the operator pattern's own convergence primitive. See patterns/cleanup-phase-survives-interruption.
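The reconcile-time fix the post argues for can be sketched as a level-triggered cleanup: on every reconcile, recompute the exclusion list from scratch and overwrite it, so stale entries from any interrupted drain disappear with no cancel-specific branch. This is a hypothetical helper illustrating the idea, not PR #423's code:

```go
package main

import (
	"fmt"
	"sort"
)

// reconcileExclusions computes, on every reconcile, the exclusion list
// Elasticsearch should have: exactly the IPs of pods the operator is
// currently draining. Entries in the observed list that match no draining
// pod — zombies left by a cancellation or an operator crash — come back as
// stale and are simply dropped when `want` is written. Because the output
// depends only on desired state, applying it repeatedly changes nothing:
// the cleanup is idempotent and survives any partial failure.
func reconcileExclusions(observed []string, draining map[string]bool) (want, stale []string) {
	for _, ip := range observed {
		if !draining[ip] {
			stale = append(stale, ip)
		}
	}
	for ip := range draining {
		want = append(want, ip)
	}
	sort.Strings(want)
	sort.Strings(stale)
	return want, stale
}

func main() {
	// 10.0.0.5 is a zombie entry from a drain that was interrupted mid-flight.
	want, stale := reconcileExclusions(
		[]string{"10.0.0.5", "10.0.1.7"},
		map[string]bool{"10.0.1.7": true},
	)
	fmt.Println("write:", want, "dropped zombies:", stale)
}
```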

Interaction with scheduled-cron scaling

es-operator is the materialisation layer under Zalando Lounge's schedule-based scaling setup: cronjobs mutate the EDS to change replicas; es-operator reconciles. The interaction pattern is specifically what exposed Bug 1 — two EDS updates in flight at once (the nightly scale-in that got stuck, and the morning scale-out arriving while it was still retrying).

Interaction with zone-aware shard-allocation

The Lounge cluster is configured with Elasticsearch's shard-allocation awareness across three AZs. When the next pod to drain happens to be the only one in an AZ, Elasticsearch refuses to relocate its shards (doing so would violate zone-spread). es-operator's drain retry loop then cannot make forward progress for the full 999 attempts — which is how the nightly scale-in got stuck. This is an inherent tension rather than an es-operator bug: the operator can't force a shard move Elasticsearch refuses. The mitigation is at the scaling-plan level (set the nightly floor to strictly more than one pod per AZ).
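The zone constraint and its mitigation can both be expressed as small predicates: refuse to drain a pod that is the last one in its AZ, and size the nightly floor so that situation never arises. A sketch at the scaling-plan level, not es-operator code; the zone names are illustrative:

```go
package main

import "fmt"

// safeToDrain reports whether removing one pod from zone still leaves at
// least one pod there. If not, Elasticsearch's allocation awareness will
// refuse to relocate the pod's shards and the drain retry loop can never
// make forward progress (the stuck nightly scale-in from the incident).
func safeToDrain(podsPerZone map[string]int, zone string) bool {
	return podsPerZone[zone] > 1
}

// nightlyFloor is the scaling-plan mitigation: with pods spread evenly,
// keeping strictly more than one pod per AZ means a floor of 2 per zone.
func nightlyFloor(zones int) int {
	return 2 * zones
}

func main() {
	spread := map[string]int{"eu-central-1a": 1, "eu-central-1b": 2, "eu-central-1c": 2}
	fmt.Println("drain last pod in 1a:", safeToDrain(spread, "eu-central-1a")) // false: drain would get stuck
	fmt.Println("floor for 3 AZs:", nightlyFloor(3))
}
```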
