Shard allocation awareness

Definition

Shard allocation awareness is Elasticsearch's mechanism for telling the cluster about a physical distribution attribute of each node — typically availability zone, rack, or host — and instructing the shard allocator to spread each index's primaries and replicas across distinct attribute values. The cluster refuses to place two copies of the same shard on nodes sharing an attribute value; if doing so is the only option, the allocator leaves shards unassigned rather than violating the invariant.
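Concretely, awareness is enabled by tagging each node with a custom attribute and naming that attribute in a dynamic cluster setting. A minimal sketch — the attribute name `zone` and the value shown are illustrative:

```yaml
# elasticsearch.yml on each data node: advertise which zone this node is in.
node.attr.zone: eu-central-1a

# Cluster-wide setting (dynamic; can also be set via the cluster settings API):
# spread copies of each shard across distinct values of the "zone" attribute.
cluster.routing.allocation.awareness.attributes: zone
```

Forced awareness (`cluster.routing.allocation.awareness.force.zone.values`) additionally stops the cluster from piling all replicas onto surviving zones when one zone is down.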

Documented at Elasticsearch: shard-allocation-awareness.

Purpose

Bound blast radius: losing one AZ (or rack, or host) removes at most one copy of each shard, so the cluster stays available and search results stay complete as long as replicas elsewhere survive. This is the canonical operational reason to enable awareness on any multi-AZ Elasticsearch deployment.

The invariant cuts both ways

The same invariant that protects you during zone failure refuses to help you move shards out of the last node in a zone during a drain. If a node is alone in its zone and the operator asks Elasticsearch to relocate its shards elsewhere, Elasticsearch declines — relocating those shards would put two copies of the same shard in a single remaining zone, violating the zone-spread constraint.

Canonical wiki instance: Zalando Lounge's 2024-06-20 incident, where the es-operator drain got stuck on es-data-production-v2-6 — the only pod in eu-central-1a:

"Here though, the node to be drained is the only one located in eu-central-1a. Due to our zone awareness configuration, Elasticsearch refused to relocate the shards in it."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

The drain retry loop then consumes its full 999-attempt budget without making any progress, until the next morning's scale-out adds pods back and unsticks the situation by coincidence. See concepts/zone-aware-shard-allocation-stuck-drain for the specific failure-mode write-up.
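When a drain stalls like this, the cluster allocation explain API reports which allocation decider is blocking the move — the `awareness` decider names the violated zone-spread constraint. A sketch; the index name and endpoint are illustrative:

```shell
# Ask the cluster why a specific shard copy cannot be allocated or moved.
# The response lists each allocation decider's verdict; "awareness" is the
# one that refuses cross-zone placement of a second copy.
curl -s -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}'
```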

Implication for scale-in planning

Schedule-based scale-in plans must keep at least one pod per zone after scale-in, or — more conservatively — at least two pods per zone, so that the highest-ordinal pod slated for removal is never the sole pod in its zone. Zalando Lounge's "quick fix" after the first morning was exactly this: bump the nightly floor so that the next pod to remove was no longer alone in its zone.
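The "never the sole pod in its zone" rule can be checked mechanically before committing a scale-in step. A minimal sketch in Python — the pod names, zone labels, and data model are illustrative, not the es-operator's:

```python
from collections import Counter

def safe_to_remove(pod_zones: dict[str, str], pod: str) -> bool:
    """Return True if removing `pod` still leaves at least one pod in its zone.

    pod_zones maps pod name -> zone. This is a planning-time check only:
    it assumes the pod-to-zone assignment is known and stable.
    """
    zone = pod_zones[pod]
    # If fewer than two pods share this zone, `pod` is the last one there
    # and draining it would strand zone-aware shards.
    return Counter(pod_zones.values())[zone] >= 2

# Example mirroring the incident shape: the highest ordinal is alone in its zone.
pods = {
    "es-data-4": "eu-central-1b",
    "es-data-5": "eu-central-1c",
    "es-data-6": "eu-central-1a",  # sole pod in eu-central-1a
}
```

A schedule-based planner would run this check against the pod-to-zone assignment expected after each removal — which is only knowable in advance under stable, volume-pinned assignment.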

The planning gets harder when pod-to-zone assignment isn't stable (e.g. ephemeral storage rather than zone-pinned EBS volumes — see concepts/ephemeral-storage-cross-zone-drift). Under stable assignment (each StatefulSet ordinal bound to a zone by an EBS volume), the planner can know a priori which ordinal is alone in which zone. Under unstable assignment, a Kubernetes control-plane upgrade can re-spread pods and invalidate the plan — which is exactly what happened at Zalando Lounge when the cluster was upgraded to K8s 1.28 the day before the first incident.

Interaction with ephemeral storage

The tension: ephemeral storage gives pods freedom to move across zones (zone rebalancing is allowed); shard-allocation awareness wants pods not to move across zones once shards are placed. The two assumptions collide at drain time — the mobility property that justifies ephemeral storage is the same property that makes the "last pod in zone" set unpredictable between scale-in events. See concepts/ephemeral-storage-cross-zone-drift.
