CONCEPT Cited by 1 source

Ephemeral storage cross-zone drift

Definition

Ephemeral storage cross-zone drift is the property that a StatefulSet pod on ephemeral storage — emptyDir, local-instance SSDs, generic ephemeral volumes — is not pinned to any availability zone by its volume and can therefore be rescheduled into a different AZ on pod restart, node drain, node replacement, or Kubernetes control-plane upgrade. In contrast, a StatefulSet pod backed by a zone-scoped block volume (e.g. AWS EBS) is zone-pinned: Kubernetes guarantees the pod is rescheduled into the same zone as its volume, because cross-zone attachment is physically impossible.

Drift makes "pod X is in zone Y" a time-varying property rather than a construction property.

Canonical wiki framing

From the Zalando Lounge 2024-06-20 post:

"If that StatefulSet was using an EBS backed volume, Kubernetes would guarantee to not move them between zones. We, however, don't store unrecoverable data in our Elasticsearch, thus we can afford to run it on top of ephemeral storage. Nothing is strictly guaranteed for us then. Normally, pods remain quite stable in a zone nevertheless, but on Monday, the day before the first anomaly, our Kubernetes cluster was upgraded to version 1.28. This process likely has affected the pod scheduling across nodes in a different availability zone, though we have not done a full deep dive into the upgrade process to confirm this."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Lounge's Elasticsearch cluster holds recoverable data (article descriptions sourced from upstream systems); losing a pod's local disk just triggers a shard rebuild from peers. This is the case for which ephemeral storage makes economic sense — you don't pay for persistent block storage when you don't need its durability. But the cost of that choice is pod-to-zone drift.

Why drift matters for zone-aware operations

Two consecutive observations at Lounge:

  • Pre-K8s-upgrade: pods were "quite stable in a zone." The nightly scale-in plan was implicitly calibrated against the stable distribution.
  • Post-K8s-1.28-upgrade: the upgrade process reshuffled pod placement; the new distribution had exactly one pod alone in eu-central-1a, reproducing the stuck-drain failure mode.

The scaling plan was implicitly dependent on a stable distribution it didn't make explicit. Drift broke the plan. Under zone-pinned EBS, this class of drift doesn't happen — the pod-to-zone map is fixed at volume-provisioning time. Under ephemeral storage, the map is scheduler-decided and can change on any reschedule event.
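A drift check of this kind can be made explicit before any scale-in runs. A minimal sketch, assuming the operator can observe the current pod-to-zone map (pod names, zone labels, and the `es-N` naming below are illustrative, not from the source):

```python
from collections import Counter

def zones_at_risk(pod_zone, floor=1):
    """Given the observed pod -> zone map (a time-varying property under
    ephemeral storage, not a construction property), return the zones whose
    pod count is at or below the per-zone floor. Draining a pod from any
    such zone would push it below the floor."""
    counts = Counter(pod_zone.values())
    return {zone for zone, n in counts.items() if n <= floor}

# Post-upgrade drifted distribution: one pod alone in eu-central-1a,
# mirroring the failure mode described above.
drifted = {
    "es-0": "eu-central-1b",
    "es-1": "eu-central-1b",
    "es-2": "eu-central-1c",
    "es-3": "eu-central-1c",
    "es-4": "eu-central-1a",
}
```

Running this check nightly, before the scale-in plan executes, turns the implicit calibration against a stable distribution into an explicit precondition.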

The tradeoff surface

| Storage | Zone-pinned | Data durable | Cost | Drain ergonomics |
|---|---|---|---|---|
| EBS | Yes | Yes | Pay per GB-month + IOPS | Pod always returns to the same zone; the "last pod in zone" set is stable |
| Local NVMe | Partially (the instance lives in one zone, but the pod can be rescheduled to any instance) | No | Included in the instance price | Re-spreads across AZs on reschedule |
| emptyDir / ephemeral-storage | No | No | Free-ish | Drifts on every reschedule |

The choice is load-bearing for schedule-based scale-in on zone-aware workloads. If pods can drift across zones, the scale-in plan must be written defensively: either against the worst-case pod distribution, or by dynamically computing which pod is safe to drain next rather than relying on StatefulSet ordinal order.
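The worst-case-distribution approach admits a small sizing calculation, assuming even spread is enforced (e.g. via topologySpreadConstraints with maxSkew=1) — an assumption, since unconstrained drift could in principle pile all pods into one zone:

```python
def min_replicas_for_zone_floor(num_zones, per_zone_floor):
    """Smallest replica count n such that an even spread (counts per zone
    differ by at most 1, i.e. maxSkew=1) guarantees every zone keeps at
    least per_zone_floor pods. With maxSkew=1 the minimum per-zone count
    for n replicas is n // num_zones, so we need n >= floor * zones."""
    return per_zone_floor * num_zones
```

For example, keeping at least one pod in each of three zones needs a nightly floor of at least 3 under even spread; keeping two per zone needs 6.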

Operational mitigations under drift

  • Per-zone floor enforced dynamically, not by construction. Before draining ordinal N, the operator checks: "if I drop this pod, does any zone go to 0?" Refuses if yes.
  • topologySpreadConstraints / pod anti-affinity: force even spread across zones via Kubernetes scheduling rules rather than hoping the cluster autoscaler does the right thing. Effective but adds scheduling complexity.
  • Drain-candidate selection by zone safety, not ordinal: the scale-in operation picks the next pod to remove by checking "which removal preserves the per-zone floor," rather than always taking the highest ordinal. This diverges from StatefulSet's highest-ordinal semantic, so it requires custom operator logic.

Zalando Lounge's post-incident mitigation was the simplest of these: increase the nightly floor by 1 so that even under worst-case drift, no zone goes to 0.

Relation to container ephemerality

concepts/container-ephemerality is the container-level property: local state inside a container doesn't survive the container. This page is the scheduler-level consequence: when the storage doesn't pin the pod to infrastructure, the scheduler is free to reshape the pod-to-zone map.

Seen in

  • sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes — canonical wiki instance. Lounge chose ephemeral storage for its Elasticsearch cluster (recoverable data, cheaper); a K8s 1.28 upgrade drifted the pod-to-zone distribution; the next night's scale-in picked the pod that was now alone in eu-central-1a, triggering a stuck drain. Verbatim acknowledgement of the tradeoff: "Nothing is strictly guaranteed for us then."