CONCEPT Cited by 1 source

Ephemeral storage cross-zone drift

Definition

Ephemeral storage cross-zone drift is the property that a StatefulSet pod on ephemeral storage — emptyDir, local-instance SSDs, generic ephemeral volumes — is not pinned to any availability zone by its volume and can therefore be rescheduled into a different AZ on pod restart, node drain, node replacement, or Kubernetes control-plane upgrade. In contrast, a StatefulSet pod backed by a zone-scoped block volume (e.g. AWS EBS) is zone-pinned: Kubernetes guarantees the pod is rescheduled into the same zone as its volume, because cross-zone attachment is physically impossible.

Drift makes "pod X is in zone Y" a time-varying property rather than a construction property.

Canonical wiki framing

From the Zalando Lounge 2024-06-20 post:

"If that StatefulSet was using an EBS backed volume, Kubernetes would guarantee to not move them between zones. We, however, don't store unrecoverable data in our Elasticsearch, thus we can afford to run it on top of ephemeral storage. Nothing is strictly guaranteed for us then. Normally, pods remain quite stable in a zone nevertheless, but on Monday, the day before the first anomaly, our Kubernetes cluster was upgraded to version 1.28. This process likely has affected the pod scheduling across nodes in a different availability zone, though we have not done a full deep dive into the upgrade process to confirm this."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Lounge's Elasticsearch cluster holds recoverable data (article descriptions sourced from upstream systems); losing a pod's local disk just triggers a shard rebuild from peers. This is the case for which ephemeral storage makes economic sense — you don't pay for persistent block storage when you don't need its durability. But the cost of that choice is pod-to-zone drift.

Why drift matters for zone-aware operations

Two consecutive observations at Lounge:

  • Pre-K8s-upgrade: pods were "quite stable in a zone." The nightly scale-in plan was implicitly calibrated against the stable distribution.
  • Post-K8s-1.28-upgrade: the upgrade process reshuffled pod placement; the new distribution had exactly one pod alone in eu-central-1a, reproducing the stuck-drain failure mode.

The scaling plan was implicitly dependent on a stable distribution it didn't make explicit. Drift broke the plan. Under zone-pinned EBS, this class of drift doesn't happen — the pod-to-zone map is fixed at volume-provisioning time. Under ephemeral storage, the map is scheduler-decided and can change on any reschedule event.
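A drift check of this kind can be made explicit before any scale-in runs. A minimal sketch, assuming the operator can observe the current pod-to-zone map (pod names, zone labels, and the `es-N` naming below are illustrative, not from the source):

```python
from collections import Counter

def zones_at_risk(pod_zone, floor=1):
    """Given the observed pod -> zone map (a time-varying property under
    ephemeral storage, not a construction property), return the zones whose
    pod count is at or below the per-zone floor. Draining a pod from any
    such zone would push it below the floor."""
    counts = Counter(pod_zone.values())
    return {zone for zone, n in counts.items() if n <= floor}

# Post-upgrade drifted distribution: one pod alone in eu-central-1a,
# mirroring the failure mode described above.
drifted = {
    "es-0": "eu-central-1b",
    "es-1": "eu-central-1b",
    "es-2": "eu-central-1c",
    "es-3": "eu-central-1c",
    "es-4": "eu-central-1a",
}
```

Running this check nightly, before the scale-in plan executes, turns the implicit calibration against a stable distribution into an explicit precondition.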

The tradeoff surface

| Storage | Zone-pinned | Data durable | Cost | Drain ergonomics |
|---|---|---|---|---|
| EBS | Yes | Yes | Pay per GB-month + IOPS | Pod always returns to the same zone; the "last pod in zone" set is stable |
| Local NVMe | Partially (the instance lives in one zone, but the pod can be rescheduled to any instance) | No | Included in the instance price | Re-spreads across AZs on reschedule |
| emptyDir / ephemeral-storage | No | No | Free-ish | Drifts on every reschedule |

The choice is load-bearing for schedule-based scale-in on zone-aware workloads. If pods can drift across zones, the scale-in plan must be written defensively: either against the worst-case pod distribution, or by dynamically computing which pod is safe to drain next rather than relying on StatefulSet ordinal order.
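The worst-case-distribution approach admits a small sizing calculation, assuming even spread is enforced (e.g. via topologySpreadConstraints with maxSkew=1) — an assumption, since unconstrained drift could in principle pile all pods into one zone:

```python
def min_replicas_for_zone_floor(num_zones, per_zone_floor):
    """Smallest replica count n such that an even spread (counts per zone
    differ by at most 1, i.e. maxSkew=1) guarantees every zone keeps at
    least per_zone_floor pods. With maxSkew=1 the minimum per-zone count
    for n replicas is n // num_zones, so we need n >= floor * zones."""
    return per_zone_floor * num_zones
```

For example, keeping at least one pod in each of three zones needs a nightly floor of at least 3 under even spread; keeping two per zone needs 6.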

Operational mitigations under drift

  • Per-zone floor enforced dynamically, not by construction. Before draining ordinal N, the operator checks: "if I drop this pod, does any zone go to 0?" Refuses if yes.
  • topologySpreadConstraints / pod anti-affinity: force even spread across zones via Kubernetes scheduling rules rather than hoping the cluster autoscaler does the right thing. Effective but adds scheduling complexity.
  • Drain-candidate selection by zone safety, not ordinal: the scale-in operation picks the next pod to remove by checking "which removal preserves the per-zone floor," rather than always taking the highest ordinal. This diverges from StatefulSet's highest-ordinal semantic, so it requires custom operator logic.

Zalando Lounge's post-incident mitigation was the simplest of these: increase the nightly floor by 1 so that even under worst-case drift, no zone goes to 0.

Relation to container ephemerality

concepts/container-ephemerality is the container-level property: local state inside a container doesn't survive the container. This page is the scheduler-level consequence: when the storage doesn't pin the pod to infrastructure, the scheduler is free to reshape the pod-to-zone map.

Seen in

  • sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes — canonical wiki instance. Lounge chose ephemeral storage for its Elasticsearch cluster (recoverable data, cheaper); a K8s 1.28 upgrade drifted the pod-to-zone distribution; the next night's scale-in picked the pod that was now alone in eu-central-1a, triggering a stuck drain. Verbatim acknowledgement of the tradeoff: "Nothing is strictly guaranteed for us then."