Zombie exclusion-list state

Definition

Zombie exclusion-list state is the class of correctness bug in which a multi-step state mutation includes a mark-excluded-then-act-then-clean-up pattern, but the cleanup step is only executed on the success path. If the process is interrupted (by cancellation, crash, deploy, OOM-kill, or any mid-sequence abort) between "mark excluded" and "clean up," the excluded marker persists in shared state indefinitely — a zombie: apparently alive (still in the state store), effectively dead (no longer corresponds to a real operation).

The classic shape: a resource (pod, node, index, tenant, account) gets added to a "skip me" / "don't route to me" / "in maintenance" list in a shared coordinator, actioned upon, then expected to be removed from the list on completion. When an interruption drops the removal step, every future reader of the coordinator's state continues to treat the resource as "skip me" — but nobody knows it.

Zombie-list bugs are particularly insidious because the surface symptom is benign (the exclusion list still works correctly for the entries currently in it) and the data damage is invisible (a human running GET _cluster/settings sees an entry and can't tell which entries are live vs zombie without cross-referencing external state).

Canonical wiki instance

es-operator's drain protocol, as uncovered in the Zalando Lounge incident of 2024-06-20. The drain steps:

  1. Add pod's IP to Elasticsearch's cluster.routing.allocation.exclude._ip.
  2. Poll: have the pod's shards relocated out?
  3. Remove the pod from the StatefulSet.
  4. Clean up: remove the pod's IP from exclude._ip.

When the process is interrupted between (1) and (4) — by the ctx-cancellation bug in Zalando's case, but equally by an es-operator crash or restart — step 4 is skipped. Verbatim from the post:

"If the scaling down process gets interrupted, the clean up phase is never executed and the node stays in the exclusion list forever."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

The next night's drain encountered a stale exclusion list. The pod whose drain had been interrupted the previous day was still marked excluded, so Elasticsearch refused to place shards on it, leaving that pod effectively unusable. The planner's assumption "pod X is in zone Y and can host shards" was falsified by invisible zombie state.

Observed consequence: an alert on the second morning. The scale-in was from 8 → 7 pods; es-data-production-v2-7 was the next to drain; but v2-6 (drained the previous day, now zombie-excluded) was no longer functional, making v2-7 the de facto only usable pod in eu-central-1a, reproducing the stuck-drain failure mode from the night before.

The "add an if clause" anti-fix

A natural instinct upon reading the post-mortem: "just add cleanup in the cancel handler." The post is explicit that this is not sufficient:

"Just adding a special if clause for cleaning up in case of cancellation would solve the simple instance of this problem. But we are potentially dealing with partial failure here. Any amount of if clauses wouldn't solve the problem when the es-operator crashes in the middle of the draining process."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

Adding a cancel-branch only covers graceful interruption. The full failure surface is partial-failure: cancellation, crash, OOM, deploy, network partition, power loss. The cleanup has to happen regardless of how the interruption occurred — which means it can't live in a cancel handler. It has to live in reconcile: on every reconciler tick, read the actual state of the world, compute the delta to desired state, and drive one toward the other — including cleaning up stale exclusion-list entries that don't correspond to any pod currently being drained.

This is the cleanup-phase-survives-interruption pattern, which is itself an instance of the Kubernetes-operator convergence discipline ("reconcile, don't choreograph").
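
A reconcile-style sweep can be sketched as follows (a minimal illustration, assuming the operator can list both the current exclusion entries and the set of pods with a live drain in progress; the function and variable names are invented):

```go
package main

import "fmt"

// reconcileExclusions runs on every reconciler tick: it derives the
// exclusion list that *should* exist from current state (the pods with
// a live drain) and removes everything else. It never needs to know how
// or when the previous drain was interrupted.
func reconcileExclusions(current map[string]bool, draining map[string]bool) []string {
	var removed []string
	for ip := range current {
		if !draining[ip] {
			removed = append(removed, ip) // zombie: no live drain owns this entry
			delete(current, ip)
		}
	}
	return removed
}

func main() {
	exclusions := map[string]bool{"10.0.0.6": true, "10.0.0.7": true}
	draining := map[string]bool{"10.0.0.7": true} // only one drain is live
	removed := reconcileExclusions(exclusions, draining)
	fmt.Println(removed, exclusions) // the stale entry is swept, the live one kept
}
```

Because the sweep runs on every tick, it is idempotent and crash-safe: whichever process picks up after an interruption converges the list back to what the current state of the world justifies.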

Generalisation

Zombie-list state shows up anywhere a "mark, act, unmark" sequence is implemented as three separate writes to a shared coordinator:

  • Database maintenance flags: an app marks a row is_processing=true, does work, unmarks. Crash between mark and unmark → row stuck is_processing forever. Fix: TTL on the flag + reconciler that expires stale flags, not a cancel-branch.
  • Distributed lock leases: lock holder marks itself owner, does work, releases. Crash without release → zombie lock. Fix: lease TTL + renewal, not a cancel-branch.
  • Load-balancer drain lists: service instance marks itself out-of-rotation for graceful shutdown, then exits. Restart without unmark → zombie drain. Fix: health-check expiry on the drain list.
  • Service-mesh circuit-breaker open state: failure triggers breaker open, success closes it. If the recovery path panics before close, breaker stays open for all future requests. Fix: time-bounded open → half-open transition.
  • Shard-allocation exclusion (this page): Elasticsearch's cluster.routing.allocation.exclude._ip. Same shape.

The unifying structural fix: the "mark" must have a bounded lifetime or be owned by a reconciler that can expire it, never by the caller's success path alone. Every zombie-list bug is a reminder that "clean up in the happy path and trust cancellation in the sad path" is not partial-failure-safe.

Detection

  • Leftover-state metrics: expose the size and age of the coordinator's "exclusion list" as a metric. Zombies show up as entries older than any currently-running drain.
  • Reconciler sweeps: the operator's reconcile loop can, once per tick, cross-reference the exclusion list against the set of currently-draining pods and log a warning on mismatches.
  • Age-of-entry tracking: each list entry gets a since timestamp; operators alert on entries older than a configured threshold.
  • Periodic manual audit: in Zalando Lounge's case, "Manually removing the 'zombie' node from the exclusion list is simple" — which implies an SRE-visible tool to list exclusion-list entries and their correspondence to pods.

Interaction with ctx-cancellation

Zombie-list bugs become much more likely in codebases that also have ctx-cancellation-ignored-in-retry-loop bugs, because the common mode that triggers the zombie — mid-sequence abort — is more frequent. The two defects compose into repeated incidents: the ctx-cancellation bug causes the abort, the missing idempotent cleanup causes the zombie, and the zombie pre-loads the next drain's state with wrong assumptions.
