PATTERN Cited by 1 source

Cleanup phase survives interruption

Intent

Design a multi-step state mutation — acquire → act → release, mark → act → unmark, open → act → commit — so that the release / unmark / commit step executes regardless of how the middle step was interrupted: not only on the success path, not only in a graceful-cancel handler, but also on crash, OOM-kill, restart, network partition, and ungraceful process exit.

Equivalently: no cleanup step that matters for correctness should live in a success-path branch or a cancel-branch handler. It must live in a place that runs independently of how the previous step ended. For operators, that place is reconcile.

When to use

The pattern is mandatory when all three of:

  • A multi-step mutation has a mark (or lock, or exclusion) step whose state is visible to other processes via a shared coordinator (Kubernetes API, Elasticsearch cluster state, database row, lock service).
  • Interruption between "mark" and "clean up" leaves the coordinator's state in a configuration that affects subsequent operations' behaviour (a stale lock blocks other writers; a stale exclusion list misroutes traffic; a stale "processing" flag prevents retry).
  • Interruption can happen in a mode your cancel-branch handler does not cover (crash, deploy, OOM — all of which kill the process without running any deferred cleanup).

The pattern is not needed when the interruption's effect is self-healing (the only consumer of the stale state is the same process, which re-reads at boot) or when the "mark" has a natural TTL mechanism built in (lease-based locks, sliding-window throttle counters).

Why "add a cancel-branch" isn't enough

The canonical anti-fix, from the Zalando Lounge 2024-06-20 post (verbatim):

"Just adding a special if clause for cleaning up in case of cancellation would solve the simple instance of this problem. But we are potentially dealing with partial failure here. Any amount of if clauses wouldn't solve the problem when the es-operator crashes in the middle of the draining process."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

A cancel handler only covers the case where cancellation actually runs its cleanup code. Crashes, kernel OOM-kills, forced deploys, and power loss all bypass every handler the caller has. Partial-failure correctness must be asserted at a higher level than the interrupted code itself.
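This is easy to demonstrate: the following sketch (Python, with a hypothetical inline worker) shows a try/finally, the moral equivalent of a cancel handler, being bypassed by a hard kill. The worker and marker file are illustrative, not from the source.

```python
import subprocess
import sys
import textwrap

def run_worker_and_kill(marker_path: str) -> None:
    """Start a worker whose finally block would create marker_path, then
    kill it mid-"act". The finally clause (and any deferred cleanup,
    atexit hook, or signal handler) never runs."""
    worker = textwrap.dedent(f"""
        import time
        try:
            print("marked", flush=True)         # the "mark" step is now visible
            time.sleep(30)                      # the "act" step, interrupted
        finally:
            open({marker_path!r}, "w").close()  # cleanup that never happens
    """)
    proc = subprocess.Popen([sys.executable, "-c", worker],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the mark has happened
    proc.kill()             # SIGKILL: the process gets no chance to clean up
    proc.wait()
```

After `run_worker_and_kill(path)` returns, `path` does not exist: the "mark" became externally visible, the cleanup did not, which is exactly the zombie-state scenario.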

Implementation patterns

Three load-bearing shapes:

1. Reconcile-based idempotent cleanup (Kubernetes-operator-native)

Every reconcile tick:

  1. Read the shared coordinator's current state (e.g. the cluster.routing.allocation.exclude._ip list).
  2. Read the declared intent (the CRD spec plus the drains actually in progress).
  3. Compute the diff: entries in the coordinator that correspond to no in-flight drain are zombies — remove them.
  4. Compute the additions: in-flight drains whose target pod is not in the coordinator's list — add them.
  5. Execute the diff.

The operator doesn't need to know how the previous cycle ended; it just needs to make the coordinator's state match reality. This is precisely the mechanism the Kubernetes operator model is designed around.
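The diff computation at the heart of the tick can be sketched in a few lines. This is a minimal illustration, not the es-operator's actual code; the function name and the use of IP strings are assumptions.

```python
def reconcile_exclusions(coordinator_excludes: set[str],
                         in_flight_drains: set[str]) -> tuple[set[str], set[str]]:
    """One reconcile tick's diff (steps 3-4 above).

    coordinator_excludes: IPs currently in the shared exclusion list
                          (e.g. cluster.routing.allocation.exclude._ip).
    in_flight_drains:     IPs of pods a drain is actually in progress for.
    Returns (zombies_to_remove, missing_to_add).
    """
    zombies = coordinator_excludes - in_flight_drains  # no drain owns these
    missing = in_flight_drains - coordinator_excludes  # drain not yet marked
    return zombies, missing
```

Note that the computation is stateless with respect to previous cycles: a crash between any two steps just means the next tick sees a slightly different pair of sets.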

2. Leased / TTL'd marks

The "mark" step writes a value with an expiration. If cleanup doesn't happen before the lease expires, the mark auto-evicts. Requires (a) the coordinator to support TTLs natively (etcd, Redis, DynamoDB TTL) or (b) a companion sweeper that polls ages and evicts old marks.

  • Distributed locks (etcd lock, Consul session, ZooKeeper ephemeral znode) — all use this shape.
  • Throttle / rate-limit counters — sliding window evicts naturally.

3. Transactional fencing

The "mark" and "act" run in a single transaction, so either both happen or neither does. The cleanup doesn't need to be separately interruptible because there's nothing to clean up if act didn't commit. Only applicable when the coordinator supports transactions over the whole mark+act scope, which Kubernetes / Elasticsearch cluster state / most shared coordinators do not.

Canonical wiki instance

Zalando Lounge's es-operator drain protocol, 2024-06-20:

  • The intended protocol has steps (1) mark pod excluded in cluster.routing.allocation.exclude._ip, (2) wait for shards to relocate out, (3) remove pod, (4) clean up exclusion list.
  • Step (4) is only executed on the happy path. If the process is interrupted between (1) and (4), the exclusion list entry becomes a zombie — forever excluded, blocking subsequent drains from using that pod for shard placement.
  • The fix is not to add step (4) to a cancel handler but to run the cleanup on every reconcile tick: cross-reference exclusion-list entries against currently-draining pods and remove stale entries. See WIP PR #423. At publish time, Zalando accepted manual cleanup.

Generalisation — where this pattern is missing in the wild

The pattern is routinely absent in code bases where multi-step protocols were implemented with the "happy path + optional cancel handler" pattern common in early Go / Python async codebases:

  • Database is_processing = true flags that strand rows when the processor crashes. Fix: TTL on the flag or a periodic sweeper.
  • Service-mesh circuit breakers that open on failure and rely on success to close; if recovery path panics, breaker stays open. Fix: time-bounded open → half-open.
  • Load-balancer drain lists that mark an instance "out of rotation" and rely on graceful shutdown to un-mark. Fix: health-check-driven removal, not shutdown-driven.
  • Distributed-lock ownership rows that rely on release-on-unlock and leak on crash. Fix: leases.

Every case is recognisable by the question "if the caller dies immediately after the mark step, does the mark eventually evaporate?" If the answer is "only if another caller calls the unmark code," the pattern is missing.

Detection

  • Leftover-state metrics — size and age of the coordinator's exclusion / lock / flag list, with alerts on entries older than the longest expected transaction.
  • Reconciler warnings — the reconcile loop itself logs or alerts on "I found state that shouldn't exist."
  • Audit tools — SRE-visible tool to list and manually clean up zombie entries as a stopgap before the pattern is implemented.
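The first detection signal is cheap to compute. A sketch, assuming the coordinator's entries can be read as a key → creation-timestamp map (names are illustrative):

```python
def exclusion_list_metrics(entries: dict[str, float], now: float,
                           max_txn_s: float) -> dict:
    """Leftover-state metrics for a coordinator list: overall size, age of
    the oldest entry, and the entries older than the longest expected
    transaction (alert candidates)."""
    ages = [now - created for created in entries.values()]
    return {
        "size": len(entries),
        "oldest_age_s": max(ages, default=0.0),
        "stale": sorted(k for k, created in entries.items()
                        if now - created > max_txn_s),
    }
```

Exporting `size` and `oldest_age_s` as gauges and alerting on a non-empty `stale` list catches zombie entries even before the reconcile-based cleanup is implemented.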

Relation to the reconcile-abort-on-spec-change contract

This pattern is orthogonal-but-complementary to concepts/operator-reconcile-abort-on-spec-change. Abort-on-spec-change makes reconciles responsive (preempt old work for new work). Cleanup-phase-survives-interruption makes aborts safe (the preempted work doesn't leave debt). An operator needs both — responsiveness without safety produces state debt on every abort; safety without responsiveness produces user-facing latency for every spec change.

Seen in

Last updated · 501 distilled / 1,218 read