CONCEPT Cited by 1 source

Operator reconcile abort on spec change¶

Definition¶

The reconcile-abort-on-spec-change contract is the behavioural property of a well-written Kubernetes operator that, when a CRD instance's desired state changes mid-reconcile, the in-flight reconcile aborts and a new reconcile begins against the new desired state — rather than completing the old reconcile's trajectory against now-obsolete intent.

The contract is the responsiveness cousin of reconciler idempotence: without it, the operator's intended-state-driven model degrades into "whatever was asked for 5 minutes ago is what's happening right now," producing latency between spec changes and observable cluster shape.

Canonical wiki framing¶

From the Zalando Lounge 2024-06-20 post (verbatim):

"The intended behaviour of es-operator is as follows: It constantly monitors updates to EDS resources and if change is observed, it compares the state of the cluster to the description and starts to modify the cluster to match its description. If, during that process, EDS gets changed one more time, es-operator should abort the process and start modifying the cluster to match the new desired state."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

The Zalando Lounge incident was precisely this contract failing to hold in practice — one retry loop deep inside the drain path did not observe ctx-cancellation, so a new EDS update (the morning scale-out) could not preempt the stuck scale-in drain. See concepts/context-cancellation-ignored-in-retry-loop for the bug-class anatomy.
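
As a rough illustration of the bug class (not es-operator's actual code), the sketch below shows a retry loop that honours cancellation. The names `retryUntilDone`, `errNotReady`, and `demoPreemptedDrain` are hypothetical, and the 30 ms timer stands in for "a new EDS update arrived". A loop that omits the `ctx.Err()` check and the `select` on `ctx.Done()` would keep retrying after the reconcile had been cancelled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for "the drain has not finished yet".
var errNotReady = errors.New("drain not finished")

// retryUntilDone calls op until it succeeds, but stops as soon as ctx is
// cancelled. It returns the number of attempts made and the final error.
func retryUntilDone(ctx context.Context, interval time.Duration, op func() error) (int, error) {
	attempts := 0
	for {
		// Check before each attempt so a cancelled context does no more work.
		if err := ctx.Err(); err != nil {
			return attempts, err
		}
		attempts++
		if err := op(); err == nil {
			return attempts, nil
		}
		// Wait for the next attempt or for cancellation, whichever comes first.
		timer := time.NewTimer(interval)
		select {
		case <-ctx.Done():
			timer.Stop()
			return attempts, ctx.Err()
		case <-timer.C:
		}
	}
}

// demoPreemptedDrain simulates a drain that can never finish and a spec
// change that cancels it shortly afterwards. It reports whether the loop
// stopped because of cancellation.
func demoPreemptedDrain() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Assumption for the demo: the "new spec" signal fires after 30 ms.
	time.AfterFunc(30*time.Millisecond, cancel)
	_, err := retryUntilDone(ctx, 5*time.Millisecond, func() error { return errNotReady })
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("preempted:", demoPreemptedDrain())
}
```

The double check (once before the attempt, once while waiting) keeps the loop from starting new work after cancellation and from sleeping through it.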

Implementation anatomy¶

The contract needs three cooperating pieces:

1. Watch-based spec observation. The operator's control loop uses watch (not polling) on the CRD, so spec changes arrive promptly as events.
2. Cancellable reconcile context. Each reconcile run receives a context.Context (Go) or equivalent; when a new spec arrives, the outer reconciler cancels the in-flight context and starts a fresh reconcile.
3. Cancellation-aware loop bodies throughout. Every loop, wait, retry, and poll in the reconcile must check the context's cancellation before proceeding. This is where the contract tends to break in practice: an operator can honour cancellation at the top level and still contain sub-loops that silently ignore it.

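The controller-runtime library supplies piece (1) and the context plumbing for piece (2): its watches enqueue reconcile requests, and each run receives a context through Reconcile(ctx, req). As far as its documented behaviour goes, that context is cancelled when the manager shuts down rather than when a newer spec for the same object arrives, so an operator that wants true preemption has to cancel the in-flight context itself. Piece (3) is the author's responsibility on a per-loop basis, which is why individual-loop bugs like the es-operator one are common.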
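
A minimal sketch of the cancellable reconcile context, assuming a hand-rolled control loop rather than any particular framework. `Spec`, `runner`, `OnSpec`, and `Wait` are hypothetical names. `OnSpec` is what a watch event handler would call: it cancels the in-flight reconcile, waits for it to return, and starts a new one against the latest spec.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Spec is a hypothetical stand-in for the desired state of a CRD instance.
type Spec struct{ Replicas int }

// runner allows at most one reconcile at a time and preempts it on new specs.
type runner struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	done   chan struct{}
}

// OnSpec cancels the in-flight reconcile (if any), waits for it to return,
// then starts a fresh reconcile for s. If the old reconcile ignores
// cancellation, the wait below blocks, which is the failure this page describes.
func (r *runner) OnSpec(parent context.Context, s Spec, reconcile func(context.Context, Spec)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancel != nil {
		r.cancel()
		<-r.done
	}
	ctx, cancel := context.WithCancel(parent)
	done := make(chan struct{})
	r.cancel, r.done = cancel, done
	go func() {
		defer close(done)
		reconcile(ctx, s)
	}()
}

// Wait blocks until the most recently started reconcile has returned.
func (r *runner) Wait() {
	r.mu.Lock()
	done := r.done
	r.mu.Unlock()
	if done != nil {
		<-done
	}
}

// demoAbortOnSpecChange feeds a stuck scale-in followed by a scale-out and
// returns the events in the order they happened.
func demoAbortOnSpecChange() []string {
	var mu sync.Mutex
	var events []string
	record := func(e string) { mu.Lock(); events = append(events, e); mu.Unlock() }

	reconcile := func(ctx context.Context, s Spec) {
		if s.Replicas == 2 {
			// Simulated stuck scale-in: it only ever stops when cancelled.
			<-ctx.Done()
			record("aborted replicas=2")
			return
		}
		record(fmt.Sprintf("applied replicas=%d", s.Replicas))
	}

	r := &runner{}
	r.OnSpec(context.Background(), Spec{Replicas: 2}, reconcile)
	r.OnSpec(context.Background(), Spec{Replicas: 5}, reconcile)
	r.Wait()
	return events
}

func main() {
	fmt.Println(demoAbortOnSpecChange())
}
```

Waiting for the old reconcile before starting the new one trades a little latency for the guarantee that two reconciles never mutate the cluster concurrently. It also makes any loop that ignores cancellation show up as a visible stall.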

Why the contract is load-bearing¶

Kubernetes operators are expected to close the latency gap between a spec change and the corresponding change in actual state. Applications making EDS changes (e.g. Zalando's cronjob manipulating the replicas field) assume the operator will start working on the new shape within seconds of the change, rather than completing the old plan first and only then starting the new one. That assumption matters most when the two specs are in tension: a stuck scale-in that the morning scale-out is trying to replace cannot be left to "finish", because it will never finish.

In Zalando's case: the nightly scale-in was stuck on shard-allocation awareness with no hope of completion. Only preemption (the abort contract) could break the livelock. The broken contract meant preemption didn't happen.

Contract, but not idempotence¶

Abort-on-spec-change is about preemption, not about idempotence. An operator that preempts cleanly can still leave bad intermediate state if its cleanup path isn't idempotent — which is exactly the shape of the Zalando Lounge second bug: even if ctx-cancellation were honoured perfectly, the cleanup-on-interruption bug would still produce zombie exclusion-list entries every time a drain got aborted.

Proper operator design requires both:

Abort-on-spec-change (this concept), so reconciles are responsive.
Cleanup that survives interruption, so aborts don't leave state debt.

Together they produce the "spec is the source of truth; the operator converges quickly and safely" promise that is the whole value of the operator pattern.
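
One way to sketch the second requirement, under simplifying assumptions: the cleanup runs in a deferred function with its own short-lived context, so it still executes after the reconcile context has been cancelled. The `exclusions` map, `drain`, and the node name `es-data-3` are hypothetical stand-ins, not es-operator's real data structures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// exclusions is a hypothetical in-memory stand-in for the cluster's
// allocation exclusion list.
type exclusions map[string]bool

// remove is idempotent: removing an absent node is a no-op. It honours the
// context it is given, so it must not be handed an already-cancelled one.
func (e exclusions) remove(ctx context.Context, node string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	delete(e, node)
	return nil
}

// drain excludes node, waits for its shards to move, and always attempts to
// undo the exclusion, whether the wait succeeded, failed, or was aborted.
func drain(ctx context.Context, e exclusions, node string, waitForShards func(context.Context) error) (err error) {
	e[node] = true
	defer func() {
		// Use a fresh context: the reconcile ctx may already be cancelled,
		// and cleanup has to run anyway. The timeout bounds the cleanup.
		cleanupCtx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if cerr := e.remove(cleanupCtx, node); cerr != nil && err == nil {
			err = cerr
		}
	}()
	return waitForShards(ctx)
}

// demoInterruptedDrain aborts a drain and reports whether it was aborted by
// cancellation and how many exclusion entries were left behind.
func demoInterruptedDrain() (aborted bool, leftover int) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // The spec changed: the reconcile context is already cancelled.
	e := exclusions{}
	err := drain(ctx, e, "es-data-3", func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	return errors.Is(err, context.Canceled), len(e)
}

func main() {
	aborted, leftover := demoInterruptedDrain()
	fmt.Println("aborted:", aborted, "leftover exclusions:", leftover)
}
```

Passing the cancelled reconcile context to `remove` instead would make the cleanup fail immediately and leave the entry behind, which is the zombie-state shape described above.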

Seen in¶

sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes — canonical wiki framing. The intended contract is stated verbatim; the incident is a contract-violation case study (one retry loop ignored ctx-cancellation, so a new EDS update could not preempt the stuck drain).

systems/es-operator
concepts/kubernetes-operator-pattern
concepts/context-cancellation-ignored-in-retry-loop — the specific bug class that breaks the contract.
concepts/zombie-exclusion-list-state — the partial-failure consequence when aborts happen mid-mutation without idempotent cleanup.
patterns/cleanup-phase-survives-interruption — the complementary discipline.