

Context cancellation ignored in retry loop

Definition

The context-cancellation-ignored-in-retry-loop bug is a concurrency defect in which a retry loop — typically inside an operator's reconcile path, an RPC client's retry-with-backoff, or a long-poll loop — fails to observe its context's cancellation signal (ctx.Done() in Go, CancellationToken.IsCancellationRequested in .NET, AbortSignal.aborted in JS) and continues retrying past the point where the caller has signaled "abandon this work."

The bug is almost never a missing import or type error; it is a missed branch in one specific loop body of an otherwise correct codebase. The surrounding reconcile / RPC / client code usually honours cancellation correctly — it's one specific sub-loop (a drain-retry, a status-poll, a backoff-sleep) whose author forgot the cancellation check.

Canonical wiki instance

es-operator's drain-status poll loop, uncovered in the Zalando Lounge 2024-06-20 incident:

"We spent much of the next day tracing through es-operator source code and finally realised there was a bug regarding retrying on draining nodes for scaling in: In this one specific retry loop, context cancellations are not reacted on. The bug is specific to draining a node and doesn't apply to other processes."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

The es-operator contract, per the post: "if, during that process, EDS gets changed one more time, es-operator should abort the process and start modifying to cluster to match the new desired state." This is the reconcile-aborts-on-spec-change contract — a general property of well-behaved Kubernetes operators.

The violation: the drain-status poll loop ran up to 999 retries, each followed by a short sleep, but its loop body never checked ctx.Done(). So when a new EDS update arrived (Zalando's morning scale-out), the surrounding reconcile could not preempt the stuck drain. Fixed in PR #405.

Structural shape

// Buggy shape (simplified):
for i := 0; i < 999; i++ {
    shardsStill, err := es.ShardsOnNode(node)
    if err == nil && shardsStill == 0 {
        break
    }
    time.Sleep(pollInterval)
}

// Correct shape:
for i := 0; i < 999; i++ {
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }
    shardsStill, err := es.ShardsOnNode(node)
    if err == nil && shardsStill == 0 {
        break
    }
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(pollInterval):
    }
}

The fix is boilerplate; the bug is that the boilerplate was missing in one loop body. This is a high-frequency defect class in Go codebases because context.Context propagates through call chains but using ctx for cancellation is opt-in per loop.

Why static analysis rarely catches it

go vet and golangci-lint's contextcheck catch dropped contexts (calls that take a context.Context parameter but get nil or context.Background() passed in). They do not catch loops that hold a context but never check it — that's not a type error, it's a semantic omission. The bug is load-bearing in domain logic ("we intended to abort but didn't"), and neither the compiler nor a linter can know the intent.

Reliable catchers in practice:

  • Tests that assert timely cancellation — "this reconcile, when given a cancelled ctx, returns within T seconds." Requires the test harness to pass a cancelled context and assert a timely return.
  • Review discipline — "if a loop iterates more than N times, does it check ctx.Done()?" as a checklist item.
  • Metrics / tracing — reconcile-duration histograms that expose the tail, where stuck drains show up as multi-hour outliers.

Relation to other cancellation anti-patterns

  • async cancellation thread-spawn anti-pattern — the orthogonal failure mode where cancellation works but at the cost of spawning threads per cancellation. This page's bug is simpler: cancellation is silently dropped.
  • request cancellation in consensus protocols — the structurally-necessary cancellation primitive for distributed leadership-change protocols. Orthogonal axis: that's about protocol design; this is about retry-loop plumbing.

Interaction with interrupted state-mutation

A ctx-cancellation bug in a retry loop is more dangerous when the loop sits in the middle of a multi-step state mutation (acquire lock / mark excluded / wait / commit / unlock). Even once the bug is fixed so that ctx cancellation does abort the loop, the subsequent cleanup phase needs to be idempotent (see patterns/cleanup-phase-survives-interruption) — otherwise the fix exposes the next bug: zombie state left behind by the now-aborted loop. The 2024-06-20 Zalando incident exhibited exactly this shape: Bug 1 (ctx cancellation) + Bug 2 (cleanup-on-interruption) composed to produce three consecutive morning alerts.
