PATTERN Cited by 1 source

Read the code for partial-failure bugs¶

Intent¶

When debugging failures in a distributed or orchestrated system, read the source code of the components involved rather than relying on documentation, architecture diagrams, or the intended contract. Reserve abstract reasoning for the first hypothesis; switch to code-reading the moment the abstract model fails to explain observed behaviour. The substrate of any real system is its code; partial-failure bugs hide in sub-loops, missing cleanup branches, and error-handling cases that documentation doesn't describe.

Canonical wiki framing¶

From the closing section of Zalando Lounge 2024-06-20:

"What did we learn from all this? Well, Read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough. The details matter, and the code is the final documentation for those. It also mercilessly reveals any bugs that lurk around."

(Source: sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes)

The incident is a useful instance of the pattern in two ways. First, the Zalando team describes their reasoning trajectory: they started with the intended contract ("the operator aborts reconcile on spec change"), found that the observed behaviour violated it, and spent "much of the next day tracing through es-operator source code" before finding the one retry loop with the ctx-cancellation bug. Second, both bugs uncovered (ctx cancellation ignored in one loop, and a missing idempotent cleanup) have exactly the shape that is invisible in abstract models: the model says "reconcile aborts on spec change"; the code says "every reconcile path except this one."

When to invoke¶

Observed behaviour contradicts the intended model. Specifically: the system should do X but is doing Y, and nobody reviewing the spec / docs / diagrams can see where Y could come from.
Bug is transient or edge-case. Partial-failure bugs don't reproduce on casual testing — they need a specific interleaving of events (crash mid-transaction, new spec arriving during in-flight reconcile, retry exceeding budget exactly). Documentation doesn't catalog these interleavings.
The component is open-source and readable. Zalando Lounge could trace through es-operator because it's Zalando's own open-source project. The pattern applies equally to open-source operators / libraries / kernels / databases — any component you have source for.
You've already tried the abstract model and it ran out. This is the tell: "based on the README this should work, but it doesn't, and I don't know what's missing."

How to do it effectively¶

Trace from the symptom, not from the entry point. Start at the observed failure (the drain hung on pod v2-6) and work backwards through the code paths that could have produced it, not forwards from main.
Ignore abstract-model prose in code comments. Comments lag behind code. A comment saying "this loop aborts on ctx.Done" over a loop that doesn't is exactly the trap. Read the control flow, not the documentation-in-place.
Read with the failure mode in mind. You're not trying to understand the whole system; you're trying to find the specific path where X-should-happen-but-doesn't. Narrow the search: which functions touch the state that's wrong?
Compare happy-path to error-path. Partial-failure bugs almost always live in error-handling branches that are underweighted relative to happy-path code. If the happy-path flow is 100 lines and the error-path is 3 lines, read the 3 lines with more suspicion.
Suspect every loop. For ctx-cancellation bugs specifically: grep for loops and timed waits (for, while, time.After, retry libraries) and confirm that each one checks ctx.Done() or an equivalent cancellation signal.
Suspect every multi-step state mutation. For zombie-state bugs: identify every place the code writes to shared external state, trace whether the undo is in a success-path branch, and note the call sites where that branch might be skipped.

Cost¶

The pattern is expensive: Zalando spent "much of the next day" in the es-operator source. The severity of the failure mode (recurring morning production alerts on a critical workload) justified that, but the time is not free, and every deep-dive carries a real opportunity cost.

Mitigate by:

Stopping when the model is sufficient. Not every bug needs source-reading — many are obvious from the description.
Pair-reading. Two people trace the same code path faster than one because they cross-check each other's assumptions.
Recording what you find. The findings from the Zalando trace are now in two PRs (#405, #423) and the blog post, so future incidents of the same shape can reference them directly.

Complementary practices¶

Reproduce in a test harness once you have a hypothesis. "If I cancel ctx during this retry loop, does the operator behave as expected?" is a precise test to write once you've narrowed the suspect region.
Upstream the fix. Zalando merged PR #405; the benefit accrues to every es-operator user. See the "upstream the fix" pattern for the natural follow-through.
Write the post-mortem. The blog post itself is the durable artifact of the code-reading session — others hitting the same failure mode can start with the write-up rather than re-deriving the trace.

Contrast with "trust the abstraction"¶

This pattern is the counterweight to the common engineering discipline of "design at the abstract level, implement from the spec, trust the implementation." That discipline scales well for forward-engineering new systems. It scales poorly for debugging partial-failure bugs in systems under real load, because the bug is in the implementation that the abstraction claims to correctly realize. The Zalando Lounge incident is an object lesson: two bugs that were invisible at any level of abstract modelling were both plainly visible in ~50-line stretches of the drain code.

Seen in¶

sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes: canonical wiki instance. The closing lesson of the post is literally "Read the code." Zalando's team spent much of a day tracing es-operator source to find a ctx-cancellation omission in one retry loop and a missing idempotent cleanup in the drain path, both invisible at the abstract-operator-contract level. Two PRs resulted (#405, #423), merged or in progress.

systems/es-operator
concepts/context-cancellation-ignored-in-retry-loop: the bug class source-reading revealed at Zalando Lounge.
concepts/zombie-exclusion-list-state: the other bug class.
concepts/operator-reconcile-abort-on-spec-change: the intended contract whose violation source-reading uncovered.
patterns/cleanup-phase-survives-interruption: the structural fix source-reading ultimately pointed to.