PATTERN Cited by 1 source
Isolated reproducer for race condition¶
Intent¶
After diagnosing a race condition's mechanism from production evidence (logs, coredumps, disassembly), build a minimal, self-contained, stdlib-only program that triggers the same crash in an isolated environment. Ideally:
- No third-party dependencies.
- No production data / configuration.
- Runs on a developer laptop.
- Deterministic enough to crash within seconds-to-minutes.
The minimal reproducer serves three purposes simultaneously:
- Validates the mechanism hypothesis. If your reproducer crashes for the same reason as production, your understanding is correct.
- Provides the upstream with a credible report. No maintainer can reject a bug report that consists of 35 lines of code plus a stack trace. See patterns/upstream-the-fix.
- Enables bisecting the fix. Regressions between
toolchain versions, or the presence/absence of the bug
across configurations, can be swept with
git bisect+ the reproducer.
Canonical instance: Cloudflare, 2025-10¶
The Cloudflare team hit a fatal-panic class that was geographically random, not correlated with any release or infrastructure change, and showed up in ~1 machine per day across <10 % of 330 data centers (sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler).
After diagnosing via a production coredump that the crash PC
sat between two specific opcodes in an (*NetlinkSocket).Receive
epilogue, they synthesised the hypothesis: "Go's arm64
codegen splits the SP adjustment into two ADD opcodes for
frames > 4 KiB; async preemption landing between those opcodes
leaves SP partially adjusted; GC stack-scan then crashes."
Their reproducer (~35 lines, stdlib only):
package main
import "runtime"
//go:noinline
func big_stack(val int) int {
var big_buffer = make([]byte, 1 << 16)
sum := 0
for i := 0; i < (1<<16); i++ {
big_buffer[i] = byte(val)
}
for i := 0; i < (1<<16); i++ {
sum ^= int(big_buffer[i])
}
return sum
}
func main() {
go func() {
for { runtime.GC() }
}()
for {
_ = big_stack(1000)
}
}
Three ingredients for the crash:
- A function with a stack frame large enough to trigger
the split-
ADDepilogue on arm64.1<<16bytes of stack buffer does it. //go:noinlineto prevent the compiler from optimising the function away and dropping the frame.- A sibling goroutine calling
runtime.GC()in a tight loop to force stack-unwinding as frequently as possible — this is what raises the probability that preemption lands mid-ADDand the unwinder runs before the goroutine leaves the epilogue.
Crashes on go1.23.4 / arm64 within ~90 seconds with the
canonical fatal error: traceback did not unwind completely
or SIGSEGV at m.incgo.
The value of stdlib-only¶
Pre-existing upstream issue
golang#73259 had
stack traces implicating the vishvananda/netlink library. Any
hypothesis "the bug is in netlink" was technically falsifiable
but practically hard to refute without a reproducer that had
no netlink in it. The stdlib-only reproducer was a definitive
refutation — if the Go runtime crashes on code that imports
nothing but runtime, the library is not the culprit.
This is the general shape: strip every non-toolchain dependency until what remains is a pure toolchain-generated race.
Unknowns can be encoded in the reproducer itself¶
The Cloudflare team explicitly flagged that they did not fully understand all variables affecting reachability:
"This reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could objdump the binary and see the split ADD still present. We don't have a definite explanation for this behavior — even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition."
This is honest reporting. A reproducer that crashes sometimes
and not others is still diagnostic — it narrows the investigation
to the variables that affect reachability rather than the
mechanism. A fix that eliminates the split ADD removes the
race regardless of the remaining unknowns.
When to use this pattern¶
- The bug is in shared infrastructure (compiler, runtime, library) that upstream maintainers will own the fix for.
- Production evidence exists but lives behind private data, PII, or complex infrastructure that maintainers cannot reproduce.
- The bug class is inherently scale-amplified and has probably been sitting latent for years — the production observation was a scale-surfaced rediscovery of a bug waiting to happen.
When NOT to use this pattern¶
- Bug is in your own code — bisect your own repo; no reproducer for upstream needed.
- Bug is a correctness issue with a straightforward test case — a standard unit test captures it.
Seen in¶
- sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler — canonical wiki instance. 35-line stdlib-only Go reproducer isolated a compiler/runtime race condition that had been producing scattered production fatal panics for weeks; became the basis for the upstream fix.
Related¶
- patterns/bisect-driven-regression-hunt — complementary pattern. Bisect is about narrowing a regression's range; reproducer is about capturing the mechanism in minimal form. Both often combine — bisect the toolchain using the reproducer.
- patterns/upstream-the-fix — the follow-up; the reproducer is the artifact that makes upstream engagement productive.
- concepts/split-instruction-race-window — the class of bug the canonical reproducer surfaces.
- concepts/compiler-generated-race-condition — the broader failure class.