
PATTERN

Isolated reproducer for race condition

Intent

After diagnosing a race condition's mechanism from production evidence (logs, coredumps, disassembly), build a minimal, self-contained, stdlib-only program that triggers the same crash in an isolated environment. Ideally:

  • No third-party dependencies.
  • No production data / configuration.
  • Runs on a developer laptop.
  • Deterministic enough to crash within seconds to minutes.

The minimal reproducer serves three purposes simultaneously:

  1. Validates the mechanism hypothesis. If your reproducer crashes for the same reason as production, your understanding is correct.
  2. Provides the upstream with a credible report. No maintainer can reject a bug report that consists of 35 lines of code plus a stack trace. See patterns/upstream-the-fix.
  3. Enables bisecting the fix. Regressions between toolchain versions, or the presence/absence of the bug across configurations, can be swept with git bisect + the reproducer.

Canonical instance: Cloudflare, 2025-10

The Cloudflare team hit a fatal-panic class that was geographically random, not correlated with any release or infrastructure change, and showed up on ~1 machine per day across <10% of their 330 data centers (sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler).

After diagnosing via a production coredump that the crash PC sat between two specific opcodes in an (*NetlinkSocket).Receive epilogue, they synthesised the hypothesis: "Go's arm64 codegen splits the SP adjustment into two ADD opcodes for frames > 4 KiB; async preemption landing between those opcodes leaves SP partially adjusted; GC stack-scan then crashes."

Their reproducer (~35 lines, stdlib only):

package main

import "runtime"

//go:noinline
func big_stack(val int) int {
    // 64 KiB on-stack array: pushes the frame past the 4 KiB split-ADD threshold.
    var big_buffer [1 << 16]byte
    sum := 0
    for i := 0; i < (1<<16); i++ {
        big_buffer[i] = byte(val)
    }
    for i := 0; i < (1<<16); i++ {
        sum ^= int(big_buffer[i])
    }
    return sum
}

func main() {
    // Sibling goroutine hammers the GC so stack scans run as often as possible.
    go func() {
        for { runtime.GC() }
    }()
    // Repeatedly enter and leave big_stack's large-frame epilogue.
    for {
        _ = big_stack(1000)
    }
}

Three ingredients for the crash:

  1. A function with a stack frame large enough to trigger the split-ADD epilogue on arm64. 1<<16 bytes of stack buffer does it.
  2. //go:noinline to prevent the compiler from optimising the function away and dropping the frame.
  3. A sibling goroutine calling runtime.GC() in a tight loop to force stack-unwinding as frequently as possible — this is what raises the probability that preemption lands mid-ADD and the unwinder runs before the goroutine leaves the epilogue.

Crashes on go1.23.4 / arm64 within ~90 seconds, with the canonical failures: "fatal error: traceback did not unwind completely" or a SIGSEGV at m.incgo.
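The ingredients can also be validated individually by removing one at a time. As an illustration (this control experiment is an addition of this write-up, not part of the Cloudflare post), shrinking the buffer far below the 4 KiB threshold keeps everything else identical but removes the split-ADD epilogue, so this variant should spin without ever crashing:

```go
// Control variant: same shape as the reproducer, but with a 256-byte frame,
// well under the 4 KiB threshold, so the SP adjustment stays a single ADD.
package main

import "runtime"

//go:noinline
func small_stack(val int) int {
	var buf [1 << 8]byte // 256 bytes: no split epilogue on arm64
	sum := 0
	for i := 0; i < len(buf); i++ {
		buf[i] = byte(val)
	}
	for i := 0; i < len(buf); i++ {
		sum ^= int(buf[i])
	}
	return sum
}

func main() {
	go func() {
		for { runtime.GC() }
	}()
	// Bounded here so the sketch terminates; loop forever when actually
	// probing for the crash.
	for i := 0; i < 100_000; i++ {
		_ = small_stack(1000)
	}
}
```

If the big-frame version crashes and this one does not, frame size is confirmed as a necessary ingredient rather than an incidental detail.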

The value of stdlib-only

Pre-existing upstream issue golang#73259 had stack traces implicating the vishvananda/netlink library. Any hypothesis "the bug is in netlink" was technically falsifiable but practically hard to refute without a reproducer that had no netlink in it. The stdlib-only reproducer was a definitive refutation — if the Go runtime crashes on code that imports nothing but runtime, the library is not the culprit.

This is the general shape: strip every non-toolchain dependency until what remains is a pure toolchain-generated race.

Unknowns can be encoded in the reproducer itself

The Cloudflare team explicitly flagged that they did not fully understand all variables affecting reachability:

"This reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could objdump the binary and see the split ADD still present. We don't have a definite explanation for this behavior — even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition."

This is honest reporting. A reproducer that crashes sometimes and not others is still diagnostic — it narrows the investigation to the variables that affect reachability rather than the mechanism. A fix that eliminates the split ADD removes the race regardless of the remaining unknowns.

When to use this pattern

  • The bug is in shared infrastructure (compiler, runtime, library) that upstream maintainers will own the fix for.
  • Production evidence exists but lives behind private data, PII, or complex infrastructure that maintainers cannot reproduce.
  • The bug class is inherently scale-amplified and has probably been sitting latent for years — the production observation was a scale-surfaced rediscovery of a bug waiting to happen.

When NOT to use this pattern

  • Bug is in your own code — bisect your own repo; no reproducer for upstream needed.
  • Bug is a correctness issue with a straightforward test case — a standard unit test captures it.
