Cloudflare: How we found a bug in Go's arm64 compiler
Summary
Weeks-long debugging retrospective on a one-instruction race
condition in Go's arm64 code generator. On stack frames
slightly larger than 1<<12 bytes, the Go compiler emitted the
function epilogue's stack-pointer adjustment as two separate
ADD opcodes (ADD x, RSP, RSP; ADD y<<12, RSP, RSP) rather
than a single indivisible operation. If Go's runtime
async preemption interrupted a
goroutine between those two opcodes — a window of one arm64
instruction — the stack pointer pointed into the middle of the
stack frame. Any subsequent stack
unwinding (garbage collection scanning goroutine stacks,
defer evaluation during panic recovery, traceback generation)
would read an invalid parent frame and crash with either
fatal error: traceback did not unwind completely or a
SIGSEGV dereferencing a non-function pointer as if it were an
m struct. Scale of 84 million HTTP requests per second
across 330 cities surfaced the million-to-one race often
enough to show up as up to 30 fatal panics per day across <10 %
of Cloudflare's data centers. Fixed upstream in go1.23.12 /
go1.24.6 / go1.25.0 by emitting a single indivisible ADD
opcode via a temporary register.
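The two-opcode decomposition described above is plain bit arithmetic, sketched below. The helper name `splitAddImm` is hypothetical and this is not the Go assembler's actual code; the value 65616 is chosen only because it equals 80 + (16<<12), the same pair of immediates that appears in the coredump disassembly quoted on this page.

```go
package main

import "fmt"

// splitAddImm (hypothetical helper) splits a stack adjustment into the two
// immediates of an ADD / ADD-shifted-left-by-12 pair: the low 12 bits and
// the next 12 bits. ok is false when n needs more than 24 bits.
func splitAddImm(n uint32) (lo, hi uint32, ok bool) {
	if n >= 1<<24 {
		return 0, 0, false
	}
	return n & 0xfff, n >> 12, true
}

func main() {
	const frame = 65616 // illustrative value: 80 + (16 << 12)
	lo, hi, _ := splitAddImm(frame)
	fmt.Printf("ADD $%d, RSP, RSP\n", lo)
	fmt.Printf("ADD $(%d<<12), RSP, RSP\n", hi)
	// Between the two opcodes SP has moved by lo only, so it is still
	// hi<<12 bytes short of where the caller's frame begins.
	fmt.Printf("SP short by %d bytes inside the window\n", hi<<12)
}
```

For this value the program prints the `ADD $80` / `ADD $(16<<12)` pair and reports that SP is 65536 bytes short while only the first opcode has executed, which is the state an unwinder would observe if preemption landed in the window.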
Key takeaways
- The crash site was remote from the root cause. All fatal
panics occurred in `(*unwinder).next` during stack unwinding triggered by GC or panic recovery. The initial theory (stack memory corruption from old panic/recover error-handling code) "seemed to work": panics dropped after that code was removed. A month later they returned at a higher rate, with no correlation to any release or infrastructure change. The first round of theorising pattern-matched on a Go Netlink GitHub issue (golang#73259) whose symptoms matched exactly; both turned out to be downstream of the same compiler bug. (Source: sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler)
- The smoking gun was a coredump. Production coredump loaded
into `dlv`; a goroutine paused mid-preemption showed a program counter `0x555577cb2880` sitting between two epilogue opcodes of `(*NetlinkSocket).Receive`:

      nl_linux.go:779 0x555577cb2878 LDP -8(RSP), (R29, R30)
      nl_linux.go:779 0x555577cb287c ADD $80, RSP, RSP
      nl_linux.go:779 0x555577cb2880 ADD $(16<<12), RSP, RSP  ← PC here
      nl_linux.go:779 0x555577cb2884 RET

  A log query confirmed "the majority of stack traces showed that this same opcode was preempted." (Source: sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler)
- arm64 is a fixed-length 4-byte ISA with a 12-bit
`ADD` immediate. To add values that don't fit in 12 bits, the architecture reserves a "shift-left-by-12" bit so any 24-bit addition can be decomposed into two opcodes. For stack frames ≤ `1<<12` bytes, Go emits a single `ADD $n, RSP, RSP`. For frames slightly larger (the reproducer uses `1<<16 + 8`), the Go assembler (`cmd/internal/obj/arm64/asm7.go`) classified the immediate in `conclass` and emitted two `ADD` opcodes as a split-shifted pair. The intermediate state (one adjustment applied, one pending) is observable for exactly one instruction.
- Why this is a crash, not just memory corruption. The
function is in its epilogue; stack data being corrupted is
actively in the process of being thrown away. The fatal issue
is that Go's runtime relies on the stack pointer being
accurate during unwinding because it dereferences
`sp` to locate the calling function. With `sp` partially modified, the unwinder reads `sp`, looks for a parent function, and finds garbage. Either the return address is null (`finishInternal` throws `traceback did not unwind completely`) or it is non-zero but not a function (the unwinder assumes the goroutine is running, dereferences `m.incgo` at offset `0x118`, and segfaults). Offset `0x118` in `struct m` is the exact faulting address observed in production.
- Async preemption makes unwinding happen at arbitrary
instruction boundaries. Pre-Go-1.14 scheduling was
cooperative — goroutines yielded only at explicit points
(`runtime.Gosched()`, function prologues, I/O). Go 1.14 introduced async preemption: the `sysmon` thread sends `SIGURG` to any OS thread whose goroutine has run >10 ms, and the signal handler mutates the program counter and stack to mimic a call to `asyncPreempt`. This is the mechanism that widened the race window from "nil" to "one instruction" and made this bug reachable.
- Minimal reproducer is stdlib-only, ~35 lines. Function
with a 64 KiB stack-allocated buffer (forces the split-ADD
epilogue), tight inner loop preventing compiler elision, main
goroutine calling it in a loop and a sibling goroutine
spinning on `runtime.GC()` (forces stack unwinding as often as possible). Crashes within ~90 s on go1.23.4 with the canonical `fatal error: traceback did not unwind completely` or a SIGSEGV at `m.incgo`. Notably, the same reproducer did not crash on go1.23.9 even though `objdump` showed the split ADD still present: "some of the behavior is still puzzling. It's a one-instruction race condition, so it's unsurprising that small changes could have large impact."
- The upstream fix is to make SP adjustment indivisible.
Previously the Go compiler emitted a single `add x, RSP` construct at the IR level (`obj.Prog`) and relied on the assembler (`asm7.go`) to split the immediate when needed. The fix changes the compiler to build the offset in a temporary register and add that to `RSP` in a single `ADD R27, RSP, RSP`. Preemption can interrupt before or after that indivisible `ADD`, but never in the middle. Ships in go1.23.12, go1.24.6, go1.25.0.
- Scale was the observational advantage. "Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently." At peak this bug produced up to 30 daily fatal panics across <10 % of data centers (i.e. ~1 machine per day). "The sort of bug that can only really be quantified at a large scale."
- Mitigation before root-cause was a `panic`/`recover` audit. Cloudflare's first intervention, removing a legacy `panic`/`recover` error-handling pattern, tactically reduced fatal panic rates because `recover()` also walks the goroutine stack via `defer`. Removing avoidable stack unwindings narrowed the bug's visible footprint without understanding it. Classic operational shape: buy time with a narrow mitigation while the deeper hunt continues.
- The Go Netlink library was a red herring. Every observed
crash had `(*NetlinkSocket).Receive` on the stack and the library used `unsafe.Pointer`; the `tokio-rustls`-style "corrupt stack from user code" hypothesis was plausible but wrong. The real reason Netlink showed up was that `(*NetlinkSocket).Receive` happened to have a large stack frame that triggered the split `ADD` epilogue. The lesson: "small changes could have large impact". Stack size crosses the `1<<12` threshold, the epilogue shape changes, and the reachable race surface changes.
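The reproducer shape described in the takeaways (a large stack frame, a loop the compiler cannot drop, and a sibling goroutine forcing garbage collection) might look roughly like the sketch below. It is written from that description, not copied from the article; the names `bigStack` and the iteration count are assumptions, and the main loop is bounded so the program terminates, whereas a real hunt for the crash would loop indefinitely on an affected arm64 toolchain.

```go
package main

import (
	"fmt"
	"runtime"
)

// bigStack keeps a 64 KiB array in its own stack frame so that the frame
// is far larger than 1<<12 bytes. The array is written and read back so
// the compiler has no reason to remove it.
//
//go:noinline
func bigStack(val int) int {
	var buf [1 << 16]byte
	for i := range buf {
		buf[i] = byte(val)
	}
	sum := 0
	for i := range buf {
		sum += int(buf[i])
	}
	return sum
}

func main() {
	// Sibling goroutine: force GC (and therefore stack scanning of other
	// goroutines) as often as possible.
	go func() {
		for {
			runtime.GC()
		}
	}()

	// Bounded here for illustration only; raise or remove the bound to
	// actually search for the race on an affected release.
	total := 0
	for i := 0; i < 2000; i++ {
		total += bigStack(i)
	}
	fmt.Println("completed without a fatal error, checksum:", total)
}
```

On a toolchain that carries the fix, or on an architecture other than arm64, this is expected to run to completion. Inspecting the epilogue of `bigStack` with `go tool objdump` is one way to see whether the stack-pointer adjustment is emitted as one opcode or two.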
Operational numbers
- Observational scale: 84 M HTTP req/s across 330 cities.
- Crash rate at peak: up to 30 fatal panics per day across <10 % of data centers (≈ 1 machine / day).
- Stack-frame threshold: split `ADD` epilogue kicks in at frame size > `1<<12` = 4096 bytes on arm64.
- Faulting instruction granularity: one arm64 instruction (4 bytes). The async-preemption window was exactly one instruction wide.
- Preemption trigger: Go runtime's `sysmon` sends `SIGURG` to the OS thread of any goroutine that has been running >10 ms.
- Dereferenced offset in `struct m`: `0x118` = offset of the `incgo` field.
- Reproducer time-to-crash: ~90 s on go1.23.4 / arm64.
- Upstream fix shipped: go1.23.12, go1.24.6, go1.25.0.
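The preemption trigger listed above can be observed with a small experiment, sketched here under the assumption of a Go 1.14+ runtime on a platform that supports signal-based preemption. The function name `resumeAfterSpin` is hypothetical. With a single P, a goroutine spinning in a loop that contains no function calls has no cooperative yield point, so the sleeping caller can only run again if the runtime interrupts the spinner.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// resumeAfterSpin pins the scheduler to one P, starts a goroutine that
// spins in a call-free loop, sleeps on the calling goroutine, and returns
// how many milliseconds passed before the caller ran again.
func resumeAfterSpin(sleepMillis int) int64 {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev) // runs last: restore the previous setting

	var stop atomic.Bool
	defer stop.Store(true) // runs first: let the spinner exit

	go func() {
		// atomic loads compile to plain instructions, so this loop has
		// no function call at which it could yield cooperatively.
		for !stop.Load() {
		}
	}()

	start := time.Now()
	time.Sleep(time.Duration(sleepMillis) * time.Millisecond)
	return time.Since(start).Milliseconds()
}

func main() {
	ms := resumeAfterSpin(20)
	fmt.Printf("caller resumed after %d ms despite a spinning goroutine\n", ms)
}
```

The fact that the function returns at all is the demonstration: the spinner was stopped at an arbitrary instruction boundary, which is the same property that made the two-opcode epilogue observable mid-update.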
Caveats
- Not every variable affecting reachability is understood — the original reproducer crashed on go1.23.4 but not on go1.23.9 with the split ADD still objdump-visible. "There remain a few unknown variables which affect the likelihood of hitting the race condition."
- Cloudflare did not disclose how much production traffic was lost or degraded across the weeks-long investigation; the impacted service is a "largely idle control plane service where unplanned restarts have negligible impact."
- No breakdown of how far `panic`/`recover` removal reduced the rate (first mitigation phase) vs. how much latent exposure remained through the second phase.
- The article is silent on why the compiler's `obj.Prog`-level IR didn't carry immediate-length awareness in the first place ("Notably, this IR is not aware of immediate length limitations."), and on whether that is considered design debt to clean up beyond this specific `ADD $n, RSP, RSP` case.
- No per-Go-version production rollout timeline (e.g. which minor release was in production during which phase).
- amd64 is not affected because x86 is a variable-length ISA whose `ADD imm, reg` can encode a 32-bit immediate directly in one instruction; the immediate-encoding limit is a fixed-length ISA concern.
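Because reachability varied between point releases in ways the article does not fully explain, a more dependable signal than "does the reproducer crash" is whether the toolchain is at or past one of the releases listed as carrying the fix. The sketch below uses the standard `go/version` package (available from Go 1.22); the function name `atOrPastFix` is hypothetical, and treating every release line older than 1.23 as unfixed is an assumption, since the page only names go1.23.12, go1.24.6 and go1.25.0.

```go
package main

import (
	"fmt"
	"go/version"
	"runtime"
)

// atOrPastFix reports whether v (for example "go1.23.4") is at or beyond a
// release named on this page as carrying the fix. Strings that go/version
// considers invalid, such as development builds, report false.
func atOrPastFix(v string) bool {
	switch version.Lang(v) {
	case "go1.23":
		return version.Compare(v, "go1.23.12") >= 0
	case "go1.24":
		return version.Compare(v, "go1.24.6") >= 0
	}
	// Assumption: older lines received no backport; newer lines inherit it.
	return version.Compare(v, "go1.25.0") >= 0
}

func main() {
	v := runtime.Version()
	fmt.Printf("%s on %s: at or past a fixed release = %t\n",
		v, runtime.GOARCH, atOrPastFix(v))
}
```

The architecture is printed alongside the result because, per the caveat above, the check only matters for arm64 builds.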
Source
- Original: https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/
- Raw markdown: `raw/cloudflare/2025-10-08-we-found-a-bug-in-gos-arm64-compiler-abca0869.md`
Related
- systems/go-compiler: the IR level (`obj.Prog` in `cmd/internal/obj/arm64/obj7.go`) that emitted the SP adjustment which the assembler then split; the fix moved the SP adjustment into an indivisible opcode.
- systems/go-assembler: `asm7.go`'s `conclass` that classified the immediate and emitted the `ADD + ADD<<12` pair.
- systems/go-runtime-scheduler: the g/m/p types and `sysmon` preemption thread.
- systems/arm64-isa: fixed-length 4-byte ISA with 12-bit `ADD` immediate; split-shifted-pair encoding for wider adds.
- systems/go-netlink: incidental trigger (large stack frame on `NetlinkSocket.Receive`); the red herring that matched an existing upstream issue.
- concepts/async-preemption-go: Go 1.14+ SIGURG-based preemption that widened the race window.
- concepts/stack-unwinding: GC / panic / traceback all require `sp` to be valid; partial SP adjustment breaks the invariant.
- concepts/split-instruction-race-window: generalised shape: a multi-opcode adjustment of shared state creates a one-instruction race window.
- concepts/immediate-encoding-limit: fixed-length ISA constraint forcing immediates to be split across multiple opcodes.
- concepts/compiler-generated-race-condition: codegen, not user code, introduced the non-atomic sequence.
- concepts/m-n-scheduler: Go's lightweight M:N scheduler context.
- concepts/kernel-panic-from-scale: structurally adjacent wiki concept: scale amplifies rare failure modes into observable ones.
- patterns/isolated-reproducer-for-race-condition: stdlib-only minimal reproducer in ~35 lines.
- patterns/preemption-safe-compiler-emit: emit opcode sequences that are safe under async preemption (indivisible updates of runtime-observable state).
- patterns/upstream-the-fix: fourth wiki instance; Cloudflare drove the Go toolchain fix rather than working around it in their own code.
- companies/cloudflare: Cloudflare's scale-amplifies-rare-bugs posture applied to the toolchain layer.