
CLOUDFLARE 2025-10-08 Tier 1

Cloudflare — How we found a bug in Go's arm64 compiler

Summary

Weeks-long debugging retrospective on a one-instruction race condition in Go's arm64 code generator. On stack frames slightly larger than 1<<12 bytes, the Go compiler emitted the function epilogue's stack-pointer adjustment as two separate ADD opcodes (ADD x, RSP, RSP; ADD y<<12, RSP, RSP) rather than a single indivisible operation. If Go's runtime async preemption interrupted a goroutine between those two opcodes — a window of one arm64 instruction — the stack pointer pointed into the middle of the stack frame. Any subsequent stack unwinding (garbage collection scanning goroutine stacks, defer evaluation during panic recovery, traceback generation) would then read an invalid parent frame and crash with either fatal error: traceback did not unwind completely or a SIGSEGV from dereferencing a non-function pointer as if it were an m struct. Cloudflare's scale (84 million HTTP requests per second across 330 cities) surfaced the rare race often enough to produce up to 30 fatal panics per day across fewer than 10 % of its data centers. Fixed upstream in go1.23.12 / go1.24.6 / go1.25.0 by emitting a single indivisible ADD opcode via a temporary register.

Key takeaways

  • The crash site was remote from the root cause. All fatal panics occurred in (*unwinder).next during stack unwinding triggered by GC / panic recovery. Initial theory (stack memory corruption from old panic/recover error-handling code) "seemed to work" — panics dropped after that code was removed — then a month later panics returned at a higher rate with no correlation to anything (no release, no infrastructure change). The first round of theorising was pattern-matching on a Go Netlink GitHub issue (golang#73259) that matched their symptoms exactly; it turned out both were downstream of the same compiler bug. (Source: sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler)
  • The smoking gun was a coredump. Production coredump loaded into dlv; goroutine paused mid-preemption showed a program counter 0x555577cb2880 sitting between two epilogue opcodes of (*NetlinkSocket).Receive:
    nl_linux.go:779 0x555577cb2878  LDP -8(RSP), (R29, R30)
    nl_linux.go:779 0x555577cb287c  ADD $80, RSP, RSP
    nl_linux.go:779 0x555577cb2880  ADD $(16<<12), RSP, RSP   ← PC here
    nl_linux.go:779 0x555577cb2884  RET
    
    A log query confirmed that "the majority of stack traces showed that this same opcode was preempted." (Source: sources/2025-10-08-cloudflare-we-found-a-bug-in-gos-arm64-compiler)
  • arm64 is a fixed-length 4-byte ISA with a 12-bit ADD immediate. To add values that don't fit in 12 bits, the architecture reserves a "shift-left-by-12" bit, so any 24-bit addition can be decomposed into two opcodes. For stack frames ≤ 1<<12 bytes, Go emits a single ADD $n, RSP, RSP. For frames slightly larger (the reproducer uses 1<<16 + 8), the Go assembler (cmd/internal/obj/arm64/asm7.go) classified the immediate in conclass and emitted two ADD opcodes as the split-shifted pair. The intermediate state (one adjustment applied, one pending) is observable for the duration of one instruction.
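A minimal sketch of the decomposition rule described above (an illustration of the encoding, not the assembler's actual code): a too-wide SP delta splits into a plain low-12-bit ADD plus a shift-left-by-12 ADD.

```go
package main

import "fmt"

// splitImm decomposes a stack-pointer delta that doesn't fit arm64's
// 12-bit ADD immediate into the two-opcode pair the assembler used to
// emit: one plain ADD for the low 12 bits and one shift-left-by-12 ADD
// for the high bits. Any value up to 24 bits can be expressed this way.
func splitImm(delta uint32) (lo, hi uint32) {
	return delta & 0xFFF, delta >> 12
}

func main() {
	// The (*NetlinkSocket).Receive epilogue adjusts SP by 80 + (16<<12):
	lo, hi := splitImm(80 + 16<<12)
	fmt.Printf("ADD $%d, RSP, RSP; ADD $(%d<<12), RSP, RSP\n", lo, hi)
}
```

For the Receive frame this reproduces the two epilogue opcodes seen in the dlv listing above.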
  • Why this is a crash, not just memory corruption. The function is in its epilogue; stack data being corrupted is actively in the process of being thrown away. The fatal issue is that Go's runtime relies on the stack pointer being accurate during unwinding because it dereferences sp to locate the calling function. With sp partially modified, the unwinder reads sp → looks for a parent function → finds garbage. Either the return address is null (→ finishInternal throws traceback did not unwind completely) or it's non-zero-but-not-a-function (→ unwinder assumes goroutine is running, dereferences m.incgo at offset 0x118, segfaults). Offset 0x118 in struct m is the exact faulting address observed in production.
  • Async preemption makes unwinding happen at arbitrary instruction boundaries. Pre-Go-1.14 scheduling was cooperative — goroutines yielded only at explicit points (runtime.Gosched(), function prologues, I/O). Go 1.14 introduced async preemption: the sysmon thread sends SIGURG to any OS thread whose goroutine has run >10 ms, and the signal handler mutates the program counter and stack to mimic a call to asyncPreempt. This is the mechanism that widened the race window from nonexistent to exactly one instruction — and made this bug reachable.
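The effect of async preemption can be seen in a short stdlib-only demo (not from the article, an assumed illustration): on Go ≥ 1.14 a call-free busy loop no longer blocks a stop-the-world GC, because sysmon's SIGURG preempts it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// gcWithBusyLoop runs a stop-the-world GC while a call-free busy loop
// occupies the only P. It can only return on a runtime with async
// preemption (Go >= 1.14): sysmon's SIGURG interrupts the loop after ~10 ms.
func gcWithBusyLoop() bool {
	runtime.GOMAXPROCS(1)
	var done int32
	go func() {
		// Intrinsified atomic load: no function calls, hence no
		// cooperative preemption points inside this loop.
		for atomic.LoadInt32(&done) == 0 {
		}
	}()
	runtime.GC() // must stop the world, including the spinning goroutine
	atomic.StoreInt32(&done, 1)
	return true
}

func main() {
	if gcWithBusyLoop() {
		fmt.Println("GC completed despite the busy loop")
	}
}
```

On a pre-1.14 runtime with GOMAXPROCS(1) this program could hang forever, which is exactly the class of problem async preemption was added to solve.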
  • Minimal reproducer is stdlib-only, ~35 lines. Function with a 64 KiB stack-allocated buffer (forces the split-ADD epilogue), tight inner loop preventing compiler elision, main goroutine calling it in a loop and a sibling goroutine spinning on runtime.GC() (forces stack unwinding as often as possible). Crashes within ~90 s on go1.23.4 with the canonical fatal error: traceback did not unwind completely / SIGSEGV at m.incgo. (Notably the same reproducer did not crash on go1.23.9 even though objdump showed the split ADD still present — "some of the behavior is still puzzling. It's a one-instruction race condition, so it's unsurprising that small changes could have large impact.")
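A bounded sketch of the reproducer's shape (the original spins until it crashes and its exact details differ; this version stops after a deadline for illustration): a function whose stack frame exceeds 1<<12 bytes, called in a loop while a sibling goroutine hammers runtime.GC().

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// bigFrame's 64 KiB + 8 byte stack buffer pushes the frame size past the
// 12-bit ADD-immediate limit, so affected compilers emit the split
// two-ADD epilogue. The loop keeps the buffer from being elided.
func bigFrame() byte {
	var buf [1<<16 + 8]byte
	for i := range buf {
		buf[i] = byte(i)
	}
	return buf[len(buf)-1]
}

func main() {
	// Sibling goroutine forcing stack unwinding (goroutine stack scans)
	// as often as possible.
	go func() {
		for {
			runtime.GC()
		}
	}()
	// Bounded here for illustration; the original loops until it crashes
	// (~90 s on go1.23.4/arm64).
	for deadline := time.Now().Add(2 * time.Second); time.Now().Before(deadline); {
		bigFrame()
	}
	fmt.Println("no crash on this Go version/arch")
}
```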
  • The upstream fix is to make SP adjustment indivisible. Previously the Go compiler emitted a single add x, RSP construct at the IR level (obj.Prog) and relied on the assembler (asm7.go) to split the immediate when needed. The fix changes the compiler to build the offset in a temporary register and add that to RSP in a single ADD R27, RSP, RSP:
    LDP -8(RSP), (R29, R30)
    MOVD $32, R27
    MOVK $(1<<16), R27
    ADD R27, RSP, RSP
    RET
    
    Preemption can interrupt before or after the indivisible ADD R27, RSP, RSP, but never in the middle. Ships in go1.23.12, go1.24.6, go1.25.0.
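How MOVD + MOVK build the full delta in the scratch register can be emulated in a few lines (an illustration of the instruction semantics, not compiler code): MOVK overwrites one 16-bit slice of the register and keeps the rest, so any 64-bit constant is reachable without a wide ADD immediate.

```go
package main

import "fmt"

// movd sets the whole register to a small immediate.
func movd(imm16 uint64) uint64 { return imm16 }

// movk replaces only the 16-bit slice at the given bit offset,
// keeping all other bits of the register intact.
func movk(reg, imm16 uint64, shift uint) uint64 {
	return reg&^(0xFFFF<<shift) | imm16<<shift
}

func main() {
	// MOVD $32, R27; MOVK $(1<<16), R27 from the fixed epilogue:
	r27 := movd(32)
	r27 = movk(r27, 1, 16)
	fmt.Println(r27) // 32 + 1<<16 = 65568
}
```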
  • Scale was the observational advantage. "Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently." At peak this bug produced up to 30 daily fatal panics across <10 % of data centers (i.e. ~1 machine per day). "The sort of bug that can only really be quantified at a large scale."
  • Mitigation before root-cause was a panic/recover audit. Cloudflare's first intervention — removing a legacy panic/recover error-handling pattern — tactically reduced fatal panic rates because recover() also walks the goroutine stack via defer. Removing avoidable stack unwindings narrowed the bug's visible footprint without understanding it. Classic operational shape: buy time with a narrow mitigation while the deeper hunt continues.
  • The Go Netlink library was a red herring. Every observed crash had (*NetlinkSocket).Receive on the stack and the library used unsafe.Pointer; the tokio-rustls-style "corrupt stack from user code" hypothesis was plausible but wrong. The real reason Netlink showed up was that (*NetlinkSocket).Receive happened to have a large stack frame that triggered the split ADD epilogue. The lesson: "small changes could have large impact" — stack-size crosses the 1<<12 threshold → epilogue shape changes → reachable race surface changes.

Operational numbers

  • Observational scale: 84 M HTTP req/s × 330 cities.
  • Crash rate at peak: up to 30 fatal panics per day across <10 % of data centers (≈ 1 machine / day).
  • Stack-frame threshold: split ADD epilogue kicks in at frame size > 1<<12 = 4096 bytes on arm64.
  • Faulting instruction granularity: one arm64 instruction (4 bytes). The async-preemption window was exactly one instruction wide.
  • Preemption trigger: Go runtime's sysmon sends SIGURG to goroutines running >10 ms.
  • Dereferenced offset in struct m: 0x118 = offset of incgo field.
  • Reproducer time-to-crash: ~90 s on go1.23.4 / arm64.
  • Upstream fix shipped: go1.23.12, go1.24.6, go1.25.0.

Caveats

  • Not every variable affecting reachability is understood — the original reproducer crashed on go1.23.4 but not on go1.23.9 with the split ADD still objdump-visible. "There remain a few unknown variables which affect the likelihood of hitting the race condition."
  • Cloudflare did not disclose how much production traffic was lost or degraded across the weeks-long investigation; the impacted service is a "largely idle control plane service where unplanned restarts have negligible impact."
  • No breakdown of how far panic/recover removal reduced the rate (first mitigation phase) vs. how much latent exposure remained through the second phase.
  • The article is silent on why the compiler's obj.Prog-level IR didn't carry immediate-length awareness in the first place ("Notably, this IR is not aware of immediate length limitations.") — whether that's considered a design debt to clean up beyond this specific ADD $n, RSP, RSP case.
  • No per-Go-version production rollout timeline (e.g. which minor release was in production during which phase).
  • amd64 is not affected because the x86 ISA has a single variable-length ADD imm, reg opcode that can encode 32-bit immediates directly — the immediate-encoding limit is a fixed-length ISA concern.
