Datadog — How we tracked down a Go 1.24 memory regression across hundreds of pods
Summary
Datadog rolled out Go 1.24 to a data-processing service across hundreds of Kubernetes pods and observed a ~20% RSS increase that did not appear in Go's own runtime metrics. A staging bisect pinpointed the upgrade; disabling the two suspected 1.24 features (`GOEXPERIMENT=noswissmap`, `GOEXPERIMENT=nospinbitmutex`) did not make the regression disappear. The team drove the investigation from `/proc/[pid]/smaps` + live heap profiles → Gophers Slack collaboration with PJ Malloy (thepudds) → `heapbench` reproducer → `git bisect` on the Go runtime → identified CL 614257 (the `mallocgc` refactor) as having silently removed an optimization that avoided re-zeroing large (>32 KiB) pointer-containing allocations obtained fresh from the OS. Michael Knyszek on the Go team confirmed and authored the fix (CL 659956), shipping in Go 1.25 with a 1.24 backport. The post is a methodical worked example of debugging a regression that is invisible to the runtime's own instrumentation by dropping one layer below it to OS-level memory accounting.
Key takeaways
- Go runtime metrics track virtual memory; RSS tracks resident physical memory — the two can diverge silently. Runtime metrics (from the `runtime/metrics` package, exposed since Go 1.16) showed no change across the 1.23→1.24 upgrade, while system-level RSS grew ~20%. OS/Kubernetes memory limits and the Linux OOM killer act on RSS, so a Go-invisible regression can still OOM-kill pods. The Go runtime's internal accounting is not the ground truth for "how much memory am I using?" (concepts/go-runtime-memory-model).
- `/proc/[pid]/smaps` localized the problem to a single VMA. The Go 1.24 `smaps` dump showed the early-mapped r/w Go-heap region at `Size: 1.28 GiB, Rss: 1.26 GiB` (near-full commit); on Go 1.23 the same mapping was `Size: 1.33 GiB, Rss: 1.04 GiB` (~300 MiB uncommitted). This isolated the regression to the Go heap — not stacks, not mmap'd files, not cgo allocations.
- Live heap profiles pointed to the shape of the workload triggering it. The impacted service's heap was ~50% buffered channels and ~20% `map[string]someStruct` — large allocations containing pointers. thepudds ran `heapbench` across the matrix of {channel/map/slice} × {≤32 KiB, >32 KiB} × {pointer-free, pointer-bearing} and found that only large (>32 KiB) pointer-bearing allocations showed the regression (~2× RSS for buffered channels of pointer structs). Small allocations and non-pointer data were unaffected.
- Root cause: the `mallocgc` refactor lost a "don't re-zero OS-fresh memory" optimization. Go historically skipped zeroing large (>32 KiB) pointer-containing allocations when the backing pages were freshly obtained from the kernel (the kernel zeroes pages before handing them out). CL 614257 (Go 1.24) refactored `mallocgc` and inadvertently removed that skip, causing an unconditional `memclr` on every large pointer-bearing allocation. That write commits the previously virtual-only pages to physical RAM → RSS rises while the runtime's "heap in use" counter is unchanged. Both observations (virtual≈resident convergence; large pointer-bearing workloads hit hardest) fall out of this explanation (concepts/go-runtime-memory-model).
- `GOEXPERIMENT` flags function as targeted A/B levers for runtime hypotheses. Disabling Swiss Tables (`GOEXPERIMENT=noswissmap`) and the spin-bit mutex (`GOEXPERIMENT=nospinbitmutex`) in test builds let the team rule out the two headline 1.24 changes in hours, without reverting the upgrade, before the deeper investigation started (patterns/bisect-driven-regression-hunt).
- The debugging workflow was an externalized bisect with an upstream collaborator. (a) Production signal. (b) Staging bisect on Go version. (c) `GOEXPERIMENT` A/B on suspected features. (d) `runtime/metrics` diff → no signal. (e) Drop one layer: `/proc/[pid]/smaps` → only the heap VMA regressed. (f) Live heap profile → workload shape (large + pointers + channels/maps). (g) Read the Go 1.24 changelog with that shape in mind → `mallocgc` refactor as a hypothesis. (h) Gophers Slack thread → PJ Malloy runs `heapbench` across the allocation-shape matrix → targeted repro confirms the shape. (i) `git bisect` inside the Go repo → CL 614257. (j) Upstream issue filed → Go team confirms and fixes (CL 659956). (k) Cherry-pick the fix, validate on the original service, report back. This is the canonical shape of patterns/bisect-driven-regression-hunt for runtime/library regressions.
- Impact: the highest-traffic environment improved anyway. After shipping 1.24 (regression included) across production, low-traffic pods showed virtual memory converging toward RSS (as predicted by the regression model), but the highest-traffic environment saw virtual memory drop ~1 GiB/pod (~20%) and RSS drop ~600 MiB/pod (~12%) — net gains attributed to Swiss Tables' reduced overhead on large maps (the subject of the follow-up post). The regression did not block rollout; it reshaped predictions so the team could validate headroom before each stage.
- Language-runtime regressions are a distinct risk class for fleet-upgraded managed-memory services. A compiler/runtime version bump is a one-line diff with fleet-wide blast radius; the bug class here — extra page commits triggered by a specific allocation shape — is invisible to the program and to the runtime, surfaces only under real production workloads, and is gated by Kubernetes memory limits rather than by application correctness. The mitigation is not code review but environment-level bisect + OS-level memory observation as standard practice for toolchain upgrades.
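The metrics-vs-RSS gap in the first takeaway can be made concrete with a small probe. This is a hedged sketch, not code from the article: the metric name and `/proc/self/status` are standard Go/Linux interfaces, but the helper names are illustrative. Linux-only.

```go
// A sketch (not the article's code): compare the Go runtime's own heap
// accounting with the OS-level resident set, the gap that made this
// regression invisible to runtime metrics.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime/metrics"
	"strconv"
	"strings"
)

// runtimeHeapBytes returns the runtime's view of live heap object memory,
// read via the runtime/metrics package (available since Go 1.16).
func runtimeHeapBytes() uint64 {
	samples := []metrics.Sample{{Name: "/memory/classes/heap/objects:bytes"}}
	metrics.Read(samples)
	return samples[0].Value.Uint64()
}

// residentSetBytes reads VmRSS from /proc/self/status, the number the
// kernel (and therefore the OOM killer and Kubernetes limits) acts on.
func residentSetBytes() (uint64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			fields := strings.Fields(sc.Text()) // "VmRSS:  123456 kB"
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("VmRSS not found in /proc/self/status")
}

func main() {
	rss, err := residentSetBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A persistent divergence between these two numbers is exactly the
	// "invisible to the runtime" signal the post describes.
	fmt.Printf("runtime heap objects: %d bytes\n", runtimeHeapBytes())
	fmt.Printf("OS resident set:      %d bytes\n", rss)
}
```

Diffing these two numbers before and after a toolchain upgrade surfaces the class of regression described here without waiting for pods to hit their memory limits.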
Architectural details / numbers
- Memory-region identification in `/proc/[pid]/smaps`: the Go heap is typically the first large r/w anonymous mapping near the executable's address space; its Size ≈ total Go virtual memory, its RSS ≈ physical pages committed into it. A recent upstream change by Lénaïc Huard (Datadog Container Integrations) labels Go-allocated memory regions in `maps`/`smaps` for easier identification.
- The regression's algebra: for a pointer-bearing allocation of size s > 32 KiB obtained fresh from the OS, Go pre-1.24 issued 0 stores (the OS pages are already zero); Go 1.24 post-CL-614257 issued a full s-byte `memclr`, faulting in ⌈s / page_size⌉ pages. All of the pages that the program would not otherwise have touched become resident.
- Why the highest-traffic env dropped despite the regression: Swiss Tables' more compact map layout reduced the working set of a large in-memory map by enough to outweigh the (~20%) zero-storm cost. Lower-traffic pods have smaller maps → no offsetting Swiss Tables win → net ~20% regression. This mixed signal is itself a lesson: a fleet-wide toolchain upgrade does not have a single "memory delta" number; the delta is a function of workload shape.
- The fix (CL 659956) restored the skip-zeroing optimization for OS-fresh memory and, in the process, tightened the memory-ordering story around GC'd allocations to ensure the visible-pointer / allocation-type bits commit before the GC can observe the allocation as live. Shipping in Go 1.25; Go issue #73800 tracks the 1.24 backport.
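The VMA-level localization described above can be approximated in-process. A minimal sketch assuming a Linux `/proc` layout; the header-line heuristic and all names are illustrative assumptions, not the article's tooling:

```go
// Sketch: scan /proc/self/smaps for the largest anonymous read-write
// mapping (on a typical Go process, the heap arena) and report its
// Size vs Rss, the comparison that localized the regression.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kbValue parses an smaps attribute line like "Size:  123456 kB"
// and returns the kB figure.
func kbValue(line string) uint64 {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return 0
	}
	v, _ := strconv.ParseUint(fields[1], 10, 64)
	return v
}

func main() {
	f, err := os.Open("/proc/self/smaps")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	var curAddr string
	var curAnonRW bool
	var curSize, curRSS uint64
	var bestAddr string
	var bestSize, bestRSS uint64

	// flush records the current VMA if it is the largest anon r/w one so far.
	flush := func() {
		if curAnonRW && curSize > bestSize {
			bestAddr, bestSize, bestRSS = curAddr, curSize, curRSS
		}
	}

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		fields := strings.Fields(line)
		if len(fields) >= 2 && strings.Contains(fields[0], "-") && len(fields[1]) == 4 {
			// VMA header, e.g. "c000000000-c004000000 rw-p 00000000 00:00 0"
			flush()
			curAddr = fields[0]
			// anonymous read-write mapping: rw-p perms and no backing path
			curAnonRW = fields[1] == "rw-p" && len(fields) == 5
			curSize, curRSS = 0, 0
		} else if strings.HasPrefix(line, "Size:") {
			curSize = kbValue(line)
		} else if strings.HasPrefix(line, "Rss:") {
			curRSS = kbValue(line)
		}
	}
	flush()
	// A Size far above Rss means uncommitted virtual memory; the regression
	// showed up as the two converging on the heap VMA.
	fmt.Printf("largest anon rw VMA %s: Size=%d kB Rss=%d kB\n",
		bestAddr, bestSize, bestRSS)
}
```

The same scan run against another process's `/proc/[pid]/smaps` reproduces the article's Size-vs-Rss comparison without attaching a debugger.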
Caveats / open questions
- The post is Part 1 of 2. Part 2 covers the Swiss Tables win in the highest-traffic env. For this ingest we take only Part 1's scope.
- The article does not quantify how long each investigation step took, but the sequence (bisect → `GOEXPERIMENT` A/B → `smaps` → heap profile → Slack thread → `heapbench` bisect → upstream issue → fix → cherry-pick validate) is explicitly presented as the runbook for regressions of this class.
- Whether the original `mallocgc` CL had benchmarks covering large pointer-bearing channel/map allocations is not discussed; the implicit lesson is that runtime-internal refactors are hard to regression-test against arbitrary user workload shapes, which is why upstream collaboration is the realistic defence.
- The regression affects only allocations fresh from the OS — pages already reused by the allocator would not be zero. So a steady-state long-running process with a stable allocator footprint sees less impact than a process churning through large pointer-bearing allocations.
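The triggering allocation shape can be sketched in the spirit of `heapbench`; this is an illustrative model only (the struct layout, sizes, and `pagesCommitted` helper are assumptions, not thepudds' benchmark):

```go
// Illustrative model of the regressed allocation shape: large (>32 KiB)
// and pointer-bearing. Under the regressed mallocgc, each such allocation
// was memclr'd even when its backing pages were OS-fresh (already zero),
// faulting in ceil(s / pageSize) pages that would otherwise stay
// virtual-only.
package main

import "fmt"

// ptrStruct is pointer-bearing: the pointer field forces the GC to treat
// the type as containing pointers, which is the regressed path.
type ptrStruct struct {
	p *int     // any pointer field suffices
	_ [56]byte // pad to 64 bytes so a 1024-element buffer exceeds 32 KiB
}

const pageSize = 4096 // typical Linux page size; an assumption here

// pagesCommitted models the regression's cost: the number of pages an
// unconditional zeroing write of an s-byte allocation forces resident.
func pagesCommitted(s int) int {
	return (s + pageSize - 1) / pageSize
}

func main() {
	// A buffered channel of pointer structs, roughly the dominant heap
	// shape in the affected service (~50% buffered channels).
	const elems = 1024
	ch := make(chan ptrStruct, elems) // backing buffer ~64 KiB > 32 KiB
	_ = ch

	size := elems * 64 // element size assumed 64 bytes for illustration
	fmt.Printf("~%d-byte allocation: up to %d pages committed by re-zeroing\n",
		size, pagesCommitted(size))
}
```

Dropping the pointer field (or shrinking the buffer below 32 KiB) moves the allocation off the regressed path, which is exactly the matrix result `heapbench` confirmed.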
Links
- Original post: https://www.datadoghq.com/blog/engineering/go-memory-regression/
- Raw: raw/datadog/2025-07-17-how-we-tracked-down-a-go-124-memory-regression-a59b707f.md
- HN discussion: https://news.ycombinator.com/item?id=44597550 (191 points)
- Part 2 (pending future ingest): https://www.datadoghq.com/blog/engineering/go-swiss-tables/
- Regression CL (cause): https://go.dev/cl/614257
- Fix CL: https://go.dev/cl/659956
- Upstream issue: https://github.com/golang/go/issues/72991
- 1.24 backport tracking: https://github.com/golang/go/issues/73800
- Go memory metrics primer (Datadog, earlier post): https://www.datadoghq.com/blog/go-memory-metrics/
- `heapbench` (PJ Malloy / thepudds): https://github.com/thepudds/heapbench/tree/dev-go124-rss-regression