PATTERN · Cited by 1 source
Flamegraph to upstream fix¶
Intent¶
End-to-end debugging discipline for the class of incidents where a CPU-burning, low-work symptom in your service is caused by a bug several layers deep in an open-source dependency. The arc:
- Incident: tripwire fires (CPU %, error rate, some downstream SLO). Tactical mitigation (process bounce, traffic shedding, partner stops load-testing) buys time but does not resolve.
- Flamegraph from an angry host. Read the rank-ordering of hot frames.
- Shape-recognise the pathology. Infrastructure frames dominating the profile + low kernel time + monotonically pegged CPU → CPU busy-loop (often a spurious-wakeup loop).
- Type-signature narrow. Fully-qualified async-Rust / templated-C++ / generic-Scala frames give you the full wrapper chain around the bug. Audit your own wrappers first (recent changes, reproducibility), isolate the one foreign dependency left.
- Find the issue in the upstream tracker (often it's known — tokio-rustls had issue #72 open for this class of Waker bug).
- Fix it upstream. Submit a PR to the dependency, not a patch to your own fork.
- Validate in the real world. Resume the trigger (partner load test, replayed traffic) on the patched build; confirm the symptom doesn't come back.
- Instrument for next time (patterns/spurious-wakeup-metric and/or targeted profile-on-alarm): the bug class is supposed to be rare, so cheap detection should exist.
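The busy-loop pathology in the "shape-recognise" step can be sketched in miniature. This is a hypothetical future, not Fly.io's code: its `poll()` schedules an immediate re-wake but never makes progress, so an executor spins on it at 100% user-space CPU with almost no syscalls — exactly the flamegraph shape the arc describes (infrastructure frames hot, kernel time low).

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical buggy future: wakes itself on every poll without progress.
struct SpinningFuture {
    polls: u64,
}

impl Future for SpinningFuture {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls += 1;
        // The bug: re-wake even though nothing changed, so the runtime
        // polls again immediately, forever.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// A no-op waker so we can drive poll() by hand, without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = SpinningFuture { polls: 0 };
    // Simulate 1_000 scheduler passes: every one is a wasted wakeup.
    for _ in 0..1_000 {
        assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    }
    println!("wasted polls: {}", fut.polls); // prints "wasted polls: 1000"
}
```

On a real runtime this shows up as a task that is always runnable: the scheduler never parks, which is why the symptom is pegged CPU with very little actual work done.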
Canonical instance¶
Fly.io, 2025-02 (Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy):
- IAD edge hosts pegged CPU + HTTP errors.
- Pavel pulled a flamegraph.
- `tracing::Subscriber` dominance → "something's haywire" → busy-loop signature.
- The nested `Future` type named `tokio_rustls::server::TlsStream<…>` as the one non-own-code layer left.
- Pre-existing issue tokio-rustls#72 matched the profile — `close_notify` with buffered trailer.
- rustls PR #1950 shipped the fix upstream.
- Tigris (the partner whose load test triggered it) resumed — "no spin-outs."
- Fly.io committed to adding a spurious-wakeup metric in their own instrumentation going forward.
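The spurious-wakeup metric Fly.io committed to can be sketched roughly as a poll-counting wrapper — all names here (`WakeupMetered`, `PENDING_POLLS`, `PendingN`) are invented for illustration, not their implementation. The idea: count polls that return `Pending`; in a healthy future the counter tracks real readiness events, while in a spurious-wakeup busy-loop it climbs without bound while throughput stays flat, which makes it a cheap alarm signal.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Global counter standing in for a real metrics pipeline.
static PENDING_POLLS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical wrapper: counts every poll that returns Pending.
struct WakeupMetered<F>(F);

impl<F: Future + Unpin> Future for WakeupMetered<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        match Pin::new(&mut self.0).poll(cx) {
            Poll::Pending => {
                PENDING_POLLS.fetch_add(1, Ordering::Relaxed);
                Poll::Pending
            }
            ready => ready,
        }
    }
}

/// Toy inner future: Pending for `n` polls, then Ready.
struct PendingN(u32);
impl Future for PendingN {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 == 0 {
            Poll::Ready(())
        } else {
            self.0 -= 1;
            Poll::Pending
        }
    }
}

/// No-op waker so the sketch runs without an async runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = WakeupMetered(PendingN(3));
    while Pin::new(&mut fut).poll(&mut cx).is_pending() {}
    // Three Pending polls before the future completed.
    println!("pending polls: {}", PENDING_POLLS.load(Ordering::Relaxed));
}
```

A production version would export the counter per task or per connection and alarm on its rate, per the "cheap detection should exist" principle in the Intent section.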
Why upstream, not around¶
For the specific case of rustls, the Fly.io post is blunt: "TlsStream is an ultra-important, load-bearing function in the Rust ecosystem. Everybody uses it." The bug is an ecosystem liability, not a Fly.io liability. Patching your own fork optimises for this service today and leaves the landmine for every other Rust-TLS user. The upstream-the-fix pattern recognises that when the bug is in a shared primitive, your fix delivers leverage proportional to the primitive's footprint.
Applicability¶
- Strongly applicable: production services on load-bearing-ecosystem async runtimes (Tokio, `asyncio`, Node.js, Go runtime), TLS libraries (systems/rustls, OpenSSL, BoringSSL), generic-C++ compile chains (templated containers, allocators).
- Less applicable: your service is the application; the bug is at your altitude; no one downstream of you would inherit it.
Sibling patterns¶
- patterns/upstream-the-fix — generalised governance shape (Cloudflare V8/Node.js/OpenNext, Datadog containerd/kubernetes/go-cmp, this). This pattern is the diagnostic workflow that ends with an upstream-the-fix contribution.
- patterns/spurious-wakeup-metric — the follow-up instrumentation commitment.
- patterns/measurement-driven-micro-optimization — same profile-first discipline at a different goal (performance, not correctness).
Seen in¶
- sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — canonical wiki instance. Symptom (CPU + HTTP errors in IAD) → flamegraph from angry proxy → `tracing::Subscriber` hot frames → `Future` type points at `tokio_rustls::TlsStream` → known issue on upstream → PR submitted → partner load test resumed clean.
Related¶
- concepts/flamegraph-profiling — the diagnostic tool.
- concepts/cpu-busy-loop-incident / concepts/spurious-wakeup-busy-loop — the incident class this pattern resolves.
- patterns/upstream-the-fix — the governance end-state.
- patterns/spurious-wakeup-metric — the instrumentation follow-up.
- systems/fly-proxy / systems/rustls / systems/tokio-rustls / systems/tokio — the stack of the canonical instance.
- companies/flyio