
PATTERN

Flamegraph to upstream fix

Intent

End-to-end debugging discipline for the class of incidents where a CPU-burning, low-work symptom in your service is caused by a bug several layers deep in an open-source dependency. The arc:

  1. Incident: tripwire fires (CPU %, error rate, some downstream SLO). Tactical mitigation (process bounce, traffic shedding, partner stops load-testing) buys time but does not resolve.
  2. Flamegraph from an angry host. Read the rank-ordering of hot frames.
  3. Shape-recognise the pathology. Infrastructure frames dominating the profile + low kernel time + CPU pegged flat at 100% → busy-loop (spinning in user space without doing useful work).
  4. Type-signature narrow. Fully-qualified async-Rust / templated-C++ / generic-Scala frames expose the full wrapper chain around the bug. Audit your own wrappers first (recent changes, reproducibility), then isolate the one foreign dependency left.
  5. Find the issue in the upstream tracker (often it's already known: tokio-rustls had issue #72 open for this class of Waker bug).
  6. Fix it upstream. Submit a PR to the dependency, not a patch to your own fork.
  7. Validate in the real world. Resume the trigger (partner load test, replayed traffic) on the patched build; confirm the symptom doesn't come back.
  8. Instrument for next time (patterns/spurious-wakeup-metric and/or targeted profile-on-alarm): the bug class is supposed to be rare, so cheap detection should exist.
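Step 8's cheap detection can be sketched as a wrapper future that counts Pending polls: a healthy task accrues roughly one per real wakeup, while a busy-looping one accrues them at CPU speed, so an alert on the counter's rate is an inexpensive tripwire. A minimal sketch, assuming a hand-rolled no-op waker so it runs without a runtime (the metric name, wrapper, and toy future are illustrative, not Fly.io's actual instrumentation):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical global metric: polls that returned Pending. In production
// this would feed a counter with an alert on its rate.
static PENDING_POLLS: AtomicU64 = AtomicU64::new(0);

// Wrapper future: bumps the metric on every poll that returns Pending.
struct WakeupCounter<F>(F);

impl<F: Future + Unpin> Future for WakeupCounter<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        match Pin::new(&mut self.0).poll(cx) {
            Poll::Pending => {
                PENDING_POLLS.fetch_add(1, Ordering::Relaxed);
                Poll::Pending
            }
            ready => ready,
        }
    }
}

// Toy future standing in for real I/O: Pending twice, then Ready.
struct TwoPends(u8);
impl Future for TwoPends {
    type Output = u8;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u8> {
        if self.0 < 2 {
            self.0 += 1;
            Poll::Pending
        } else {
            Poll::Ready(self.0)
        }
    }
}

// No-op waker so the sketch runs without an async runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Drives a future to completion, returning how many Pending polls it cost.
fn drive(mut fut: impl Future + Unpin) -> u64 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let before = PENDING_POLLS.load(Ordering::Relaxed);
    loop {
        if Pin::new(&mut fut).poll(&mut cx).is_ready() { break; }
    }
    PENDING_POLLS.load(Ordering::Relaxed) - before
}

fn main() {
    // Healthy future: exactly one Pending per wakeup it genuinely needed.
    println!("pending polls: {}", drive(WakeupCounter(TwoPends(0))));
}
```

In a real runtime the wrapper would sit around task futures at spawn time; the point is only that the signal is cheap to collect relative to profiling every host.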

Canonical instance

Fly.io, 2025-02 (Source: sources/2025-02-26-flyio-taming-a-voracious-rust-proxy):

  1. IAD edge hosts pegged CPU + HTTP errors.
  2. Pavel pulled a flamegraph.
  3. tracing::Subscriber dominance → "something's haywire" → busy-loop signature.
  4. The nested Future type named tokio_rustls::server::TlsStream<…> as the one non-own-code layer left.
  5. Pre-existing issue tokio-rustls#72 matched the profile — close_notify with buffered trailer.
  6. rustls PR #1950 shipped the fix upstream.
  7. Tigris (the partner whose load test triggered it) resumed — "no spin-outs."
  8. Fly.io committed to adding a spurious-wakeup metric in their own instrumentation going forward.
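The bug class in step 3's signature reduces to a future that wakes itself and returns Pending without any state change, so the executor re-polls it immediately, forever: 100% CPU, near-zero kernel time, and the runtime's own frames dominate the flamegraph. A deliberately broken reduction of that shape (an illustration of the bug class, not the actual tokio-rustls code):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Bug shape: self-wake + Pending with no progress. Stands in for a
// close_notify flush that can never complete because buffered data
// is never drained.
struct StuckCloseNotify;

impl Future for StuckCloseNotify {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        cx.waker().wake_by_ref(); // "poll me again" ...
        Poll::Pending             // ... without having done anything
    }
}

// No-op waker so the demo runs without an async runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// A real executor would loop until Ready; cap it so the demo halts.
fn spin_count(cap: u32) -> u32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = StuckCloseNotify;
    let mut spins = 0;
    for _ in 0..cap {
        if Pin::new(&mut fut).poll(&mut cx).is_ready() { break; }
        spins += 1;
    }
    spins
}

fn main() {
    println!("polls without progress: {}", spin_count(10_000));
}
```

Every iteration of that loop burns a full poll's worth of user-space work, which is exactly why infrastructure frames (here they would be the executor and any tracing around poll) dominate the profile.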

Why upstream, not around

For the specific case of rustls, the Fly.io post is blunt: "TlsStream is an ultra-important, load-bearing function in the Rust ecosystem. Everybody uses it." The bug is an ecosystem liability, not a Fly.io liability. Patching your own fork optimises for this service today and leaves the landmine armed for every other Rust-TLS user. The upstream-the-fix pattern recognises that when the bug is in a shared primitive, your fix's leverage is proportional to the primitive's footprint.
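Upstreaming does not mean waiting on the release cycle to validate: Cargo's `[patch]` mechanism can point the whole dependency graph at the in-flight fix branch, so the patched build in step 7 runs the upstream code rather than a private fork. A hypothetical fragment (the git URL and branch name are placeholders, not the real PR branch):

```toml
# Cargo.toml of the service under test. Every transitive use of rustls
# in the graph resolves to the fix branch until the PR is released.
[patch.crates-io]
rustls = { git = "https://github.com/your-org/rustls", branch = "fix-close-notify" }
```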

Applicability

  • Strongly applicable: production services on load-bearing-ecosystem async runtimes (Tokio, asyncio, Node.js, Go runtime), TLS libraries (systems/rustls, OpenSSL, BoringSSL), generic-C++ compile chains (templated containers, allocators).
  • Less applicable: your service is the application; the bug is at your altitude; no one downstream of you would inherit it.

Sibling patterns

Seen in

  • sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — canonical wiki instance. Symptom (CPU + HTTP-errors in IAD) → flamegraph from angry proxy → tracing::Subscriber hot frames → Future type points at tokio_rustls::TlsStream → known issue on upstream → PR submitted → partner load-test resumed clean.