CONCEPT Cited by 3 sources

Flamegraph profiling

Definition

Flamegraph profiling is the practice of sampling a running process's stack at high frequency and rendering the aggregated stacks as a flamegraph: a visualisation whose horizontal axis spans the collected samples and whose vertical axis is stack depth, so the width of each frame is proportional to the share of samples (and hence, approximately, the time) spent with that frame on the stack. The format was invented and popularised by Brendan Gregg.

The point is not the image per se but the rank-ordering of where CPU goes. Bugs of the "something is burning a core but we don't know what" shape — canonically CPU busy-loop incidents — are diagnosed almost entirely by reading the top of the flamegraph.
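As a rough illustration of the aggregation step, the sketch below turns raw stack samples into flamegraph box widths. It assumes each sample is a list of frame names ordered root first; `frame_widths` and the sample frame names are hypothetical, and the `;`-joined keys follow the folded-stack convention used by common flamegraph tooling.

```rust
use std::collections::BTreeMap;

/// For every call path (root;...;frame), count the samples that contain it.
/// Each count is the width of one flamegraph box, measured in samples.
fn frame_widths(samples: &[Vec<&str>]) -> BTreeMap<String, usize> {
    let mut widths = BTreeMap::new();
    for stack in samples {
        let mut path = String::new();
        for frame in stack {
            if !path.is_empty() {
                path.push(';');
            }
            path.push_str(frame);
            // Every prefix of the stack gets one more sample of width.
            *widths.entry(path.clone()).or_insert(0) += 1;
        }
    }
    widths
}

fn main() {
    // Hypothetical samples, root frame first.
    let samples = vec![
        vec!["main", "serve", "parse"],
        vec!["main", "serve", "parse"],
        vec!["main", "serve", "render"],
        vec!["main", "idle"],
    ];
    let total = samples.len() as f64;
    for (path, width) in frame_widths(&samples) {
        println!("{:<20} {:>5.1}%", path, 100.0 * width as f64 / total);
    }
}
```

A box's depth is the number of frames in its path; its width is the sample count divided by the total number of samples.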

Async-state-machine signature

In async-Rust / Tokio stacks, the tell-tale signature of a spurious-wakeup busy-loop is that the flamegraph is dominated by infrastructure, not business logic:

  • tracing::Subscriber::enter / exit frames (span enter/exit is supposed to be very fast)
  • Tokio poll frames with no meaningful work beneath them
  • libc syscalls that return almost immediately without doing I/O

As Fly.io describes:

"If the mere act of entering a span in a Tokio stack is chewing up a significant amount of CPU, something has gone haywire: the actual code being traced must be doing next to nothing."

The inversion — infrastructure in the hot path, business logic invisible — is the fingerprint.
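To make the busy-loop shape concrete, here is a minimal sketch that uses only the standard library rather than Tokio. `SpuriousWakeup`, `CountingWaker`, and `run_to_completion` are hypothetical names, and the code does not reproduce Fly.io's bug; it only shows a future that requests another poll without having made progress, so the executor spins on poll calls that do no work.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future that does no work but always asks to be polled again.
struct SpuriousWakeup {
    polls_left: usize, // safety valve so the demo terminates
}

impl Future for SpuriousWakeup {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polls_left == 0 {
            return Poll::Ready(());
        }
        self.polls_left -= 1;
        // The bug shape: claim "ready to progress" when nothing has changed.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that only counts how often it is woken.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Toy executor: re-polls only when the future was woken since the last poll.
/// Returns (number of polls, number of wake-ups).
fn run_to_completion<F: Future<Output = ()>>(fut: F) -> (usize, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        let before = counter.wakes.load(Ordering::SeqCst);
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
        if counter.wakes.load(Ordering::SeqCst) == before {
            break; // no wake-up: a real executor would park the task here
        }
    }
    (polls, counter.wakes.load(Ordering::SeqCst))
}

fn main() {
    let (polls, wakes) = run_to_completion(SpuriousWakeup { polls_left: 1000 });
    println!("polls = {polls}, wakes = {wakes}");
}
```

In a profile of a loop like this, nearly all samples land in the executor's poll path and in whatever instrumentation wraps each poll, which is the inversion described above.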

Using the type signature

Languages that monomorphise generics (Rust, templated C++) produce symbol names that carry the fully-qualified, fully-instantiated type of each stack frame, and those names appear verbatim in the flamegraph. For async Rust that often means the whole nested-Future type shows up as a single frame. Fly.io's 2025-02 case:

&mut fp_io::copy::Duplex<&mut fp_io::reusable_reader::ReusableReader<
  fp_tcp::peek::PeekableReader<
    tokio_rustls::server::TlsStream<
      fp_tcp_metered::MeteredIo<
        fp_tcp::peek::PeekableReader<
          fp_tcp::permitted::PermittedTcpStream>>>>>,
  connect::conn::Conn<tokio::net::tcp::stream::TcpStream>>

Reading this top-to-bottom gives the exact wrapper chain around the bug. Since Fly's own wrappers (Duplex, ReusableReader, PeekableReader, MeteredIo, PermittedTcpStream) could be audited for recent changes and reproducibility, the suspect narrowed to one foreign layer: tokio_rustls::server::TlsStream.
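Reading the chain off a frame name can also be done mechanically. The sketch below is a hypothetical helper (`wrapper_chain` is not part of any profiler) that splits a monomorphised type name on its angle brackets and commas and reports each layer's crate and type name, outermost first. It is a plain string split, so it assumes simple generic paths like the ones in the frame above.

```rust
/// The frame name from the Fly.io case, as quoted above.
const SIGNATURE: &str = "&mut fp_io::copy::Duplex<&mut fp_io::reusable_reader::ReusableReader<
  fp_tcp::peek::PeekableReader<
    tokio_rustls::server::TlsStream<
      fp_tcp_metered::MeteredIo<
        fp_tcp::peek::PeekableReader<
          fp_tcp::permitted::PermittedTcpStream>>>>>,
  connect::conn::Conn<tokio::net::tcp::stream::TcpStream>>";

/// Split a nested type name into (crate, type) layers in source order.
fn wrapper_chain(type_name: &str) -> Vec<(String, String)> {
    type_name
        .split(|c: char| matches!(c, '<' | '>' | ','))
        // Drop whitespace and reference markers around each path.
        .map(|part| part.trim().trim_start_matches("&mut ").trim_start_matches('&'))
        .filter(|path| !path.is_empty())
        .map(|path| {
            let krate = path.split("::").next().unwrap_or(path);
            let name = path.rsplit("::").next().unwrap_or(path);
            (krate.to_string(), name.to_string())
        })
        .collect()
}

fn main() {
    for (krate, name) in wrapper_chain(SIGNATURE) {
        println!("{:<16} {}", krate, name);
    }
}
```

Grouping the output by crate separates the layers a team owns from the ones it imports, which is the narrowing step described above.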

Second use: rank-ordering optimization targets

Beyond the "something is on fire" use case, flamegraphs are the primary instrument for picking the right thing to optimize. On a large service, thousands of functions run, but only a handful cost enough CPU to justify engineering time. Ordering frames by width makes the pick close to mechanical: start with the widest.
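The ranking itself is simple to compute from folded-stack text, where each line is a `;`-separated stack followed by a sample count. The following is a minimal sketch under that assumption; `rank_functions` and the frame names in the sample data are hypothetical.

```rust
use std::collections::HashMap;

/// Rank functions by inclusive share of samples, largest first.
/// Input lines look like "root;child;leaf 42"; malformed lines are skipped.
fn rank_functions(folded: &str) -> Vec<(String, f64)> {
    let mut inclusive: HashMap<&str, u64> = HashMap::new();
    let mut total: u64 = 0;
    for line in folded.lines() {
        let (stack, count) = match line.trim().rsplit_once(' ') {
            Some(parts) => parts,
            None => continue,
        };
        let count: u64 = match count.parse() {
            Ok(n) => n,
            Err(_) => continue,
        };
        total += count;
        // Count each function once per stack, even if it recurses.
        let mut seen: Vec<&str> = Vec::new();
        for frame in stack.split(';') {
            if !seen.contains(&frame) {
                seen.push(frame);
                *inclusive.entry(frame).or_insert(0) += count;
            }
        }
    }
    let mut ranked: Vec<(String, f64)> = inclusive
        .into_iter()
        .map(|(name, n)| (name.to_string(), n as f64 / total as f64))
        .collect();
    // Widest first; break ties by name so the order is stable.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    ranked
}

fn main() {
    // Hypothetical folded stacks.
    let folded = "main;serve;score;dot 60\n\
                  main;serve;score;encode 15\n\
                  main;serve;render 20\n\
                  main;gc 5";
    for (name, share) in rank_functions(folded) {
        println!("{:<8} {:>5.1}%", name, share * 100.0);
    }
}
```

Entry points such as `main` always rank first by inclusive share, so in practice the interesting candidates are the widest frames below the framework and dispatch layers.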

Netflix Ranker's 2026-03 video serendipity scoring optimization started with exactly this step:

"When we looked at CPU profiles for this service, one feature kept standing out: video serendipity scoring — the logic that answers a simple question: 'How different is this new title from what you've been watching so far?' This single feature was consuming about 7.5% of total CPU on each node running the service."

"A flamegraph made it clear: One of the top hotspots in the service was Java dot products inside the serendipity encoder. Algorithmically, the hotspot was a nested loop structure of M candidates × N history items where each pair generates its own cosine similarity — i.e. O(M×N) separate dot product operations."

(Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api)

The flamegraph answered two questions:

  1. Which function to optimize (serendipity encoder at 7.5% of CPU, not some other candidate).
  2. Which operation inside that function dominated (dot products in a nested loop — a structural hint that told Netflix to reshape the computation rather than improve the inner loop).
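The structural hint in the second point can be sketched as follows. This is not Netflix's code (the service is Java and the source mentions the JDK Vector API); it is a Rust illustration, with hypothetical function names, of the same shape: a nested loop that computes one cosine similarity per pair, versus normalising each vector once and computing all pairs as a single M×N matrix product.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hotspot shape: every (candidate, history) pair is a separate cosine
/// similarity that recomputes both norms. Assumes non-zero vectors.
fn pairwise_cosine(cands: &[Vec<f32>], hist: &[Vec<f32>]) -> Vec<Vec<f32>> {
    cands
        .iter()
        .map(|c| {
            hist.iter()
                .map(|h| dot(c, h) / (dot(c, c).sqrt() * dot(h, h).sqrt()))
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Scale every row to unit length, once per matrix.
fn normalize_rows(rows: &[Vec<f32>]) -> Vec<Vec<f32>> {
    rows.iter()
        .map(|r| {
            let norm = dot(r, r).sqrt();
            r.iter().map(|x| x / norm).collect::<Vec<f32>>()
        })
        .collect()
}

/// Reshaped form: normalise once, then compute C * H^T as one matrix product.
/// The product is written as plain loops here; the point is that it is now a
/// single dense operation that could be handed to an optimised kernel.
fn batched_cosine(cands: &[Vec<f32>], hist: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let c = normalize_rows(cands);
    let h = normalize_rows(hist);
    c.iter()
        .map(|ci| h.iter().map(|hj| dot(ci, hj)).collect::<Vec<f32>>())
        .collect()
}

fn main() {
    let cands = vec![vec![1.0, 0.0], vec![1.0, 1.0]];
    let hist = vec![vec![0.0, 2.0], vec![3.0, 3.0], vec![1.0, 0.0]];
    println!("{:?}", pairwise_cosine(&cands, &hist));
    println!("{:?}", batched_cosine(&cands, &hist));
}
```

Both functions return the same similarity matrix up to floating-point rounding; the difference is in how the work is organised, which is what the flamegraph pointed at.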

This is the canonical wiki instance of the flamegraph as the target-selection instrument at the start of a measurement-driven optimization loop, distinct from the Fly.io case where the flamegraph diagnosed a bug (infrastructure-in-hot-path fingerprint of spurious-wakeup busy-loop).

Seen in

  • sources/2025-02-26-flyio-taming-a-voracious-rust-proxy — Pavel on Fly.io's proxy team pulled a flamegraph from an angry fly-proxy; tracing::Subscriber dominance was the "something is wrong" indicator; the Future type signature pointed at tokio_rustls::server::TlsStream as the guilty layer. Textbook flamegraph-as-diagnostic.
  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api: Flamegraph-as-target-selector. Netflix Performance Engineering used CPU profiles on Ranker to identify video serendipity scoring at 7.5% of total node CPU and further localised the cost to Java dot products in a nested M×N loop structure, which led to the batched-matmul reshape and JDK Vector API kernel swap. Canonical wiki instance of flamegraph-driven target selection at the start of the measurement loop.
  • sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans: Flamegraph-as-agent-consumption-format-problem. Anthony Shew's 2026-04-21 Turborepo performance retrospective canonicalises the format-for-agent-consumption axis of profile-driven optimisation: the same underlying flame-graph span data, emitted as a companion Markdown version (line-per-record, grep-friendly, agent-optimised) instead of Chrome Trace Event Format JSON (Perfetto-loadable, UI-optimised), produces "radically better optimization suggestions" from the same model + agent harness. Opens the agent-reader altitude of flamegraph consumption alongside the prior human-reader altitudes (Fly.io 2025-02 diagnostic, Netflix 2026-03 target selection). See patterns/markdown-profile-output-for-agents for the companion-emit pattern.

  • sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology: Hyperscale producer altitude. Meta's Strobelight is the canonical fleet-orchestrated flame-graph substrate: 42+ profilers running on every production host feed flame-graph-ready data into Scuba within seconds of capture, with frame pointers enabled fleet-wide and delayed symbolization via a central service. First canonical wiki instance of flame-graph profiling at hyperscaler fleet scale (tens of thousands of services, every host always-on), in contrast with the Fly.io / Cloudflare / Netflix / Vercel instances, which are per-service or per-team. Extended by Stack Schemas (query-time tagging) + Strobemeta (sample-time request-context attach) so a flame-graph can be filtered to "p99 requests on endpoint X" without post-hoc trace joins.

Last updated · 542 distilled / 1,571 read