CONCEPT Cited by 1 source

CPU vs Real flame graph

A CPU flame graph and a Real (wall-clock) flame graph come from two distinct profiling modes. They look superficially identical (the same horizontal stacks of frames) but answer different questions about the same running program. Conflating the two is the classic source of false leads of the form "my flame graph shows 45 % of CPU in this function, but optimising it only saved 5 %".

The distinction

  • CPU flame graph (the default, e.g. Linux perf record without flags, Brendan Gregg's classic flame graphs). Samples a thread's stack only when the thread is on-CPU. The width of each frame represents CPU time spent with that frame on the stack. Threads that are idle, blocked on a mutex, blocked on I/O, blocked on a syscall, or sleeping contribute zero samples during their wait.
  • Real flame graph (also called "wall-clock" or off-CPU + on-CPU). Samples every thread on each sampling tick, regardless of state. Width represents wall-clock time (or thread-time) spent with that frame on the stack — including time spent waiting.

Same stacks; different denominators.
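
To make the "different denominators" point concrete, here is a minimal toy sketch in Python. It is not a real profiler: the timeline, thread names and stack strings are invented for illustration, and each list entry stands for one sampling tick.

```python
from collections import Counter

# Hypothetical timeline: one entry per sampling tick, per thread.
# Each entry is (state, stack); state is "on_cpu" or "blocked".
TIMELINE = {
    "worker-1": [("on_cpu", "main;parse")] * 2 + [("blocked", "main;lock_wait")] * 8,
    "worker-2": [("on_cpu", "main;parse")] * 1 + [("blocked", "main;lock_wait")] * 9,
}

def sample(timeline, include_blocked):
    """Count stacks per tick; off-CPU ticks are skipped unless include_blocked."""
    counts = Counter()
    for ticks in timeline.values():
        for state, stack in ticks:
            if state == "on_cpu" or include_blocked:
                counts[stack] += 1
    return counts

def shares(counts):
    """Turn raw sample counts into fractions of all samples taken."""
    total = sum(counts.values())
    return {stack: n / total for stack, n in counts.items()}

cpu_profile = shares(sample(TIMELINE, include_blocked=False))   # CPU mode
real_profile = shares(sample(TIMELINE, include_blocked=True))   # Real mode

print(cpu_profile)    # {'main;parse': 1.0}
print(real_profile)   # {'main;parse': 0.15, 'main;lock_wait': 0.85}
```

In this toy run the CPU profile attributes everything to parse because it only ever sees 3 of the 20 ticks; the Real profile divides by all 20 and shows the wait.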

When the distinction is load-bearing

  • CPU profile answers: "where is CPU going?"
  • Real profile answers: "where is time going?"

If a workload's bottleneck is CPU-bound (compute, parsing, serialisation, encryption), the two profiles look similar and the CPU profile is sufficient.

If a workload's bottleneck is wait-bound (lock contention, blocking I/O, network round-trips, GC stop-the-world pauses, condition variables, page faults, futex waits), the two profiles can be wildly different. The CPU profile shows only the small fraction of work the runnable threads happen to be doing; the Real profile shows the large fraction of time the blocked threads are stuck waiting.
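
One quick way to see the CPU-bound versus wait-bound split is to compare wall-clock time with CPU time around a piece of work. The sketch below uses only the Python standard library; `time.sleep` stands in for a lock wait or blocking I/O, and the 0.2 s durations are arbitrary choices for the demo.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed while running fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def wait_bound():
    # Stand-in for a lock wait or blocking I/O: time passes, no CPU is used.
    time.sleep(0.2)

def cpu_bound():
    # Busy loop: wall-clock time and CPU time advance together.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass

wait_wall, wait_cpu = measure(wait_bound)
busy_wall, busy_cpu = measure(cpu_bound)

print(f"wait-bound: wall={wait_wall:.3f}s cpu={wait_cpu:.3f}s")
print(f"cpu-bound:  wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

For the wait-bound function the CPU time should be close to zero while the wall-clock time is about 0.2 s, which is the gap a CPU flame graph cannot show.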

Cloudflare's canonical case

Cloudflare's investigation into ClickHouse query-planner slowdown (canonical wiki source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention) illustrates the diagnostic flip almost perfectly:

"We had been generating 'CPU' traces, which only sample active threads. We switched to 'Real' traces, which sample all threads, including those that are inactive or waiting. The new flame graph was a revelation."

First flame graph (CPU mode): 45 % of leaf SELECT CPU time was spent in filterPartsByPartition, which looked like a CPU bottleneck. Cloudflare's first patch optimised that function's predicate ordering and yielded a 5 % improvement.

Second flame graph (Real mode): more than 50 % of leaf SELECT duration was spent waiting on the MergeTreeData parts mutex. That wait does not appear in the CPU profile at all, because the threads holding the lock pass through the critical section quickly (the work is fast) and the threads waiting for it are not using CPU.

The CPU profile was correct about where CPU was going. It was silent about where time was going. The five-percent first patch is the canonical "correct fix to the wrong problem" artifact this distinction exists to prevent.
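
The arithmetic behind a disappointing CPU-side fix can be sketched as follows. This is a rough estimate under a simplifying assumption: thread time splits cleanly into on-CPU time and waiting, and the fix leaves the waiting untouched. The function name, the on-CPU fraction and the speedup below are illustrative values, not measurements from the Cloudflare post.

```python
def estimated_wall_clock_saving(cpu_share, on_cpu_fraction, speedup):
    """Estimate the fraction of wall-clock time saved by speeding up one function.

    cpu_share:        the function's share of the CPU flame graph (0..1)
    on_cpu_fraction:  share of wall-clock time spent on-CPU at all (0..1), assumed known
    speedup:          how many times faster the function becomes (>= 1)
    """
    # The CPU flame graph share only applies to the on-CPU slice of wall-clock time.
    wall_share = cpu_share * on_cpu_fraction
    # Making the function `speedup` times faster removes (1 - 1/speedup) of that slice.
    return wall_share * (1 - 1 / speedup)

# Hypothetical inputs: 45 % of CPU samples, threads on-CPU a quarter of the time,
# and a patch that makes the function twice as fast.
saving = estimated_wall_clock_saving(cpu_share=0.45, on_cpu_fraction=0.25, speedup=2)
print(f"estimated wall-clock saving: {saving:.4f}")

# Even an infinitely fast function cannot save more than its wall-clock share.
ceiling = estimated_wall_clock_saving(0.45, 0.25, float("inf"))
print(f"ceiling: {ceiling:.4f}")
```

With these made-up inputs the estimate is a little under 6 % of wall-clock time, far below the 45 % the CPU flame graph seems to promise.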

Operational rules of thumb

Use these as priors when investigating a slow path:

  • Total CPU utilisation low + queries slow → wait-bound; pull a Real flame graph immediately. Common causes: lock contention, blocking I/O, network waits.
  • Total CPU utilisation high + queries slow → CPU-bound; the CPU flame graph is likely sufficient.
  • A CPU-side fix yields significantly less than the flame graph predicted → pull a Real flame graph; the delta-explaining work is in a wait state somewhere.
  • Concurrency-sensitive slowdown (slow under load, fast under single-user replay) → almost always contention; a CPU profile of the multi-user case will understate by a wide margin.
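
The rules of thumb above can be written down as a small triage helper. This is only a sketch: the function name, the 0.3 and 0.7 utilisation thresholds, the 0.5 shortfall ratio and the "both" fallback for the middle band are all assumptions made for illustration, not values from the source.

```python
def suggest_profile(cpu_utilisation, slow_only_under_load=False,
                    predicted_gain=None, observed_gain=None,
                    low_cpu=0.3, high_cpu=0.7, shortfall=0.5):
    """Suggest which flame graph to pull first: "real", "cpu" or "both".

    cpu_utilisation: total CPU utilisation as a fraction (0..1)
    predicted_gain / observed_gain: fractional speedup a CPU-side fix was
        expected to deliver versus what it actually delivered (optional)
    Thresholds are hypothetical defaults; tune them for your system.
    """
    # Slow under load but fast in single-user replay: almost always contention.
    if slow_only_under_load:
        return "real"
    # A CPU-side fix fell well short of what the flame graph predicted.
    if predicted_gain and observed_gain is not None:
        if observed_gain < shortfall * predicted_gain:
            return "real"
    # Low CPU with slow queries points at waiting.
    if cpu_utilisation <= low_cpu:
        return "real"
    # High CPU with slow queries: the CPU flame graph is likely sufficient.
    if cpu_utilisation >= high_cpu:
        return "cpu"
    # Middle band: no rule applies cleanly, so look at both (an assumption).
    return "both"

print(suggest_profile(0.15))                                           # real
print(suggest_profile(0.90))                                           # cpu
print(suggest_profile(0.90, predicted_gain=0.45, observed_gain=0.05))  # real
```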

Substrate variants by ecosystem

Different runtimes expose the distinction differently:

| Substrate | CPU mode | Real mode |
|---|---|---|
| Linux perf | perf record -F 99 -g | perf record -F 99 -g --all-cpus, plus off-CPU eBPF (offcputime) |
| ClickHouse | system.trace_log with trace_type = 'CPU' | system.trace_log with trace_type = 'Real' |
| Go | runtime/pprof CPU profile | execution traces (go tool trace) for blocking; goroutine and block profiles |
| Java | async-profiler -e cpu | async-profiler -e wall |
| Node.js | --prof / V8 sampler | clinic-flame's wall mode |
| Python | py-spy without --idle (idle threads are skipped) | py-spy --idle (includes idle and blocked threads) |

ClickHouse's system.trace_log makes the distinction first-class: the sampling mode is recorded in the trace_type column, so moving between CPU and Real is a one-value change in the query. The flame-graph generation is identical in shape, just sampled differently. This is why Cloudflare's switch was easy: same tooling, same query interface, different sampling mode.
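
As a sketch of "same tooling, different sampling mode", the snippet below folds sampled stacks into the `frame;frame count` text format that flame graph renderers such as flamegraph.pl consume, selecting rows by trace type. The rows are hypothetical and already symbolised, ordered root first; the frame names other than filterPartsByPartition are made up, and the step of fetching rows from system.trace_log is left out.

```python
from collections import Counter

# Hypothetical, already-symbolised samples: (trace_type, frames from root to leaf).
ROWS = [
    ("CPU",  ["select", "filterPartsByPartition"]),
    ("CPU",  ["select", "filterPartsByPartition"]),
    ("CPU",  ["select", "read"]),
    ("Real", ["select", "filterPartsByPartition"]),
    ("Real", ["select", "lock_parts", "futex_wait"]),
    ("Real", ["select", "lock_parts", "futex_wait"]),
    ("Real", ["select", "lock_parts", "futex_wait"]),
]

def fold(rows, trace_type):
    """Collapse samples of one trace type into sorted 'a;b;c count' lines."""
    counts = Counter(
        ";".join(frames) for kind, frames in rows if kind == trace_type
    )
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

cpu_folded = fold(ROWS, "CPU")
real_folded = fold(ROWS, "Real")

print("\n".join(cpu_folded))
# select;filterPartsByPartition 2
# select;read 1
print("\n".join(real_folded))
# select;filterPartsByPartition 1
# select;lock_parts;futex_wait 3
```

The folding code is identical for both modes; only the filter value changes, which mirrors why switching modes is cheap once the samples are in one table.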

Related concepts

  • Off-CPU profiling (Brendan Gregg's coinage): the formal name for the wait-time half that Real profiles capture. The bcc tools offcputime and offwaketime are the canonical eBPF implementations on Linux.
  • futex_wait signature: in Real profiles, mutex contention shows up as time spent in futex_wait calls (Linux) or the equivalent kernel primitive on other OSes. Recognising this as the smoking gun for lock contention is a learned pattern.
  • Sampled vs instrumented profiling: the CPU/Real distinction is orthogonal to concepts/instrumented-vs-sampling-profile; both modes here are sampling profiles, just with different sample triggers.
  • Continuous profiling at production scale: modern continuous-profiling systems (Pyroscope, Parca, Datadog Profiler) expose both modes, and queries against the profile store let you compare them. See concepts/continuous-profiling.
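
The futex_wait signature mentioned in the list above can be checked mechanically on folded Real-mode stacks. The helper below is a minimal sketch: it assumes input lines in the `frame;frame count` folded format and treats any frame containing the substring `futex_wait` as a wait marker. Exact kernel frame names vary between kernel versions, so the marker list is an assumption to adjust.

```python
def wait_share(folded_lines, markers=("futex_wait",)):
    """Fraction of samples whose stack contains a frame matching any marker."""
    total = waiting = 0
    for line in folded_lines:
        # Each line is "frame;frame;... count"; split off the trailing count.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        frames = stack.split(";")
        if any(marker in frame for frame in frames for marker in markers):
            waiting += n
    return waiting / total if total else 0.0

# Hypothetical folded Real-mode profile.
folded = [
    "select;filterPartsByPartition 1",
    "select;lock_parts;futex_wait 3",
]
print(wait_share(folded))   # 0.75
```

A high share here is a hint to look at which locks sit above the futex frames, not proof of a specific culprit.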

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance. CPU flame graph showed 45 % of leaf SELECT CPU in filterPartsByPartition; first patch to that function delivered only 5 %; switching to Real flame graph showed >50 % of leaf SELECT duration waiting on the MergeTreeData parts mutex — a function invisible in the CPU profile. Three follow-up patches (shared lock + deferred-copy snapshot + binary search on sorted prefix) targeted the actual bottleneck. The diagnostic flip from CPU to Real flame graphs is named in the post as the load-bearing investigation move.