CONCEPT Cited by 1 source
CPU vs Real flame graph
A CPU flame graph and a Real (wall-clock) flame graph are two distinct profiling modes that look superficially identical (same horizontal-stacks-of-frames visualisation) but answer different questions about the same running program. Conflating the two is the canonical source of "my flame graph shows 45 % of CPU in this function but optimising it only saved 5 %" false leads.
The distinction
- CPU flame graph (the default, e.g. Linux perf record without flags, Brendan Gregg's classic flame graphs). Samples a thread's stack only when the thread is on-CPU. The width of each frame represents CPU time spent with that frame on the stack. Threads that are idle, blocked on a mutex, blocked on I/O, blocked on a syscall, or sleeping contribute zero samples during their wait.
- Real flame graph (also called "wall-clock" or off-CPU + on-CPU). Samples every thread on each sampling tick, regardless of state. Width represents wall-clock time (or thread-time) spent with that frame on the stack, including time spent waiting.
Same stacks; different denominators.
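A toy model makes the two denominators concrete. The sketch below is not a real profiler: it assumes a hypothetical timeline in which each thread is, at every sampling tick, either on-CPU or blocked in some frame, and it counts samples both ways. The thread names, frame names, and tick counts are invented for illustration.

```python
from collections import Counter

# Hypothetical timeline: one (leaf_frame, on_cpu) entry per sampling tick.
# All names and counts are made up; nothing here comes from a real trace.
TIMELINE = {
    "worker-1": [("parse_query", True)] * 4 + [("lock_wait", False)] * 16,
    "worker-2": [("filter_parts", True)] * 6 + [("lock_wait", False)] * 14,
}


def profile(timeline, include_off_cpu):
    """Return each leaf frame's share of the collected samples.

    include_off_cpu=False mimics a CPU profile (only on-CPU ticks count).
    include_off_cpu=True mimics a Real profile (every tick counts).
    """
    counts = Counter()
    for ticks in timeline.values():
        for frame, on_cpu in ticks:
            if on_cpu or include_off_cpu:
                counts[frame] += 1
    total = sum(counts.values())
    return {frame: n / total for frame, n in counts.items()}


cpu_profile = profile(TIMELINE, include_off_cpu=False)
real_profile = profile(TIMELINE, include_off_cpu=True)

print("CPU :", cpu_profile)
print("Real:", real_profile)
```

In this toy timeline filter_parts is 60 % of the CPU profile but only 15 % of the Real profile, while lock_wait, absent from the CPU profile, accounts for 75 % of the Real one. The stacks did not change; only the set of ticks being counted did.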
When the distinction is load-bearing
- CPU profile answers: "where is CPU going?"
- Real profile answers: "where is time going?"
If a workload's bottleneck is CPU-bound (compute, parsing, serialisation, encryption), the two profiles look similar and the CPU profile is sufficient.
If a workload's bottleneck is wait-bound (lock contention, blocking I/O, network round-trips, GC stop-the-world pauses, condition variables, page faults, futex waits), the two profiles can be wildly different. The CPU profile shows only the small fraction of work the runnable threads happen to be doing; the Real profile shows the large fraction of time the blocked threads are stuck waiting.
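The gap between wall-clock time and CPU time for a blocked thread can be observed directly. The following is a minimal sketch, assuming a Python 3.7+ runtime on a platform where time.thread_time() is available; the 0.3 s hold time and all names are arbitrary choices for illustration, not taken from any real workload.

```python
import threading
import time

HOLD_SECONDS = 0.3  # arbitrary hold time, chosen only for the demo

lock = threading.Lock()
result = {}


def waiter():
    """Measure wall-clock and per-thread CPU time spent acquiring the lock."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time consumed by this thread only
    with lock:  # blocks until the main thread releases the lock
        pass
    result["wall"] = time.perf_counter() - wall_start
    result["cpu"] = time.thread_time() - cpu_start


lock.acquire()  # the main thread holds the lock first
thread = threading.Thread(target=waiter)
thread.start()
time.sleep(HOLD_SECONDS)  # keep the waiter blocked
lock.release()
thread.join()

print(f"wall-clock wait: {result['wall']:.3f} s")
print(f"CPU time used  : {result['cpu']:.3f} s")
```

The waiter spends roughly the whole hold time blocked yet consumes almost no CPU. A sampler that only fires on-CPU would have almost nothing to record for it, whereas a wall-clock sampler would attribute nearly every tick to the lock acquisition.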
Cloudflare's canonical case
Cloudflare's investigation into ClickHouse query-planner slowdown (canonical wiki source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention) illustrates the diagnostic flip almost perfectly:
"We had been generating 'CPU' traces, which only sample active threads. We switched to 'Real' traces, which sample all threads, including those that are inactive or waiting. The new flame graph was a revelation."
- First flame graph (CPU mode): 45 % of leaf SELECT CPU time spent in filterPartsByPartition. Looks like a CPU bottleneck.
- Cloudflare's first patch optimises that function's predicate ordering: 5 % improvement.
- Real-trace flame graph: >50 % of leaf SELECT duration spent waiting on the MergeTreeData parts mutex, a wait that doesn't even appear in the CPU profile because the threads holding the critical section are quick (the work is fast) and the threads waiting aren't using CPU.
The CPU profile was correct about where CPU was going. It was silent about where time was going. The five-percent first patch is the canonical "correct fix to the wrong problem" artifact this distinction exists to prevent.
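Why a large share of CPU can translate into a small end-to-end gain follows from simple proportional arithmetic. The sketch below assumes hypothetical inputs (a query that is on-CPU for 20 % of its duration, and a fix that makes the hot function twice as fast); only the 45 % figure echoes the case above, and the other two numbers are illustrative assumptions, not measurements from the Cloudflare post.

```python
def wall_clock_saving(cpu_share_of_wall, func_share_of_cpu, func_speedup):
    """Fraction of wall-clock time saved by speeding up one on-CPU function.

    Assumes wait time is unaffected by the fix, so only the function's
    own slice of wall-clock time can shrink.
    """
    func_share_of_wall = cpu_share_of_wall * func_share_of_cpu
    return func_share_of_wall * (1 - 1 / func_speedup)


# Hypothetical inputs: 20 % of the query's duration is on-CPU, the function
# is 45 % of that CPU time, and the patch makes it 2x faster.
saving = wall_clock_saving(
    cpu_share_of_wall=0.20,
    func_share_of_cpu=0.45,
    func_speedup=2.0,
)
print(f"wall-clock saving: {saving:.1%}")
```

Under these assumptions the saving is 0.20 × 0.45 × (1 − 1/2) = 0.045, or about 4.5 % of wall-clock time. The remaining 80 % of the duration, the part spent waiting, is untouched by any CPU-side fix.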
Operational rules of thumb
Use these as priors when investigating a slow path:
- Total CPU utilisation low + queries slow → wait-bound; pull a Real flame graph immediately. Common causes: lock contention, blocking I/O, network waits.
- Total CPU utilisation high + queries slow → CPU-bound; the CPU flame graph is likely sufficient.
- A CPU-side fix yields significantly less than the flame graph predicted → pull a Real flame graph; the delta-explaining work is in a wait state somewhere.
- Concurrency-sensitive slowdown (slow under load, fast under single-user replay) → almost always contention; a CPU profile of the multi-user case will understate by a wide margin.
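The rules above can be sketched as a small triage function. Everything here is hypothetical: the Observation fields, the 30 % / 70 % utilisation cutoffs, and the "less than half the predicted gain" threshold are assumptions chosen for illustration, since the rules themselves only say "low", "high", and "significantly less".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; tune them to your own environment.
LOW_CPU = 0.30          # at or below this, treat the host as mostly idle
HIGH_CPU = 0.70         # at or above this, treat the host as CPU-saturated
SHORTFALL_RATIO = 0.5   # measured gain below half of predicted = shortfall


@dataclass
class Observation:
    cpu_utilisation: float                  # 0.0-1.0 during the slow period
    slow_only_under_load: bool = False      # fast in single-user replay?
    predicted_gain: Optional[float] = None  # gain the CPU flame graph implied
    measured_gain: Optional[float] = None   # gain observed after the fix


def next_profile(obs: Observation) -> str:
    """Suggest which flame graph to pull next: 'cpu', 'real', or 'both'."""
    if obs.predicted_gain is not None and obs.measured_gain is not None:
        if obs.measured_gain < SHORTFALL_RATIO * obs.predicted_gain:
            return "real"  # a CPU-side fix under-delivered
    if obs.slow_only_under_load:
        return "real"      # concurrency-sensitive: suspect contention
    if obs.cpu_utilisation <= LOW_CPU:
        return "real"      # slow but idle: wait-bound
    if obs.cpu_utilisation >= HIGH_CPU:
        return "cpu"       # slow and busy: CPU-bound
    return "both"          # ambiguous middle ground


print(next_profile(Observation(cpu_utilisation=0.15)))
print(next_profile(Observation(0.9, predicted_gain=0.45, measured_gain=0.05)))
```

The ordering puts the under-delivering fix first on purpose: it is the strongest evidence that the remaining time is in a wait state, even when CPU utilisation looks high.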
Substrate variants by ecosystem
Different runtimes expose the distinction differently:
| Substrate | CPU mode | Real mode |
|---|---|---|
| Linux perf | perf record -F 99 -g | perf record -F 99 -g --all-cpus + off-CPU eBPF (offcputime) |
| ClickHouse | system.trace_log with trace_type = 'CPU' | system.trace_log with trace_type = 'Real' |
| Go | runtime/pprof CPU profile | execution traces (go tool trace) for blocking; goroutine block profile |
| Java | async-profiler -e cpu | async-profiler -e wall |
| Node.js | --prof / V8 sampler | clinic-flame's wall mode |
| Python | py-spy without --idle (the default skips idle threads) | py-spy --idle (includes blocked threads) |
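ClickHouse's system.trace_log makes the distinction first-class: CPU and Real sampling are each enabled by their own query-profiler period setting (query_profiler_cpu_time_period_ns and query_profiler_real_time_period_ns), samples from both land in the same table tagged by trace_type, and the underlying flame-graph generation is identical in shape, just sampled differently. This is exactly why Cloudflare's switch was easy: same tooling, same query interface, different sampling mode.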
Related concepts
- Off-CPU profiling (Brendan Gregg's coinage): the formal name for the wait-time half that Real profiles capture. Linux bcc/offcputime and bcc/offwaketime are canonical eBPF tools.
- futex_wait signature: in Real profiles, mutex contention shows up as time spent in futex_wait calls (Linux) or the equivalent kernel primitive on other OSes. Recognising this as the smoking gun for lock contention is a learned pattern.
- Sampled vs instrumented profiling: the CPU/Real distinction is orthogonal to concepts/instrumented-vs-sampling-profile; both modes here are sampling profiles, just with different sample triggers.
- Continuous profiling at production scale: modern continuous-profiling systems (Pyroscope, Parca, Datadog Profiler) expose both modes; queries against the profile store let you compare. See concepts/continuous-profiling.
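Spotting the futex_wait signature can be partly automated once a Real profile has been collapsed into folded stacks (one line per unique stack, frames joined by semicolons, followed by a sample count, as used by Brendan Gregg's flame graph scripts). The following is a minimal sketch; the helper name share_with_frame and the sample stacks are hypothetical, invented for illustration.

```python
def share_with_frame(folded_lines, needle):
    """Return the fraction of samples whose stack contains `needle`.

    Each input line is expected to look like 'frame_a;frame_b;frame_c 42'.
    Matching is by substring, so 'futex_wait' also matches kernel variants
    whose names merely contain it.
    """
    total = 0
    matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        total += count
        if any(needle in frame for frame in stack.split(";")):
            matched += count
    return matched / total if total else 0.0


# Hypothetical folded stacks from a Real (wall-clock) profile.
SAMPLE = """
main;run_query;select_parts;pthread_mutex_lock;futex_wait 70
main;run_query;select_parts;filter_parts 20
main;run_query;read_block;pread64 10
""".splitlines()

share = share_with_frame(SAMPLE, "futex_wait")
print(f"samples under futex_wait: {share:.0%}")
```

A large share under futex_wait in a Real profile, combined with its absence from the CPU profile of the same workload, is the pattern worth checking for lock contention.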
Seen in
- sources/2026-05-14-cloudflare-clickhouse-query-plan-contention: canonical wiki instance. CPU flame graph showed 45 % of leaf SELECT CPU in filterPartsByPartition; first patch to that function delivered only 5 %; switching to Real flame graph showed >50 % of leaf SELECT duration waiting on the MergeTreeData parts mutex, a wait invisible in the CPU profile. Three follow-up patches (shared lock + deferred-copy snapshot + binary search on sorted prefix) targeted the actual bottleneck. The diagnostic flip from CPU to Real flame graphs is named in the post as the load-bearing investigation move.
Related
- concepts/flamegraph-profiling — the visualisation substrate the two modes share.
- concepts/lock-contention-in-query-planning — the failure class Real flame graphs surface.
- concepts/clickhouse-trace-log — the substrate that exposed the diagnostic in Cloudflare's case.
- concepts/stack-trace-sampling-profiling — the underlying sampling primitive.
- concepts/instrumented-vs-sampling-profile — the orthogonal axis.
- systems/clickhouse — substrate.
- patterns/shared-lock-for-read-only-metadata — the fix the diagnostic enabled.