CONCEPT Cited by 1 source
ClickHouse trace_log¶
system.trace_log
is a built-in ClickHouse system table
that records stack-trace samples from the running
ClickHouse server. It is the canonical investigation
substrate for ClickHouse performance work — the table that
flame-graph generation queries against, and the table whose
metadata makes precise per-user / per-query / per-leaf-query
analysis possible.
Schema (load-bearing columns)¶
system.trace_log rows include (among other fields):
event_time/event_time_microseconds— when the sample was captured.trace_type— what kind of sampling produced this row. The two operationally important modes are:CPU— sampled when the thread is on-CPU. Width represents CPU time. Misses waiting threads. Default for many setups.Real— sampled at wall-clock intervals, regardless of thread state. Width represents wall-clock time and captures blocked threads. Canonical wiki concept: concepts/cpu-vs-real-flame-graph.- Other types include
Memory,MemorySample,Profile— for memory and profile-event tracing. thread_id,query_id,user— metadata that lets you scope the sample to a specific running request. This metadata is what makestrace_logmore useful than rawperfoutput: you can filter to "all samples from useranalyst42running queryq-abc-123" with a single SQLWHEREclause.trace— an array of program counters for the sampled stack. Joined againstsystem.trace_log_*/ symbol tables to render frame names.
Why metadata is the differentiator¶
Standard system profilers (Linux perf, eBPF
profile/offcputime) capture per-thread stacks but have
no application-level concept of "query" or "user". To scope
a profile to a specific query in a generic perf setup you
either: (a) profile only when that query runs (impractical
at production query rate), or (b) capture everything and
filter post-hoc with thread-ID-to-query joins (operationally
heavy, and breaks down for thread-pooled execution).
trace_log solves this at the source: the sampler is
inside ClickHouse, the query / user / thread context is
known at sample time, and the resulting table is a
SQL-queryable substrate. Cloudflare's investigation
(canonical wiki source:
sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)
exploits exactly this affordance:
"It not only includes traces of what code is being executed, but it associates these with specific users, query IDs and other metadata, meaning you can filter down to quite precise sets of events if necessary. In our case, we wanted to look specifically at leaf SELECT queries. This was easy thanks to the available metadata in this table."
A leaf SELECT in ClickHouse is a sub-query running on a
shard as part of a distributed query — distinct from the
coordinator-side initiating query. Filtering trace_log to
just leaf SELECTs (via the is_initial_query /
initial_query_id metadata) was the move that gave
Cloudflare a clean per-shard CPU/Real picture without
coordinator noise.
Generating flame graphs from trace_log¶
The standard pattern is:
- Enable trace sampling at the desired rate (e.g.,
query_profiler_real_time_period_ns = 10000000for 100 Hz Real-mode sampling). - Run / observe the workload.
- SQL-query
system.trace_logjoined with symbol tables to produce folded-stack format:
SELECT
arrayStringConcat(arrayMap(
x -> demangle(addressToSymbol(x)), trace), ';') AS stack,
count() AS samples
FROM system.trace_log
WHERE event_time >= now() - INTERVAL 5 MINUTE
AND trace_type = 'Real'
AND query_id IN (...)
GROUP BY stack
ORDER BY samples DESC
- Pipe folded stacks into Brendan Gregg's
flamegraph.plor Speedscope to render the SVG.
The shape of the rendered flame graph is identical between CPU and Real modes — what differs is whether width represents CPU time (active threads only) or wall-clock time (all threads). See concepts/cpu-vs-real-flame-graph for why this matters.
Operational notes¶
- Sampling rate is configurable per-session via
query_profiler_real_time_period_nsandquery_profiler_cpu_time_period_ns. Higher rates capture more detail but increase per-query overhead andtrace_logwrite rate. - Retention of
trace_logis configured via the table's TTL — production deployments typically retain hours to days of samples. - Sample volume can be large: a busy server emitting 100 Hz Real samples per thread × hundreds of threads produces tens of thousands of rows per second. The table is itself a MergeTree, so the substrate scales — but retention should be tuned to the investigation horizon.
- Symbol availability depends on the ClickHouse build: release builds with stripped symbols give addresses without function names; debug-symbol builds give full demangled names.
Adjacent ClickHouse system tables¶
trace_log sits in a small family of system tables that
together form the investigation substrate:
system.query_log— one row per query, with execution metadata (user, duration, rows, bytes, profile events, query plan).system.processes— currently running queries, live.system.parts— physical-storage state (used to plot part-count vs. duration in Cloudflare's investigation). Cross-reference: concepts/clickhouse-data-part.system.metric_log/system.events— server-level metrics over time.system.opentelemetry_span_log— OTel span exports for query traces.
The combination of query_log (per-query metadata) +
trace_log (per-stack-sample with query attribution) +
parts (physical-storage state) is what made Cloudflare's
year-long ClickHouse-internals investigation tractable from
SQL alone, without attaching perf to running processes.
Seen in¶
- sources/2026-05-14-cloudflare-clickhouse-query-plan-contention
— canonical wiki instance. The flame-graph evidence at
every step of Cloudflare's investigation came from
system.trace_log: the first CPU flame graph attributed 45 % of leaf-SELECT CPU tofilterPartsByPartition; the second Real flame graph showed >50 % of leaf-SELECT duration waiting on theMergeTreeDataparts mutex; the post-Optimization-1 Real flame graph showed the bottleneck shift from lock-wait to vector-copy in the shared critical section; the post-Optimization-2 trace showed the residual time infilterPartsByPartitionmotivating the Optimization 3 binary-search rewrite. Cloudflare explicitly names the user / query-ID / leaf- query metadata as what made "filter down to quite precise sets of events" easy.
Related¶
- systems/clickhouse — substrate.
- concepts/cpu-vs-real-flame-graph — the diagnostic
distinction
trace_logfirst-classes. - concepts/flamegraph-profiling — the rendering layer on top.
- concepts/lock-contention-in-query-planning — the
failure class
trace_log-derived Real flame graphs surface. - concepts/stack-trace-sampling-profiling — the underlying sampling primitive.