Skip to content

CONCEPT Cited by 1 source

ClickHouse trace_log

system.trace_log is a built-in ClickHouse system table that records stack-trace samples from the running ClickHouse server. It is the canonical investigation substrate for ClickHouse performance work — the table that flame-graph generation queries against, and the table whose metadata makes precise per-user / per-query / per-leaf-query analysis possible.

Schema (load-bearing columns)

system.trace_log rows include (among other fields):

  • event_time / event_time_microseconds — when the sample was captured.
  • trace_type — what kind of sampling produced this row. The two operationally important modes are:
  • CPU — sampled when the thread is on-CPU. Width represents CPU time. Misses waiting threads. Default for many setups.
  • Real — sampled at wall-clock intervals, regardless of thread state. Width represents wall-clock time and captures blocked threads. Canonical wiki concept: concepts/cpu-vs-real-flame-graph.
  • Other types include Memory, MemorySample, Profile — for memory and profile-event tracing.
  • thread_id, query_id, user — metadata that lets you scope the sample to a specific running request. This metadata is what makes trace_log more useful than raw perf output: you can filter to "all samples from user analyst42 running query q-abc-123" with a single SQL WHERE clause.
  • trace — an array of program counters for the sampled stack. Joined against system.trace_log_* / symbol tables to render frame names.

Why metadata is the differentiator

Standard system profilers (Linux perf, eBPF profile/offcputime) capture per-thread stacks but have no application-level concept of "query" or "user". To scope a profile to a specific query in a generic perf setup you either: (a) profile only when that query runs (impractical at production query rate), or (b) capture everything and filter post-hoc with thread-ID-to-query joins (operationally heavy, and breaks down for thread-pooled execution).

trace_log solves this at the source: the sampler is inside ClickHouse, the query / user / thread context is known at sample time, and the resulting table is a SQL-queryable substrate. Cloudflare's investigation (canonical wiki source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention) exploits exactly this affordance:

"It not only includes traces of what code is being executed, but it associates these with specific users, query IDs and other metadata, meaning you can filter down to quite precise sets of events if necessary. In our case, we wanted to look specifically at leaf SELECT queries. This was easy thanks to the available metadata in this table."

A leaf SELECT in ClickHouse is a sub-query running on a shard as part of a distributed query — distinct from the coordinator-side initiating query. Filtering trace_log to just leaf SELECTs (via the is_initial_query / initial_query_id metadata) was the move that gave Cloudflare a clean per-shard CPU/Real picture without coordinator noise.

Generating flame graphs from trace_log

The standard pattern is:

  1. Enable trace sampling at the desired rate (e.g., query_profiler_real_time_period_ns = 10000000 for 100 Hz Real-mode sampling).
  2. Run / observe the workload.
  3. SQL-query system.trace_log joined with symbol tables to produce folded-stack format:
SELECT
  arrayStringConcat(arrayMap(
    x -> demangle(addressToSymbol(x)), trace), ';') AS stack,
  count() AS samples
FROM system.trace_log
WHERE event_time >= now() - INTERVAL 5 MINUTE
  AND trace_type = 'Real'
  AND query_id IN (...)
GROUP BY stack
ORDER BY samples DESC
  1. Pipe folded stacks into Brendan Gregg's flamegraph.pl or Speedscope to render the SVG.

The shape of the rendered flame graph is identical between CPU and Real modes — what differs is whether width represents CPU time (active threads only) or wall-clock time (all threads). See concepts/cpu-vs-real-flame-graph for why this matters.

Operational notes

  • Sampling rate is configurable per-session via query_profiler_real_time_period_ns and query_profiler_cpu_time_period_ns. Higher rates capture more detail but increase per-query overhead and trace_log write rate.
  • Retention of trace_log is configured via the table's TTL — production deployments typically retain hours to days of samples.
  • Sample volume can be large: a busy server emitting 100 Hz Real samples per thread × hundreds of threads produces tens of thousands of rows per second. The table is itself a MergeTree, so the substrate scales — but retention should be tuned to the investigation horizon.
  • Symbol availability depends on the ClickHouse build: release builds with stripped symbols give addresses without function names; debug-symbol builds give full demangled names.

Adjacent ClickHouse system tables

trace_log sits in a small family of system tables that together form the investigation substrate:

  • system.query_log — one row per query, with execution metadata (user, duration, rows, bytes, profile events, query plan).
  • system.processes — currently running queries, live.
  • system.parts — physical-storage state (used to plot part-count vs. duration in Cloudflare's investigation). Cross-reference: concepts/clickhouse-data-part.
  • system.metric_log / system.events — server-level metrics over time.
  • system.opentelemetry_span_log — OTel span exports for query traces.

The combination of query_log (per-query metadata) + trace_log (per-stack-sample with query attribution) + parts (physical-storage state) is what made Cloudflare's year-long ClickHouse-internals investigation tractable from SQL alone, without attaching perf to running processes.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance. The flame-graph evidence at every step of Cloudflare's investigation came from system.trace_log: the first CPU flame graph attributed 45 % of leaf-SELECT CPU to filterPartsByPartition; the second Real flame graph showed >50 % of leaf-SELECT duration waiting on the MergeTreeData parts mutex; the post-Optimization-1 Real flame graph showed the bottleneck shift from lock-wait to vector-copy in the shared critical section; the post-Optimization-2 trace showed the residual time in filterPartsByPartition motivating the Optimization 3 binary-search rewrite. Cloudflare explicitly names the user / query-ID / leaf- query metadata as what made "filter down to quite precise sets of events" easy.
Last updated · 542 distilled / 1,571 read