Skip to content

SYSTEM Cited by 4 sources

ClickHouse

ClickHouse is an open-source columnar OLAP database originally from Yandex, widely used for high-volume analytics and telemetry workloads. Clusters are sharded across nodes; cross-shard queries use distributed tables.

Distributed-query model

A ClickHouse cluster of N shards typically exposes two layers:

  • Distributed tables (in a database conventionally named default) — virtual tables backed by the Distributed table engine that fan out queries to the underlying shard tables on each node.
  • Underlying shard tables (in a database conventionally named r0) — where data is physically stored on each shard.

A client queries the distributed table; the coordinator fans out subqueries to r0.* on each shard node; results merge back.

Historically, distributed subqueries ran under a shared system account. A more fine-grained model runs them under the initiating user's account so that per-user resource limits and grants apply correctly — one bad subquery from a user can no longer affect others. Enabling this model requires granting users explicit access on the r0 underlying tables (they already had implicit access through the distributed tables).

Canonical wiki concept: concepts/clickhouse-distributed-query.

System catalog

ClickHouse exposes metadata tables (system.tables, system.columns, etc.) that list tables / columns visible to the querying user. Metadata visibility follows the user's grants:

  • Without explicit r0 grants: system.columns shows only default columns.
  • With explicit r0 grants: system.columns shows both default and r0 columns.

For a given table name (e.g., http_requests_features), the same row appears once per database namespace the user can see. A query that filters by table but not by database will therefore return a different row count depending on the grants in effect.

MergeTree storage engine (time-series + archival)

The MergeTree family is ClickHouse's default storage engine and the canonical shape for time-series / telemetry / event workloads. Core properties:

  • Partition by timePARTITION BY toYYYYMM(timestamp) or PARTITION BY toDate(timestamp) is the canonical time- series idiom. Partitions are the unit of fast pruning at query time and the unit of archival / eviction. Canonical wiki concept: concepts/clickhouse-mergetree-partition-by-time.
  • Granule-level primary index — ClickHouse stores the minimum value of the ORDER BY key per granule (default 8,192 rows per granule). Query-time granule skipping is what lets MergeTree "query across months or years of telemetry, logs, or CI/CD events in milliseconds" (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse). Canonical wiki concept: concepts/clickhouse-index-granule-skipping; the index_granularity knob trades finer skipping for more index overhead.
  • Native TTL policiesMODIFY TTL <expr> DELETE (or TO VOLUME 'cold') declaratively evict or move aged rows at merge time. Canonical wiki concept: concepts/clickhouse-ttl-policy.
  • Manual detach archivalALTER TABLE … DETACH PARTITION '202501' moves whole partitions out of the live dataset into detached/ for operator-controlled archival to S3 / NAS. Canonical wiki concept: concepts/clickhouse-detached-partition-archival.
  • Column-level codecsCODEC(LZ4) / CODEC(ZSTD) / CODEC(Delta, ZSTD) / CODEC(Gorilla) set at table creation or via ALTER TABLE … MODIFY COLUMN … CODEC(...). Enables hot-cold codec tiering — lightweight codec on hot SSD, aggressive codec on cold S3.
  • Append-only model — no row-level UPDATE / DELETE in the normal path (only via TTL or explicit expensive mutations). Yields weak tamper resistance for audit logs; strong guarantees need an external hash ledger layered on top. Canonical wiki concept: concepts/clickhouse-append-only-tamper-resistance.

Query execution — SIMD + cache locality

ClickHouse's performance on analytical aggregation rests on two structural choices:

Together these give ClickHouse its "sub-second latency across months or years" property on MergeTree tables.

Time-series use cases (2025-12-09 Redpanda disclosure)

The 2025-12-09 Redpanda IoT-pipeline tutorial post names three canonical use cases for time-partitioned MergeTree:

  • Historical sensor trends — aggregate / filter / visualise time-series patterns across days/months/years with SQL (partition pruning + granule skipping make long-range queries cheap).
  • ML training data — columnar table stores raw or preprocessed features; partition by time + order by entity-id lets training code extract labeled datasets, join across sources, and aggregate efficiently.
  • Audit logs — the append-only property plus external hash ledger (see concepts/clickhouse-append-only-tamper-resistance) makes MergeTree a viable audit-log substrate.

The same post also names ClickHouse + Snowflake dual-tier as the pattern when a pipeline needs both real-time governance (Snowflake) and long-term columnar retention (ClickHouse) — with Redpanda Connect's parallel outputs as the fan-out primitive.

Gap: no first-class Redpanda Connect output (2025-12)

At the time of the 2025-12-09 Redpanda post, Redpanda Connect does not have a dedicated ClickHouse output connector analogous to snowflake_streaming. The canonical pipeline uses generic sql_raw / sql_insert processors targeting ClickHouse via its SQL interface — flexible, but not a columnar-native bulk-insert path. "While there isn't a dedicated ClickHouse connector for Redpanda Connect yet, the sql_raw and sql_insert components allow you to stream execute commands or stream data from Redpanda into ClickHouse." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

Query planner — parts-list lock contention at high part count (2026-05-14 Cloudflare disclosure)

A 2026-05-14 Cloudflare retrospective canonicalised the MergeTreeData parts-list mutex as the load-bearing internal bottleneck once part count grows beyond tens of thousands per replica. The mechanism (Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention):

To plan a query, every thread acquired an exclusive lock on the table's parts mutex, made a complete copy of the entire parts list (a std::vector of part descriptors), released the lock, and only then filtered the copy down to the parts the query actually needed. With tens of thousands of parts and hundreds of concurrent queries, planners stood in a single-file line on the mutex.

Diagnostic signature: the failure is invisible to per- query metrics (rows scanned, bytes read, parts read are unchanged) and invisible to CPU flame graphs (waiting threads aren't sampled). It is visible only on Real flame graphs (sample all threads regardless of state), where >50 % of leaf-SELECT duration shows as waiting on the parts mutex. Canonical wiki concept: concepts/cpu-vs-real-flame-graph — the diagnostic flip from CPU to Real sampling is the load-bearing investigation move. Substrate: system.trace_log supports both modes via a single configuration toggle.

Failure class: canonical wiki concept concepts/lock-contention-in-query-planning — per-cluster control-plane work scales with total part count, not with per-query data scanned. Designs that grow part count without changing per-query data (e.g., per-tenant retention via partition-key extension with PARTITION BY (tenant, day)) realise the cost.

Three-patch mitigation stack (Cloudflare):

  1. Shared lock for the parts mutex — query planners only read the parts list, so multiple planners can enter the critical section concurrently. "A massive, immediate drop in query duration. The lock contention vanished."
  2. Deferred- copy cached snapshot of the parts list — read-only operations read from a shared cached snapshot directly; modifying operations regenerate it. Planners only copy the filtered list of parts they actually need, not the full list.
  3. Binary search on the sorted namespace prefix of the partition key — the parts list is already sorted by the partitioning key; queries that filter on the leading prefix (the dominant case in multi-tenant analytics) can binary-search to narrow the candidate range before any per-part predicate evaluation. Final patch in March 2026 "breaks correlation of query durations with the number of parts."

Optimizations 1 + 2 ship upstream as ClickHouse PR #85535 in ClickHouse 25.11 — canonical patterns/upstream-the-fix instance for the ClickHouse codebase. Optimization 3's upstream contribution status is not stated.

Open second-order problem: the same part-count growth puts pressure on ZooKeeper, which tracks per-part metadata for replicated MergeTree. Cloudflare flagged a "100 gigabyte ZooKeeper cluster" as the open follow-up issue.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance of planner-side lock contention at high part count in MergeTreeData. Cloudflare's Ready-Analytics (2 PiB, hundreds of namespaces, millions of rows/sec) extended its partitioning key from (day) to (namespace, day) to enable per-tenant retention. The migration left per-query data scanned unchanged but multiplied total part count from thousands to 30,000 → 160,000 parts/replica over a year. Per-query duration grew with cluster part count because every planner serialised on an exclusive lock around the table's parts mutex, then copied the full vector. Three-patch fix landed: shared-lock + cached parts-list snapshot + binary search on the namespace prefix of the partition key. PR #85535 (Opts 1 + 2) shipped in ClickHouse 25.11. Names the rejected table-per-namespace alternative and flags the ZooKeeper-metadata pressure ("100 GiB ZooKeeper cluster") as the open follow-up. The post is the wiki's foundational source for MergeTreeData planner internals (filterPartsByPartition, system.trace_log CPU vs Real sampling, the CPU/Real flame-graph distinction). Sibling to the 2025-11-18 Cloudflare Bot Management outage source — same Cloudflare blog, same operator, same ClickHouse cluster fleet, different internal-mechanism story (distributed-query grants vs parts-mutex contention).
  • sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse — canonical wiki introduction of ClickHouse's MergeTree time-partitioning idiom + TTL + detach-archival + codec tiering + granule-skipping + append-only audit substrate. Redpanda IoT-pipeline tutorial (2025-12-09) frames ClickHouse as the long-term columnar archive leg of the dual-tier streaming pipeline paired with Snowflake for short-term real-time analytics. Worked schema example (telemetry_events with PARTITION BY toYYYYMM(timestamp) + TTL INTERVAL 12 MONTH DELETE + CODEC(ZSTD) on the cold value column); canonical hot/cold codec pairing (LZ4 on SSD / ZSTD on S3).
  • sources/2025-11-18-cloudflare-outage-on-november-18-2025 — canonical wiki instance. Cloudflare was mid-rollout of the per-user-account distributed-query model. The new grants made r0.http_requests_features visible via system.columns to users who previously saw only default.http_requests_features. A downstream consumer (Bot Management's feature-file generator) ran SELECT ... FROM system.columns WHERE table = 'http_requests_features' without filtering by database, so the result-row count roughly doubled on migrated nodes. The doubled-size feature file then broke the preallocated 200-feature cap in the FL2 core proxy — triggering a 3-hour core-traffic outage.
  • — canonical wiki instance as aggregate-telemetry storage backend for PlanetScale Insights. Data flow: Postgres extension emits two parallel streams (individual-query + aggregate-summary) to Kafka topics; aggregate-stream messages land in ClickHouse and power "the majority of Insights including the query table, anomalies, and all query-related graphs." Positions ClickHouse as the default choice for per-query-pattern tag-dimension aggregation in a high-volume telemetry pipeline — the columnar OLAP shape is a natural fit for the per-tag-combination aggregate schema.
Last updated · 542 distilled / 1,571 read