Skip to content

CONCEPT Cited by 2 sources

ClickHouse data part

A data part (or simply part) is the canonical physical storage unit in ClickHouse's MergeTree family of table engines. A part is an immutable, self-contained, on-disk directory that holds:

  • A subset of the table's rows (a contiguous range in the primary-key ordering).
  • All the column files for those rows, each compressed with the column's configured codec.
  • The sparse primary index (the granule- level index that lets queries skip over irrelevant ranges without reading them).
  • Per-column min / max statistics and other metadata.
  • A partition.dat and minmax_*.dat describing which partition the part belongs to and the value range of each partition column.

A table at any moment is the union of its parts.

Lifecycle

  • A part is created on insert (or on merge — see below). Parts are written atomically: a partial directory with a temporary name, then a rename, then registration in the parts index.
  • A part is immutable once written — no in-place row updates. This is what makes MergeTree append-only. Cross-reference: concepts/clickhouse-append-only-tamper-resistance.
  • The MergeTree background merger combines small parts in the same partition into larger parts. Merging is log-structured-merge-style: it produces a new combined part and (eventually) deletes the source parts. The merger throttles itself based on cluster part-count and merge-budget settings.
  • Parts can be detached (moved to detached/ and unregistered) for archival or deletion via ALTER TABLE … DETACH PARTITION.
  • TTL policies (canonical wiki concept: concepts/clickhouse-ttl-policy) drop or move whole parts at merge time when a part's data ages out.

Part vs. partition vs. granule

These three units are easy to confuse:

Unit Definition Granularity Operator-visible
Partition A logical subset of the table whose data shares the partition expression's value (e.g., toYYYYMM(timestamp) = '202506'). Per partition expression. ALTER TABLE … DROP PARTITION '202506'
Part A physical on-disk directory holding rows in a primary-key range, all from one partition. Many parts per partition (created by inserts; merged into fewer parts over time). system.parts table
Granule A logical row-block within a part (default 8,192 rows). The unit of the sparse primary index. Many granules per part. index_granularity setting

Partitions are logical and operator-controlled; parts are physical and merge-controlled; granules are internal to each part for indexing and skipping.

Why part count is a load-bearing axis

Per-query data scanned scales with selected partitions × selected granules per part × parts per partition — and partition pruning + granule skipping keep this small for well-designed queries. But several per-cluster operations scale with total part count:

  • Query planning — the planner takes a lock on the per-table parts list, copies it, and filters it down to the relevant parts for each query. Per-query work includes a linear-or-binary search through the entire parts list. With tens of thousands of parts and hundreds of concurrent queries, planning time becomes a significant fraction of total query duration. Canonical wiki instance: concepts/lock-contention-in-query-planning.
  • Cluster metadata — replicated MergeTree tracks part metadata in ZooKeeper (or ClickHouse Keeper). State size scales linearly with part count. Cloudflare hints at a "100 gigabyte ZooKeeper cluster" as the open second-order problem at high part counts.
  • Background merges — the merger considers all parts in each partition when scheduling merges; per-cluster scheduling scales with part count.
  • Replication catch-up — when a replica catches up after downtime, it has to fetch metadata for every part it's missing.

Designs that increase partitions per table (e.g., per-tenant retention via partitioning key) automatically increase parts per table in the same proportion (because each partition has its own set of parts). The cost is visible only at the per-cluster axes above, not at per-query data-scan metrics — which is precisely why such migrations can pass design review but slow queries down post-rollout.

Operational visibility

ClickHouse's system.parts table exposes one row per part:

  • database, table, partition
  • name — the part's directory name (<partition>_<min_block>_<max_block>_<level>)
  • bytes_on_disk, data_compressed_bytes, data_uncompressed_bytes
  • rows
  • marks — count of granules
  • active — whether the part is currently part of the table (vs. an obsolete merge source pending cleanup)
  • min_time, max_time — value range of the time column

Aggregation queries against system.parts are the canonical diagnostic tool for part-count growth ("how many parts per table?", "how is part-count distributed across replicas?", "what's the 95th percentile partition's part count?"). In Cloudflare's incident the team plotted average SELECT query duration against total parts per replica — the result was an undeniable correlation that pointed at part count as the cost axis.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — Cloudflare's Ready-Analytics partition-count growth from thousands → 30,000 → 160,000 parts per replica drives the canonical wiki instance of part-count-as-hidden-cost-axis. "With tens of thousands of parts and hundreds of concurrent queries, they were all just standing in a single-file line" on the per-table parts mutex during query planning. The post extends the wiki's MergeTree framing with the planner-side cost mechanism that the per-tenant-retention partition-key extension realised.
  • sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse — earlier wiki source on MergeTree's time-partitioning idiom; introduces parts implicitly via the "sub-second latency across months or years" property that depends on part-level granule skipping. The 2026-05-14 Cloudflare source layers the per-cluster cost mechanism onto that framing.
Last updated · 542 distilled / 1,571 read