CONCEPT Cited by 1 source

ClickHouse index granule skipping¶

Definition¶

ClickHouse MergeTree's primary index is a sparse index of ORDER BY-key minimum values stored once per granule (a block of rows, default 8,192 rows). When a query's WHERE predicate is on the ORDER BY key, the engine uses these stored min values to skip whole granules that can't match — without reading their data.

This is what lets ClickHouse "query across months or years of telemetry, logs, or CI/CD events in milliseconds" on tables with tens of billions of rows: most granules are skipped.

Verbatim framing from the 2025-12-09 Redpanda IoT pipeline post:

"ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the minimum value of the ORDERED BY key. When you execute a query with a filter, ClickHouse uses the index to skip granules that don't match. This is called MergeTree indexing, and it's why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds. You can control how often index entries are written using index_granularity settings. Smaller granules mean finer skipping but more overhead." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

How it works¶

Granule = block of index_granularity consecutive rows in ORDER BY-key order (default 8,192). Rows within a granule are physically adjacent on disk per column (columnar layout).
Primary index file = one min-value entry per granule per column in the ORDER BY key. Small enough to live in RAM.
Query-time skip = binary-search the primary index for granules whose min-value range could contain matches. Only matching granules are read from disk.

Unlike a B-tree index, this is sparse — it does not index every row, only every index_granularity-th row. This keeps the index small enough to stay hot in memory, at the cost of reading whole granules whenever a matching row is found.

`index_granularity` trade-off¶

Smaller granules (e.g. 1,024 rows) = finer skipping — fewer wasted reads when selectivity is high — but a larger primary index (more entries per part) and more per-granule overhead (codec framing, more merge work).
Larger granules (e.g. 16,384+ rows) = coarser skipping — more wasted reads — but a smaller primary index that stays comfortably in cache.
Default 8,192 is tuned for the common case. Low- selectivity queries (e.g. full scans, column aggregations) are insensitive to the choice; high-selectivity point lookups benefit from smaller granules.

Not a B-tree¶

Two structural consequences follow from sparse-index design:

No random-access point lookups. Querying for a single row by primary key still reads an entire granule — the minimum unit of I/O.
Skips rely on ORDER BY alignment. A WHERE predicate on a column not in the ORDER BY key gets no skip benefit and requires a full scan of matching granules.

Composes with¶

concepts/clickhouse-mergetree-partition-by-time — partition pruning is the coarser outer skip (skip whole partitions); granule skipping is the finer inner skip (skip blocks within partitions).
concepts/cache-locality — granule = physical block on disk + CPU-cache-friendly access when the columnar layout is SIMD-processed linearly.
concepts/simd-vectorization — granule reads feed SIMD processing of column values; the post notes "ClickHouse is also optimized for CPU cache locality, making it ideal for analytical queries."

Caveats¶

Skip index is only on the ORDER BY key prefix. Other predicate columns either need secondary skip indexes (INDEX clauses with minmax, bloom_filter, set, ngrambf_v1 types) or force a full scan.
ORDER BY choice is load-bearing and can't be changed without rewriting the table. Common pattern: ORDER BY (entity_id, timestamp) for per-entity time-range queries.
Inserts that aren't already in ORDER BY-key order trigger merge work to re-sort, which can be expensive at high ingest rates.
No unique constraints, no foreign keys — MergeTree is append-friendly, not OLTP.

Seen in¶

sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse — canonical wiki introduction. Redpanda IoT-pipeline tutorial post names granule-level min-value skipping as the MergeTree mechanism that lets ClickHouse "query across months or years of telemetry, logs, or CI/CD events in milliseconds", and names index_granularity as the knob controlling finer-skip-vs-overhead trade-off.