CONCEPT Cited by 1 source
ClickHouse index granule skipping¶
Definition¶
ClickHouse MergeTree's primary index is a sparse index of
ORDER BY-key minimum values stored once per granule (a
block of rows, default 8,192 rows). When a query's WHERE
predicate is on the ORDER BY key, the engine uses these
stored min values to skip whole granules that can't match —
without reading their data.
This is what lets ClickHouse "query across months or years of telemetry, logs, or CI/CD events in milliseconds" on tables with tens of billions of rows: most granules are skipped.
Verbatim framing from the 2025-12-09 Redpanda IoT pipeline post:
"ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the minimum value of the
ORDERED BYkey. When you execute a query with a filter, ClickHouse uses the index to skip granules that don't match. This is called MergeTree indexing, and it's why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds. You can control how often index entries are written usingindex_granularitysettings. Smaller granules mean finer skipping but more overhead." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)
How it works¶
- Granule = block of
index_granularityconsecutive rows inORDER BY-key order (default 8,192). Rows within a granule are physically adjacent on disk per column (columnar layout). - Primary index file = one min-value entry per granule
per column in the
ORDER BYkey. Small enough to live in RAM. - Query-time skip = binary-search the primary index for granules whose min-value range could contain matches. Only matching granules are read from disk.
Unlike a B-tree index, this is sparse — it does not index
every row, only every index_granularity-th row. This keeps
the index small enough to stay hot in memory, at the cost of
reading whole granules whenever a matching row is found.
index_granularity trade-off¶
- Smaller granules (e.g. 1,024 rows) = finer skipping — fewer wasted reads when selectivity is high — but a larger primary index (more entries per part) and more per-granule overhead (codec framing, more merge work).
- Larger granules (e.g. 16,384+ rows) = coarser skipping — more wasted reads — but a smaller primary index that stays comfortably in cache.
- Default 8,192 is tuned for the common case. Low- selectivity queries (e.g. full scans, column aggregations) are insensitive to the choice; high-selectivity point lookups benefit from smaller granules.
Not a B-tree¶
Two structural consequences follow from sparse-index design:
- No random-access point lookups. Querying for a single row by primary key still reads an entire granule — the minimum unit of I/O.
- Skips rely on
ORDER BYalignment. AWHEREpredicate on a column not in theORDER BYkey gets no skip benefit and requires a full scan of matching granules.
Composes with¶
- concepts/clickhouse-mergetree-partition-by-time — partition pruning is the coarser outer skip (skip whole partitions); granule skipping is the finer inner skip (skip blocks within partitions).
- concepts/cache-locality — granule = physical block on disk + CPU-cache-friendly access when the columnar layout is SIMD-processed linearly.
- concepts/simd-vectorization — granule reads feed SIMD processing of column values; the post notes "ClickHouse is also optimized for CPU cache locality, making it ideal for analytical queries."
Caveats¶
- Skip index is only on the
ORDER BYkey prefix. Other predicate columns either need secondary skip indexes (INDEXclauses withminmax,bloom_filter,set,ngrambf_v1types) or force a full scan. ORDER BYchoice is load-bearing and can't be changed without rewriting the table. Common pattern:ORDER BY (entity_id, timestamp)for per-entity time-range queries.- Inserts that aren't already in
ORDER BY-key order trigger merge work to re-sort, which can be expensive at high ingest rates. - No unique constraints, no foreign keys — MergeTree is append-friendly, not OLTP.
Seen in¶
- sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse
— canonical wiki introduction. Redpanda IoT-pipeline
tutorial post names granule-level min-value skipping as
the MergeTree mechanism that lets ClickHouse "query
across months or years of telemetry, logs, or CI/CD events
in milliseconds", and names
index_granularityas the knob controlling finer-skip-vs-overhead trade-off.