Skip to content

CONCEPT Cited by 1 source

ClickHouse index granule skipping

Definition

ClickHouse MergeTree's primary index is a sparse index of ORDER BY-key minimum values stored once per granule (a block of rows, default 8,192 rows). When a query's WHERE predicate is on the ORDER BY key, the engine uses these stored min values to skip whole granules that can't match — without reading their data.

This is what lets ClickHouse "query across months or years of telemetry, logs, or CI/CD events in milliseconds" on tables with tens of billions of rows: most granules are skipped.

Verbatim framing from the 2025-12-09 Redpanda IoT pipeline post:

"ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the minimum value of the ORDERED BY key. When you execute a query with a filter, ClickHouse uses the index to skip granules that don't match. This is called MergeTree indexing, and it's why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds. You can control how often index entries are written using index_granularity settings. Smaller granules mean finer skipping but more overhead." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

How it works

  • Granule = block of index_granularity consecutive rows in ORDER BY-key order (default 8,192). Rows within a granule are physically adjacent on disk per column (columnar layout).
  • Primary index file = one min-value entry per granule per column in the ORDER BY key. Small enough to live in RAM.
  • Query-time skip = binary-search the primary index for granules whose min-value range could contain matches. Only matching granules are read from disk.

Unlike a B-tree index, this is sparse — it does not index every row, only every index_granularity-th row. This keeps the index small enough to stay hot in memory, at the cost of reading whole granules whenever a matching row is found.

index_granularity trade-off

  • Smaller granules (e.g. 1,024 rows) = finer skipping — fewer wasted reads when selectivity is high — but a larger primary index (more entries per part) and more per-granule overhead (codec framing, more merge work).
  • Larger granules (e.g. 16,384+ rows) = coarser skipping — more wasted reads — but a smaller primary index that stays comfortably in cache.
  • Default 8,192 is tuned for the common case. Low- selectivity queries (e.g. full scans, column aggregations) are insensitive to the choice; high-selectivity point lookups benefit from smaller granules.

Not a B-tree

Two structural consequences follow from sparse-index design:

  1. No random-access point lookups. Querying for a single row by primary key still reads an entire granule — the minimum unit of I/O.
  2. Skips rely on ORDER BY alignment. A WHERE predicate on a column not in the ORDER BY key gets no skip benefit and requires a full scan of matching granules.

Composes with

  • concepts/clickhouse-mergetree-partition-by-time — partition pruning is the coarser outer skip (skip whole partitions); granule skipping is the finer inner skip (skip blocks within partitions).
  • concepts/cache-locality — granule = physical block on disk + CPU-cache-friendly access when the columnar layout is SIMD-processed linearly.
  • concepts/simd-vectorization — granule reads feed SIMD processing of column values; the post notes "ClickHouse is also optimized for CPU cache locality, making it ideal for analytical queries."

Caveats

  • Skip index is only on the ORDER BY key prefix. Other predicate columns either need secondary skip indexes (INDEX clauses with minmax, bloom_filter, set, ngrambf_v1 types) or force a full scan.
  • ORDER BY choice is load-bearing and can't be changed without rewriting the table. Common pattern: ORDER BY (entity_id, timestamp) for per-entity time-range queries.
  • Inserts that aren't already in ORDER BY-key order trigger merge work to re-sort, which can be expensive at high ingest rates.
  • No unique constraints, no foreign keys — MergeTree is append-friendly, not OLTP.

Seen in

Last updated · 470 distilled / 1,213 read