Skip to content

PATTERN Cited by 1 source

Binary search on sorted partition prefix

Intent

When a database planner has to filter a sorted collection of physical-storage units (parts, segments, files) by a predicate on the leading prefix of the sort key, replace the linear scan through the whole collection with a binary search on the leading prefix to narrow the candidate range first. Per-call cost drops from O(N) to O(log N + k) where k is the number of matching partitions.

The pattern is the storage-engine-internals analogue of SQL- optimizer-level partition pruning — same idea, different altitude. SQL-level pruning narrows partitions before the storage engine is asked; this pattern narrows the parts list the planner walks inside the storage engine.

Canonical instance: ClickHouse filterPartsByPartition

After Cloudflare landed Optimization 1 (shared lock) and Optimization 2 (deferred-copy snapshot), query durations were lower but still correlated with growing part count:

"We're still not done. As part counts grow, performance still degrades, just much more slowly. The correlation with part count was still there. Coming back to this after a few months, a new flame graph (looking the same as Figure 3) shows the time is spent in the filtering code path (the one we tried to fix first). This code performs a linear scan over all parts, evaluating predicates against each one. Over a few months, we were back to select durations from before the optimizations." (Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)

The insight:

"But we know this list of parts is sorted by the partitioning key. Remember that the first column of the partition key is namespace, which the vast majority of queries filter on, because it identifies the 'tenant.' How can we make use of this?"

The fix:

"We implemented a binary search based on the namespace part of the partition ID. This works because the vector is sorted, so you can filter out a lot of the entries without actually looking at them. This is particularly effective since the namespace is the first part of that sorting key. After this first-pass of binary search, we have a much smaller range of parts we need to examine, and for those we still step through each one, applying the same logic as before to exclude parts based on other conditions."

Result: "After deploying this patch in March 2026, query durations dropped by 50%. More importantly, this finally breaks correlation of query durations with the number of parts."

When the pattern applies

The fix is correct iff:

  • The collection is sorted by the partition key. This is the structural prerequisite — without sort order, no binary search is possible. ClickHouse's MergeTree parts list is sorted by partition expression by construction.
  • The query predicate is on the leading prefix of the sort key. A binary search on the namespace prefix works because the parts list is sorted by (namespace, day). A predicate on day alone could not benefit from the same search; same as the leftmost-prefix rule for composite indexes.
  • The predicate has a finite, narrow range of matching values. An equality predicate (namespace = 'X') maps to a contiguous run of parts; binary search finds the run ends in O(log N). An IN list with a small number of values can do k separate binary searches. A predicate like namespace LIKE 'A%' may also be expressible as a range search.
  • Linear-scan post-filtering still catches the rest. The binary-search step narrows to the candidate range; per-part predicate evaluation continues for the remaining filters (e.g. day = '2026-05-01'). The pattern doesn't replace the per-part predicate logic — it just shrinks the input.

The known generalization gap

Cloudflare names the pattern's limitation explicitly:

"Unfortunately, this solution doesn't generalize that well for arbitrary query conditions (e.g. conditions such as namespace in (5, 10)). We are looking into more generic approaches like extending the query condition cache to cover part filtering."

  • Equality on the leading prefix (namespace = 'X'): one binary search, O(log N + k_X). ✅ pattern wins.
  • Small IN list on the leading prefix (namespace IN ('X', 'Y')): k binary searches, O(k × log N). ✅ pattern wins for small k.
  • Range on leading prefix (namespace BETWEEN 'A' AND 'M'): two binary searches for the range bounds, O(log N + matching-parts). ✅ pattern wins.
  • Predicate on non-leading column (day = '2026-05- 01'): no binary search possible on the leading prefix; fallback to linear scan. ❌ pattern doesn't apply.
  • Predicate involving a function of the leading column (hash(namespace) = X): can't binary-search; fallback. ❌ pattern doesn't apply.
  • No predicate on the leading column (cross-tenant aggregation): scans every part. ❌ pattern doesn't apply.

The same constraint shape appears at SQL-optimizer level in concepts/partition-pruning — pattern works for direct equality / range predicates on the partition expression, not for function-of-column or non-leading predicates. The storage-engine-internals analogue inherits the same limitations.

Substrate variants

Substrate Where the binary search lives
ClickHouse MergeTree filterPartsByPartition in the planner; binary-searches the in-memory parts list sorted by partition expression (Cloudflare's Opt 3)
InnoDB B-tree index lookup is itself binary-search-on-sorted-prefix; the pattern is implicit in any index seek
Lakehouse table formats Manifest files are sorted by partition value; reader-side manifest pruning binary-searches the partition column to narrow the file set
LSM-tree compaction Per-level sorted run; range queries binary-search to find the starting position before scanning the matching range
In-memory sorted indexes generally std::lower_bound / Arrays.binarySearch / equivalent

The pattern is universal across sorted-collection storage substrates; the load-bearing observation in the canonical instance is that it can be retrofitted to an existing linear- scan implementation when the collection happens to already be sorted but the search wasn't exploiting the sort.

  • SQL-level partition pruning (concepts/partition-pruning) — the SQL-optimizer altitude. Same idea: use predicate metadata to narrow the candidate set before scanning.
  • Composite-index column ordering (concepts/composite-index-column-order / concepts/leftmost-prefix-rule) — same constraint structure: leading-prefix predicates can use the index; non-leading predicates can't.
  • Skip lists / B+ trees — the data-structure-level primitives that bake binary-search-on-sorted-prefix into their access protocol.

Hard problems

  • Generalisation to arbitrary predicates — Cloudflare's partial-cache approach (extending the query-condition cache) is one direction. A different approach is to precompute per-partition-column min/max statistics and let the planner intersect them with the predicate; this is the approach Iceberg / Delta Lake take with manifest-file column statistics.
  • Maintenance under writes — the parts list must remain sorted as inserts and merges add new parts. In ClickHouse's case the list is sorted by construction; insert positions are determined and the cached snapshot (Opt 2) is regenerated. The binary-search property holds on the snapshot.
  • Per-query overhead vs. setup cost — for very small parts lists the binary-search overhead may exceed linear-scan cost. Implementation should fall back to linear scan below a threshold (typical pattern).

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance. Cloudflare's Optimization 3 on top of Opts 1 + 2: replace the linear scan in filterPartsByPartition with a binary search on the namespace prefix of the partition ID. Deployed March 2026; 50 % drop in query durations and breaks the correlation between query duration and total part count, finally. Cloudflare names the pattern's generalisation gap (namespace IN (5, 10)) and the forward-looking direction (extending the query condition cache).
Last updated · 542 distilled / 1,571 read