Skip to content

CONCEPT Cited by 1 source

Lock contention in query planning

Lock contention in query planning is the failure class where a database's per-cluster metadata lock — typically protecting the table's list of physical storage units (parts, files, segments, regions) — becomes the bottleneck for every concurrent query because each query's planner has to acquire the lock to decide what to read. The contention is invisible to per-query metrics (rows scanned, bytes read, parts read are unchanged) and only surfaces at the wall-clock duration axis when concurrency × per-cluster work product exceeds the lock's throughput.

The shape

A planner runs before execution. To plan a query, the planner needs to know what physical storage units are part of the table right now — the answer changes constantly as inserts create new parts, merges combine parts, retention drops parts, replicas catch up. The list is held in memory, behind a mutex, because concurrent writers would otherwise race.

A typical implementation:

  1. Planner acquires a lock on the parts mutex.
  2. Planner makes a copy of the parts list (so it can release the lock and not block writers for the duration of planning).
  3. Planner releases the lock.
  4. Planner filters the copy down to the parts the query actually needs (partition pruning, predicate evaluation per part, etc.).
  5. Planner returns the filtered set to the executor.

When the lock is exclusive (every planner takes it exclusively even though it's only reading), planners serialise — each planner waits for the previous one to finish step 1–3 before it can enter. With N concurrent planners all needing the same read-only access, the critical-section throughput is 1 / per-planner-lock-time, not N / per-planner-lock-time. At sufficient concurrency the queue depth grows unboundedly.

Even when the lock is shared (correctly modelled as a read lock for read-only access), the per-planner copy of the parts list can become the bottleneck if the list is large enough — copying an N-element vector costs O(N) per planner per query, which at high N × high QPS dominates the critical-section work.

Cloudflare's canonical instance

Cloudflare's Ready-Analytics table grew from PARTITION BY (day) to PARTITION BY (namespace, day) to enable per-tenant retention. The migration multiplied total part count by the namespace count. By the time billing-aggregation jobs missed their deadline, part count was at 30,000 parts per replica. ClickHouse's MergeTreeData parts mutex was used as an exclusive lock by every query planner, even though planning only reads the parts list:

"More than half of our query duration was spent waiting to acquire a single mutex (MergeTreeData) that protects the table's list of parts. To plan a query, every single thread had to: 1. Acquire an exclusive lock on this mutex. 2. Make a complete copy of the list of all parts in the table. 3. Release the lock. 4. Filter that list down to the relevant parts. With tens of thousands of parts and hundreds of concurrent queries, they were all just standing in a single-file line."

(Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)

Why it's hard to see

Three properties make this failure class systemically under-detected:

  1. Per-query metrics look fine. Rows scanned, bytes read, parts read, partitions read — all unchanged. The query is doing the same work; it's just waiting longer to start. Dashboards built around per-query resource counters miss the contention entirely.
  2. CPU profiles miss the wait. Standard CPU sampling profiles only sample threads that are currently running. Threads waiting on a mutex aren't running, so they don't contribute samples. The CPU flame graph shows only the work the non-blocked threads are doing — in Cloudflare's case, 45 % of CPU time in the per-part predicate-evaluation function. That looks like a CPU bottleneck and motivates a CPU-side fix that barely moves the needle (Cloudflare's first patch: reorder predicates in filterPartsByPartition, 5 % improvement). The actual bottleneck is in the wait time of the threads not in the CPU profile.
  3. Wall-clock profiles surface it instantly. Switching from CPU sampling to "Real" sampling (sample all threads, including those waiting / inactive) makes the lock contention immediately visible. Canonical wiki instance: concepts/cpu-vs-real-flame-graph.

Diagnostic: real-time vs CPU flame graphs

The diagnostic flip is the load-bearing tool for this class:

  • CPU flame graph — samples the active stack on each CPU. Shows where CPU is going. Misses idle / waiting threads.
  • Real / Wall-clock flame graph — samples all threads on every interval, regardless of state. Shows where wall-clock time is going across the entire process.

If a CPU flame graph shows X % in a function but the fix yields a fraction of that improvement, always pull a Real flame graph before designing the next patch — the missing delta is almost certainly in wait_for_lock or futex_wait somewhere outside the CPU profile.

Mitigations

The full mitigation stack against this failure class is a sequence:

  1. Use the right lock kind for the access pattern. A planner reading the parts list does not need an exclusive lock — a shared lock allows multiple planners to enter the critical section concurrently. Canonical wiki pattern: patterns/shared-lock-for-read-only-metadata; substrate concept: concepts/shared-lock-vs-exclusive-lock.
  2. Eliminate the per-query copy. Even with a shared lock, copying the entire parts list per query costs O(N) per planner per query. A deferred-copy cached collection (an internal snapshot that read-only ops read from directly, regenerated on writes) lets planners read the snapshot without copying. Canonical wiki pattern: patterns/deferred-copy-cached-collection.
  3. Reduce per-planner scan work. Even without copying, the planner has to filter the list down to the relevant parts. If the parts list is sorted by the partitioning key (which it is, in a MergeTree-style storage engine) and queries filter on a prefix of that key, binary search on the prefix replaces the linear scan. Canonical wiki pattern: patterns/binary-search-on-sorted-partition-prefix.

In Cloudflare's case all three landed: Optimization 1 (shared lock) + Optimization 2 (deferred copy) + Optimization 3 (binary search). Optimizations 1 + 2 are upstream as ClickHouse PR #85535 in ClickHouse 25.11. Optimization 3 is described in the post; the upstream contribution status is not stated.

After all three: at 160k parts per replica (5x the incident-trigger part count), query durations are stable.

Adjacent at other altitudes

  • InnoDB metadata lock contention during DDL — same shape (exclusive lock blocks readers), different metadata (table schema vs. parts list). See concepts/online-ddl for the InnoDB-altitude framing and gh-ost / pt-osc mitigations.
  • Postgres pg_class / catalog contention at high table count — multi-tenant systems with thousands of tables hit catalog-bloat issues; canonical wiki concept: concepts/catalog-bloat-multi-tenant. Same root failure shape — control-plane state grows with tenant count even when per-query work doesn't.
  • Kubernetes API server etcd contention at high object count — every controller's planning work scales with the total object inventory; etcd watch-event throughput dominates at scale.
  • Service mesh control-plane — the xDS server's per- proxy push work scales with the total endpoint / cluster / route count, regardless of how many endpoints any single proxy actually uses.

The unifying observation is that control-plane work tends to scale with total inventory size, not with per-consumer inventory usage — and is therefore amplified by every design choice that grows the inventory (more parts, more tables, more endpoints, more tenants) even when those choices leave per-consumer workload unchanged.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — Cloudflare's year-long investigation into the ClickHouse MergeTreeData parts-mutex contention that emerged after extending Ready-Analytics' partitioning key. Canonical wiki instance: 30k → 160k parts/replica, 45 % of leaf SELECT CPU in filterPartsByPartition (CPU flame graph signature), >50 % of leaf SELECT duration waiting on the mutex (Real flame graph signature), three-patch mitigation stack (shared-lock + deferred-copy + binary-search), upstream PR #85535 in ClickHouse 25.11.
Last updated · 542 distilled / 1,571 read