Skip to content

CLOUDFLARE 2026-05-14 Tier 1

Read original ↗

Cloudflare — Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

One-paragraph summary

Cloudflare engineering post (2026-05-14) on a year-long investigation into a hidden bottleneck in ClickHouse's query planner that emerged after Cloudflare extended the partitioning key on their largest analytics table — the 2 PiB multi-tenant Ready-Analytics substrate that hundreds of internal teams (including billing) write to and read from. The original table was partitioned by (day) with a single 31-day retention applied to every tenant. Engineers extended the partitioning key to (namespace, day) so that per-tenant retention became expressible (some tenants needed years for legal / contractual reasons, others needed only days). The migration went through multiple rounds of design review, and the team explicitly reasoned that "since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change" — and therefore performance shouldn't change either. The reasoning was correct about data scanned and wrong about query duration. Two months after rollout, billing-aggregation jobs (time-critical, hard-deadline) were running progressively slower. None of the usual suspects (I/O, memory, rows scanned, parts read) were elevated. A plot of query duration against total part count in the cluster finally showed the correlation. CPU flame graphs from system.trace_log pointed at filterPartsByPartition (45 % of leaf-query CPU) but the first patch — reordering the heuristics in that function — delivered only a 5 % gain. Switching from CPU traces to "Real" traces (sampling all threads, including those waiting) showed the actual problem: "more than half of leaf query duration is spent waiting for a mutex that protects the list of active parts." Every query planner had to (1) acquire an exclusive lock on MergeTreeData's parts mutex, (2) make a complete copy of the entire parts list, (3) release the lock, (4) filter the copy down. With 30,000 parts per replica at the time of the incident and hundreds of concurrent queries, planners stood in a single-file line. Three patches followed: (1) shared lock instead of exclusive — query planning is read-only on the parts list, so multiple planners can enter the critical section concurrently (massive immediate drop in duration); (2) deferred copy of the parts vector via a "shared copy" cache that read-only operations read from and write operations regenerate on modification — eliminating the per-query tens-of-thousands-element vector copy; (3) binary search on the sorted namespace prefix of the partition key — exploiting the fact that the parts list is already sorted by the partitioning key to skip most of the linear scan in filterPartsByPartition. Patches 1 and 2 are merged upstream as ClickHouse PR #85535 (available since ClickHouse 25.11). Net outcome: at 30k parts/replica when the incident started, billing was missing its hard deadline; a year later at 160k parts/replica, query durations are stable thanks to the optimisations. ZooKeeper part-metadata pressure remains an open problem"Perhaps one day we'll tell the story of the 100 gigabyte ZooKeeper cluster" hints at a structurally related second-order issue not addressed by these patches. The post closes with the honest open question: "Was this partitioning scheme the right long-term choice? Or will we eventually need to bite the bullet and move to a different architecture?"

Key takeaways

  1. The architectural lesson is about hidden costs of partition-count growth, not about partitioning per se. Verbatim assumption from the design review: "since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change. We believed this meant performance would be unaffected." What the assumption missed is that ClickHouse's query planner does per-cluster work before per-query work — it locks and copies the whole list of parts in the table to decide which ones each query should read. Per-query data scanned was indeed unchanged; per-query planning work scaled with total part count. Canonical wiki instance of concepts/lock-contention-in-query-planning and the broader observation that partition count is a hidden cost axis distinct from rows-read, bytes-read, or rows-per-partition. See patterns/per-tenant-retention-via-partition-key-extension.
  2. Per-tenant retention via a (tenant, day) partitioning key is a clean architectural primitive — but it amplifies partition count by O(tenants × days). Cloudflare's Ready-Analytics had hundreds of namespaces × 31-day retention rolling forward = thousands of partitions per table; with the new scheme some tenants kept years of data, multiplying part counts further. The partitioning- key approach inherits the existing partition-based retention machinery (drop partitions older than N days) for free, but the cost of partition-count growth in the planner becomes load-bearing. Canonical wiki concept: concepts/per-tenant-retention-via-partitioning-key; canonical pattern: patterns/per-tenant-retention-via-partition-key-extension.
  3. The flame-graph diagnostic flip from "CPU" traces to "Real" traces was the breakthrough. First flame graph (CPU-based, samples only active threads): 45 % CPU in filterPartsByPartition. Team patched the function's heuristic order. 5 % win. Real-trace flame graph (samples all threads, including waiting / inactive): "more than half of leaf query duration is spent waiting for a mutex." The CPU trace had been hiding the actual bottleneck because the threads doing the bulk of the waiting weren't accumulating CPU samples — they were blocked on the lock. Canonical wiki concept: concepts/cpu-vs-real-flame-graph — the load-bearing distinction every CCA-state-machine / lock-contention / sleep-cycle bug benefits from.
  4. Optimization 1 — shared lock over exclusive lock for read-only access — was a near-trivial code change with a massive impact. Verbatim: "The query planner doesn't modify the parts list; it just reads it. It had no business using an exclusive lock." The patch swaps std::unique_lock for std::shared_lock; multiple planners now enter the critical section concurrently. "A massive, immediate drop in query duration. The lock contention vanished." Canonical wiki instance of patterns/shared-lock-for-read-only-metadata — the pedagogically obvious-in-retrospect mistake of using a writer-lock when a reader-lock suffices, even on metadata; cross-references concepts/shared-lock-vs-exclusive-lock (the InnoDB wiki page's compatibility-matrix framing applies verbatim to MergeTreeData's C++ shared/unique-lock usage).
  5. Optimization 2 — deferred copy via a cached "shared copy" — eliminates a per-query vector copy. After Opt 1, every planner still copied the entire vector of parts under the shared lock; intuitively cheap, but at tens-of-thousands of parts × hundreds of concurrent planners per second it adds up. The fix: maintain a "shared copy" of the parts list as an internal cache that read-only operations (planners) read from directly, and that write operations (inserts, merges, anything that modifies the parts set) regenerate. Planners then only copy the filtered subset they actually need (typically a few partitions out of thousands), not the full list. Canonical wiki instance of patterns/deferred-copy-cached-collection — a read-mostly-collection variant of copy-on-write that pays the regeneration cost on the rare-write side and yields zero-copy reads on the hot path. Optimizations 1 + 2 were merged upstream as ClickHouse PR #85535 and ship in ClickHouse 25.11.
  6. Optimization 3 — binary search on the sorted namespace prefix — broke the residual correlation between query duration and part count. With Opts 1 + 2 shipped, query durations were lower but still correlated with growing part count because filterPartsByPartition was still doing a linear scan. The insight: the parts list is sorted by the partitioning key, and "the first column of the partition key is namespace, which the vast majority of queries filter on, because it identifies the 'tenant'." Replacing the linear scan with a binary search on the namespace prefix narrows the surviving range to the target tenant's parts before any per-part predicate evaluation. After the March 2026 deployment: "query durations dropped by 50%. More importantly, this finally breaks correlation of query durations with the number of parts." Canonical wiki instance of patterns/binary-search-on-sorted-partition-prefix and structurally adjacent to concepts/partition-pruning (same idea, different altitude — pruning at storage-engine planner level rather than at SQL-optimizer level).
  7. The new pruning approach has a known generalization gap. Verbatim: "Unfortunately, this solution doesn't generalize that well for arbitrary query conditions (e.g. conditions such as namespace in (5, 10)). We are looking into more generic approaches like extending the query condition cache to cover part filtering." Single-tenant equality predicates prune cleanly; multi-tenant IN lists fall back to the linear scan. The honest disclosure of the limit is part of why this post is a useful canonical wiki instance — the optimization is real but not a panacea.
  8. system.trace_log is the canonical investigation substrate. "trace_log is a built-in table that records traces from the running ClickHouse server. It not only includes traces of what code is being executed, but it associates these with specific users, query IDs and other metadata, meaning you can filter down to quite precise sets of events if necessary. In our case, we wanted to look specifically at leaf SELECT queries. This was easy thanks to the available metadata in this table." Filterable by user, query ID, and other metadata; supports both CPU and Real sampling modes; is the foundation for the flame graphs that ranked the bottlenecks. Canonical wiki concept: concepts/clickhouse-trace-log.
  9. The patches are open-source contributions, not a private Cloudflare fork. Verbatim Cloudflare framing: "As with all the patches we make to ClickHouse, we try to make them generic, and eventually get them contributed to the upstream codebase. This makes it easier for us to maintain our fork, and means the community benefits from the changes we make too!" Optimizations 1 + 2 are public via PR #85535 in ClickHouse 25.11. Canonical wiki instance for patterns/upstream-the-fix in the OLAP-database substrate (sibling to the V8 / Node.js / Go / FFmpeg / jemalloc instances already on the page).
  10. The post explicitly leaves architecture-level questions open. Verbatim close: "We've bought ourselves significant breathing room, but the fundamental question remains: Was this partitioning scheme the right long- term choice? Or will we eventually need to bite the bullet and move to a different architecture? For now, our patches are holding, but the experience was a clear example of how even a well-planned change can fall victim to incorrect assumptions." The table-per-namespace alternative was rejected at design time because "thousands of tables on demand" would have required significant new automation — that trade-off is now reopen for re-evaluation as part counts continue to grow toward 160k+ per replica.
  11. The 100 GiB ZooKeeper cluster is the second-order open problem. Verbatim: "In this blog post we've only described the problems increasing part counts had on our select durations, but it has also caused problems for ZooKeeper, which tracks metadata for all the parts in ClickHouse. Perhaps one day we'll tell the story of the 100 gigabyte ZooKeeper cluster." Cross-reference signal for a future Cloudflare follow-up post on the ZooKeeper-as-cluster- metadata substrate at this scale.

Operational numbers

  • Storage scale: ClickHouse cluster fleet stores "over a hundred petabytes of data across a few dozen clusters".
  • Ready-Analytics table size: "more than 2 PiB of data by December 2024".
  • Ready-Analytics ingest rate: "millions of rows per second".
  • Tenant count: "hundreds of applications using it".
  • Standard schema (per record): 20 float fields, 20 string fields, a timestamp, and an indexID.
  • Primary key: (namespace, indexID, timestamp).
  • Original partitioning key: (day).
  • New partitioning key: (namespace, day).
  • Original retention: 31 days, uniform across tenants (the limitation that motivated the migration).
  • Storage management target: 90 % disk utilisation, enforced by a max-min-fairness algorithm that automatically shares unused namespace capacity with namespaces that need more.
  • Migration timeline: design + review late 2024; rollout begins January 2025; billing slowdown surfaces late March 2025; flame-graph investigation runs April–May 2025; Optimizations 1 + 2 deployed; March 2026 Optimization 3 deployed.
  • Part count when billing first reported the slowdown: 30,000 parts per replica.
  • Part count one year later (after all three optimizations): 160,000 parts per replica.
  • CPU-trace flame-graph signature: 45 % of leaf SELECT-query CPU time spent in filterPartsByPartition.
  • First-patch impact (heuristic-reorder in filterPartsByPartition): ~5 % improvement.
  • Real-trace flame-graph signature: >50 % of leaf SELECT-query duration spent waiting on the MergeTreeData parts mutex.
  • Optimization 3 impact: 50 % drop in query durations and breaks correlation of duration with part count.
  • Concurrent-query scale during incident: "tens of thousands of parts and hundreds of concurrent queries… all just standing in a single-file line."
  • Upstream PR: ClickHouse PR #85535.
  • First public ClickHouse release including Optimizations 1 + 2: ClickHouse 25.11.
  • Outstanding ZooKeeper-cluster scale: 100 GiB of ZooKeeper-managed metadata (qualitative; flagged as a future post topic, no further numbers).

Caveats

  • No production-fleet duration distribution disclosed. Improvements are reported as relative drops on average query duration; no p50 / p95 / p99 envelope is shown per-optimization, only "average SELECT durations" plots.
  • No before/after billing-job-completion-time numbers beyond the qualitative "running up against their hard daily deadline." The post does not disclose how close to the deadline the jobs ran post-fix or the SLO margin.
  • Optimization 3's generalization gap is real: multi-tenant IN-list queries fall back to the linear scan; the "query condition cache extension" solution is flagged as in-progress, not shipped. The post does not quantify what fraction of Ready-Analytics workload is multi-tenant.
  • No fork-vs-upstream version disclosure for Opt 3: PR #85535 covers Opts 1 + 2; Optimization 3's contribution status to upstream is not stated. The post implies Cloudflare runs a fork ("easier for us to maintain our fork") but does not name the fork's relationship to mainline beyond Opt 1 + 2.
  • No CloudFlare-side observability of the lock-contention failure mode is described until the Real-trace flame graph; "It took several days before we even had a theory" and "the metrics for individual queries showed they were not reading more data or more parts than before" are the only timing markers. The investigation arc would benefit from a published runbook contribution but the post does not include one.
  • The ZooKeeper substrate problem is named but not characterised. "Perhaps one day we'll tell the story of the 100 gigabyte ZooKeeper cluster" — the part-count- growth pressure on ZooKeeper-managed metadata is named as a real concern with one number (100 GiB) and zero mechanism / mitigation disclosure. Future-post candidate.
  • The architectural alternative (table-per-namespace) is named, dismissed at design time on operational grounds ("thousands of tables on demand"), but not re-evaluated post-incident. The closing question "will we eventually need to bite the bullet and move to a different architecture?" is left open.

Source

Last updated · 542 distilled / 1,571 read