
Cloudflare Ready-Analytics¶

Ready-Analytics is Cloudflare's internal multi-tenant analytics platform, built on top of ClickHouse in early 2022 to simplify onboarding for its many internal teams. Rather than each Cloudflare team designing, provisioning, and operating its own ClickHouse table for analytics workloads, Ready-Analytics provides one massive table that all teams stream into, with per-team data disambiguated by a namespace column.

By December 2024 the table held more than 2 PiB of data at an ingestion rate of millions of rows per second, with hundreds of internal applications as tenants — including the billing pipeline that produces "most of Cloudflare's bills." (Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)

Schema¶

Every record uses a standard schema:

20 float fields
20 string fields
A timestamp
An indexID string
A namespace string identifying the tenant

The primary key is (namespace, indexID, timestamp). This composition is load-bearing:

The namespace prefix is the tenant-isolation axis — every query filters by namespace, so per-tenant data sits contiguously in storage.
The indexID segment is the per-tenant secondary ordering axis — it lets each tenant choose how its data is sorted within its namespace (by user ID, request ID, error code, etc.) so that tenant-specific query patterns benefit from the granule-skipping property of MergeTree's sparse primary index.
The timestamp suffix is the time-range pruning axis, which composes with the time-based partitioning (see below) to keep range queries cheap.
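The granule-skipping effect of this key order can be illustrated with a toy model. The sketch below is not ClickHouse's implementation: the rows, the tiny `GRANULE_SIZE`, and the helper `granules_to_read` are all hypothetical, and the only idea it borrows is that a sparse index stores one key per granule of sorted rows.

```python
from bisect import bisect_left

GRANULE_SIZE = 4  # toy value chosen for illustration; real granules hold thousands of rows

# Hypothetical rows keyed by (namespace, indexID, timestamp), kept in storage order.
rows = sorted(
    (namespace, index_id, ts)
    for namespace in ("billing", "dns", "waf")
    for index_id in ("a", "b", "c", "d")
    for ts in (0, 1)
)

# The sparse index keeps only the first key of each granule.
marks = rows[::GRANULE_SIZE]

def granules_to_read(marks, namespace, index_id):
    """Return the granule numbers that may hold rows for (namespace, index_id)."""
    lo = (namespace, index_id)          # sorts before every matching full key
    hi = (namespace, index_id + "\0")   # sorts after every matching full key
    # The granule just before the first mark >= lo may still contain matches.
    first = max(bisect_left(marks, lo) - 1, 0)
    # Granules whose first key is >= hi cannot contain matches.
    end = bisect_left(marks, hi)
    return range(first, end)

selected = granules_to_read(marks, "dns", "b")
```

In this toy data the lookup for ("dns", "b") selects one of six granules. The selection is conservative: it may include a boundary granule with no matching rows, but it never omits a granule that has them.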

Original partitioning + retention (pre-2025)¶

Cloudflare had been using ClickHouse since "before it had native Time-to-Live (TTL) features" (cross-reference: concepts/clickhouse-ttl-policy) — long enough that the existing retention substrate was built on partitioning rather than TTL. Ready-Analytics' original PARTITION BY was just (day); a separate retention job dropped any partition older than 31 days.
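A partition-based retention job of this kind can be sketched in a few lines. This is an assumption-laden illustration rather than Cloudflare's job: the function name `expired_day_partitions` and the sample dates are hypothetical, and the only detail taken from the text is the 31-day cutoff.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the uniform retention described in the text

def expired_day_partitions(partition_days, today, retention_days=RETENTION_DAYS):
    """Return the day partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partition_days if day < cutoff)

# Hypothetical partitions aged 0, 10, 31, 32, and 40 days.
today = date(2024, 12, 1)
partitions = [today - timedelta(days=age) for age in (0, 10, 31, 32, 40)]
to_drop = expired_day_partitions(partitions, today)
```

Only the partitions aged 32 and 40 days are selected. Because the partition key carries nothing but the day, the job has no way to treat one tenant's data differently from another's.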

This "one-size-fits-all" 31-day retention was the table's critical operational flaw: tenants with legal / contractual obligations to retain data for years could not use Ready-Analytics, and tenants who only needed a few days were forced into the 31-day quota. The non-fitting tenants opted out into bespoke ClickHouse tables — re-creating the onboarding-complexity problem Ready-Analytics existed to solve.

The migration: per-tenant retention via partitioning-key extension¶

In late 2024 Cloudflare designed and reviewed a migration that would let Ready-Analytics support per-namespace retention. Two architectural alternatives were on the table:

1. Table-per-namespace — give each tenant its own ClickHouse table. Cleanly solves retention (each table has its own TTL policy) but requires "significant new automation to manage thousands of tables on demand." Operationally heavier; was rejected.
2. Extend the partitioning key from (day) to (namespace, day). Existing retention machinery (drop partitions older than N days) keeps working but now with per-namespace granularity. Operationally lighter; was chosen.

Cloudflare chose option 2 and began the migration in January 2025, using ClickHouse's Merge table feature to query the old and new tables together, "writing all new data to the new partitioned table while the old data aged out."

This is the canonical instance of patterns/per-tenant-retention-via-partition-key-extension; see also concepts/per-tenant-retention-via-partitioning-key.
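With (namespace, day) partitions, the same drop-old-partitions logic can take a per-namespace window. The sketch below assumes a simple mapping from namespace to retention days with a 31-day fallback; the mapping, the namespace names, and the function `expired_namespace_partitions` are hypothetical and not taken from the source.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31  # assumed fallback for namespaces without a policy

def expired_namespace_partitions(partitions, retention_by_namespace, today):
    """Return (namespace, day) partitions that are past their namespace's retention."""
    expired = []
    for namespace, day in partitions:
        days = retention_by_namespace.get(namespace, DEFAULT_RETENTION_DAYS)
        if day < today - timedelta(days=days):
            expired.append((namespace, day))
    return expired

today = date(2025, 6, 1)
# Hypothetical policies: one long-retention tenant, one short-retention tenant.
retention = {"billing": 730, "debug-logs": 3}
partitions = [
    ("billing", today - timedelta(days=400)),
    ("debug-logs", today - timedelta(days=5)),
    ("debug-logs", today - timedelta(days=2)),
    ("other", today - timedelta(days=40)),
]
to_drop = expired_namespace_partitions(partitions, retention, today)
```

The 400-day-old billing partition survives while the 5-day-old debug-logs partition is dropped, which is the behavior a single (day) key could not express.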

Storage management — max-min-fairness at 90 % utilisation¶

The new partitioning scheme also enabled a sophisticated storage-management layer. Cloudflare's design:

Set a target disk utilisation of 90 %.
Use a max-min-fairness algorithm to "share" available space across namespaces.
Namespaces using less than their fair share automatically cede their unused capacity to namespaces that need more.

This is what lets Cloudflare "confidently run our clusters at 90% utilisation" — high enough to be operationally meaningful, low enough that fairness-driven rebalancing absorbs short-term tenant bursts without operator intervention.
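Max-min fairness is commonly computed by "water-filling": satisfy the smallest demands first and split what remains evenly among the rest. The sketch below shows that textbook procedure. How Cloudflare defines a namespace's demand, and how an allocation is turned into retention decisions, is not described in the source, so the demand figures and the function name `max_min_fair_share` are assumptions.

```python
def max_min_fair_share(capacity, demands):
    """Allocate `capacity` across `demands` (name -> requested amount) max-min fairly."""
    allocation = {}
    remaining = float(capacity)
    # Visit namespaces from the smallest demand to the largest.
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair_share = remaining / len(pending)
        name, demand = pending[0]
        if demand <= fair_share:
            # Fully satisfied; whatever it does not use stays in the pool.
            allocation[name] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Every remaining namespace wants more than the fair share: split evenly.
            for other, _ in pending:
                allocation[other] = fair_share
            break
    return allocation

# Hypothetical cluster: 1000 units of disk, 90 % usable under the utilisation target.
usable = 1000 * 90 // 100
shares = max_min_fair_share(usable, {"a": 100, "b": 300, "c": 600, "d": 200})
```

Here the three smaller namespaces receive everything they ask for, and the largest one receives the remaining 300 units rather than its requested 600.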

The hidden cost: planner lock contention¶

The migration's load-bearing assumption at design review was that per-query work would be unaffected: every query filters by namespace, so the number of parts read by any single query would not change. The assumption was correct about per-query data scanned and wrong about per-query duration, because ClickHouse's query planner does per-cluster work (locking and copying the entire parts list) before per-query work (filtering the copy down). With the extended partitioning key, the total part count grew roughly in proportion to the number of namespaces multiplied by the number of day partitions, and planner work grew with it.
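The lock-copy-filter sequence can be modelled in miniature to show where the hidden cost sits. This is a toy model, not ClickHouse code: `PartsRegistry`, `build_parts`, and the one-part-per-partition simplification are assumptions made for illustration.

```python
import threading

class PartsRegistry:
    """Toy stand-in for a table's parts list, assuming one part per (namespace, day)."""

    def __init__(self, parts):
        self._lock = threading.Lock()  # exclusive lock that every query must take
        self._parts = list(parts)
        self.parts_copied = 0          # instrumentation: total elements copied

    def plan(self, namespace):
        # Step 1: lock and copy the whole list; cost grows with the total part count.
        with self._lock:
            snapshot = list(self._parts)
            self.parts_copied += len(snapshot)
        # Step 2: filter the copy down to the parts this query needs.
        return [part for part in snapshot if part[0] == namespace]

def build_parts(namespaces, days):
    return [(f"ns-{n:03d}", day) for n in range(namespaces) for day in range(days)]

registry = PartsRegistry(build_parts(namespaces=100, days=31))
selected = registry.plan("ns-007")
```

The query selects 31 parts but copies 3,100 while holding the lock. Doubling the number of namespaces doubles the copy and leaves the selected set unchanged, which is the shape of the mismatch between data scanned and query duration.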

Two months after rollout the billing pipeline (time-critical, hard daily deadline) started missing its window. By the time the team plotted query duration vs. cluster part count, the correlation was undeniable: 30,000 parts per replica at the failure point. See:

concepts/lock-contention-in-query-planning — the failure class.
concepts/cpu-vs-real-flame-graph — the diagnostic trick that exposed the lock contention.
concepts/clickhouse-trace-log — the substrate that fed the flame graphs.

The architectural lesson: partition count is a hidden cost axis distinct from rows-read, bytes-read, or rows-per-partition. A migration that leaves per-query data unchanged can still slow queries down dramatically if it inflates total partition count.

Outcome — three optimizations to the upstream planner¶

Cloudflare landed three patches against ClickHouse's MergeTreeData to break the planner bottleneck:

1. Shared lock for read-only planner access — patterns/shared-lock-for-read-only-metadata. Massive immediate drop in query duration.
2. Deferred-copy cached parts list — patterns/deferred-copy-cached-collection. Read-only operations read from a snapshot; writes regenerate. Eliminates the per-query vector-copy.
3. Binary search on the sorted namespace prefix — patterns/binary-search-on-sorted-partition-prefix. Exploits the fact that the parts list is sorted by the partitioning key to skip most of the linear scan. Breaks the residual correlation between query duration and part count.

Optimizations 1 and 2 shipped upstream as ClickHouse PR #85535 in ClickHouse 25.11, the canonical patterns/upstream-the-fix instance.
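The snapshot and binary-search ideas can be sketched together. The code below is a loose Python analogue and not the upstream C++ patch: `SnapshotPartsList` is a hypothetical class, it rebuilds the snapshot eagerly on every write for simplicity (the real change may defer that work), and it omits the shared lock because Python's standard library has no reader-writer lock.

```python
from bisect import bisect_left

class SnapshotPartsList:
    """Readers share an immutable sorted snapshot; writers swap in a new one."""

    def __init__(self, parts):
        self._snapshot = tuple(sorted(parts))  # (namespace, day) tuples

    def add_part(self, part):
        # A write builds a fresh snapshot; readers holding the old one are unaffected.
        self._snapshot = tuple(sorted(self._snapshot + (part,)))

    def parts_for(self, namespace):
        snapshot = self._snapshot  # take a reference; no per-query copy of the full list
        # Because the snapshot is sorted by namespace first, two binary searches
        # bound the namespace's contiguous run without scanning other tenants' parts.
        lo = bisect_left(snapshot, (namespace,))
        hi = bisect_left(snapshot, (namespace + "\0",))
        return snapshot[lo:hi]

# Hypothetical parts: 50 namespaces with 10 day partitions each.
parts = [(f"ns-{n:03d}", f"2025-01-{d:02d}") for n in range(50) for d in range(1, 11)]
parts_list = SnapshotPartsList(parts)
mine = parts_for_ns = parts_list.parts_for("ns-007")
```

Per-query cost in this sketch depends on the logarithm of the total part count plus the size of the tenant's own run, rather than on the full list length.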

A year after the incident — at 160,000 parts per replica — query durations are stable, vindicating the optimization stack but leaving open the architectural question: "Was this partitioning scheme the right long-term choice? Or will we eventually need to bite the bullet and move to a different architecture?"

Open second-order problem: ZooKeeper part metadata¶

Beyond the planner duration issue, the same part-count growth has caused problems for ZooKeeper, which Cloudflare's ClickHouse cluster uses to track metadata for all parts. Cloudflare hints at a future post: "Perhaps one day we'll tell the story of the 100 gigabyte ZooKeeper cluster." The source gives that single number and discloses no mechanism, so it is flagged here as an open follow-up issue that the three planner-side optimizations do not address.

Seen in¶

sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — Cloudflare engineering retrospective on the year-long investigation. Canonicalises Ready-Analytics as the multi-tenant analytics substrate, the schema (20 floats / 20 strings / timestamp / indexID, primary key (namespace, indexID, timestamp)), the original (day) + uniform 31-day retention design, the migration to (namespace, day) partitioning + per-tenant retention, the max-min-fairness 90 % storage utilisation policy, the Merge-table-based old-to-new cutover, the 2 PiB scale point at December 2024, the "hundreds of applications" tenant count, "millions of rows per second" ingest, and the part-count growth from 30k → 160k parts/replica over the incident year. Names the table-per-namespace alternative as the rejected design and the open architectural question as still unsettled at post time.

systems/clickhouse — the underlying database; this source canonicalises the planner-lock-contention failure class against MergeTree at this scale.
systems/apache-zookeeper — the part-metadata substrate named as the open second-order problem.
concepts/per-tenant-retention-via-partitioning-key — the design idiom Ready-Analytics adopted.
concepts/clickhouse-data-part — the unit whose count growth drove the bottleneck.
concepts/max-min-fairness-storage-fair-share — the fair-share storage policy.
concepts/lock-contention-in-query-planning — the failure class.
patterns/per-tenant-retention-via-partition-key-extension — the load-bearing architectural pattern.
companies/cloudflare — operator.