Skip to content

CONCEPT Cited by 1 source

Per-tenant retention via partitioning key

Per-tenant retention via partitioning key is the multi-tenant-analytics design idiom of extending the partitioning key with a tenant-identifier column so that each tenant's data lives in its own set of partitions, and existing partition-based retention machinery can apply different retention policies to different tenants without moving any data or running any per-tenant migration.

Canonical wiki instance: Cloudflare's Ready-Analytics migration from PARTITION BY (day) (uniform 31-day retention across hundreds of tenants) to PARTITION BY (namespace, day) (per-tenant retention from days to years), preserving the existing "drop partitions older than N days" retention job.

The problem

Multi-tenant analytics tables typically share infrastructure across many tenants — Cloudflare's Ready-Analytics has "hundreds of applications" writing to a single 2 PiB table. Retention is the canonical multi-tenant complication:

  • Some tenants need years of data retention (legal / contractual / compliance).
  • Some tenants need days (operational telemetry / short-term debugging).
  • Some need a fixed cap (storage-budget-driven).
  • A uniform retention policy across all tenants is always wrong for someone — either over-retaining (cost) or under-retaining (the use case stops working).

A uniform retention policy is what the table-design substrate makes natural: PARTITION BY (day) lets the retention job drop whole partitions atomically (canonical ClickHouse DETACH PARTITION / drop-partition substrate), but every tenant's data in a given day-partition shares that partition's lifecycle. There's no tenant-level handle on that partition.

The fix: extend the partitioning key with the tenant column

Replace PARTITION BY (day) with PARTITION BY (tenant_col, day). Each tenant now owns its own set of partitions (one per (tenant, day) tuple), and the existing partition-drop retention machinery extends naturally:

  • The retention job iterates partitions and drops those older than each tenant's configured retention.
  • No per-tenant migration is needed — tenants who keep the default (e.g. 31 days) see no change; tenants who need longer retention have their partitions kept; tenants who need shorter have theirs dropped sooner.
  • Existing per-query routing and pruning machinery works identically — every query already filters by the tenant column (because that's how multi-tenant queries work), so partition pruning still narrows to the relevant tenant's partitions before scanning.

In ClickHouse-specific terms, this adds a time-partitioned MergeTree dimension orthogonal to the existing time axis: parts are now indexed by the (tenant, day) tuple rather than just the day. Pruning works on either column or both.

The hidden cost: partition-count growth

The cost is partition count grows by O(tenants × days) where it used to grow by O(days). The data per-partition shrinks by roughly the same factor (each partition now holds one tenant's day rather than all tenants' day), but the part count in the cluster grows linearly with the tenant × retention-days product.

This matters because some database operations scale with part count rather than with rows / bytes / per-query data scanned. Specifically:

  • Query planning in ClickHouse acquires a lock on the table's parts list, copies the whole list, and filters it down to the relevant partitions for each query. Per-query data scanned is unchanged (every query still filters by tenant); per-query planning time grows with total part count. Canonical wiki instance: concepts/lock-contention-in-query-planning.
  • Cluster-metadata stores (e.g. ZooKeeper for ClickHouse in Cloudflare's deployment) track per-part metadata. Their state grows linearly with part count; at high part counts they become a separate bottleneck. Cloudflare hints at a "100 gigabyte ZooKeeper cluster" as a still-open second-order problem.
  • Background merges in MergeTree merge small parts into larger ones; the merge scheduler's per-cluster work scales with part count.

The migration that "shouldn't have" changed per-query cost did change per-cluster cost — because partition count is a hidden cost axis orthogonal to the per-query metrics teams normally watch.

Why the assumption fails

Cloudflare's design-review reasoning was "since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change." This is correct for per-query data plane work (pruning narrows to the same number of partitions; the same rows are read). It misses the per-cluster control-plane work (planning, metadata management, merge scheduling) that has to consider the whole parts list before per-query pruning can run.

The conceptual gap is:

Cost axis Scales with Affected by partitioning-key extension?
Per-query rows scanned Selected partitions × rows-per-partition No (data plane is unchanged)
Per-query parts read Selected partitions No (pruning still works)
Per-query I/O Selected partitions × parts-per-partition × granules No
Per-query planning time Total cluster part count Yes (control plane)
Cluster metadata size Total cluster part count Yes
Background merge work Total cluster part count Yes

Three rows on the bottom — three control-plane cost axes — all silently amplified by the migration.

Mitigations

When this design idiom is the right call (and it often is — the alternative "table per tenant" requires significantly more automation), the partition-count cost has known mitigations:

  • Shared-lock + cached-snapshot planner internals — reduce per-query planning cost. Canonical wiki instance: patterns/shared-lock-for-read-only-metadata + patterns/deferred-copy-cached-collection (Cloudflare's Optimizations 1 + 2 to ClickHouse).
  • Sorted-prefix binary search on the partition key — exploits the fact that the parts list is sorted by the partition key to skip most of the per-query linear scan. Canonical wiki instance: patterns/binary-search-on-sorted-partition-prefix (Cloudflare's Optimization 3).
  • Larger time bucket — partition by (tenant, week) or (tenant, month) instead of (tenant, day) — trades retention granularity for fewer partitions.
  • Tiered retention — bucket tenants into a small number of retention classes rather than per-tenant; the partitioning key becomes (retention_class, day) rather than (tenant, day), multiplying part count by the much smaller class count.

Cloudflare's post leaves the architectural-level alternative (table per tenant) open as a future possibility: "Was this partitioning scheme the right long-term choice? Or will we eventually need to bite the bullet and move to a different architecture?"

Adjacent at other altitudes

  • Application-layer multi-tenancy — multi-tenant SaaS systems often face the same trade-off at the application- schema level: per-tenant tables (clean isolation, hard to scale) vs. shared tables with tenant column (clean scaling, harder to express per-tenant lifecycle). See concepts/polymorphic-usage-tables-multi-tenant for the application-altitude cousin.
  • Lakehouse partitioning — Delta Lake / Iceberg / Hudi tables face the same partition-count-vs-pruning- granularity trade-off; "don't over-partition" is the canonical advice precisely because the file-listing / manifest-rewrite cost scales with partition count.
  • Object-storage prefix design — S3 / GCS bucket partitioning by tenant prefix has the same shape; the hidden cost is per-prefix metadata operations rather than per-query planning, but the same O(tenants × time) growth applies.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — Cloudflare's Ready-Analytics migration from PARTITION BY (day) (uniform 31-day retention) to PARTITION BY (namespace, day) (per-tenant retention). Canonicalises the design-review assumption that "per- query data scanned is unchanged → query duration is unchanged" as load-bearing and wrong: per-cluster control-plane work (planning, metadata, merges) scales with total part count, which the migration grew from thousands to hundreds of thousands of parts. The post is the wiki's foundational instance of partition-count-as- hidden-cost-axis.
Last updated · 542 distilled / 1,571 read