Skip to content

PATTERN Cited by 1 source

Per-tenant retention via partition-key extension

Intent

In a multi-tenant analytics or storage system that uses partition-based retention (drop partitions older than N days), let different tenants apply different retention policies without running per-tenant migrations or maintaining per-tenant tables.

Solution: extend the partitioning key with a tenant- identifier column at the front (e.g., PARTITION BY (tenant, day) instead of just PARTITION BY (day)), so each tenant's data lives in its own set of partitions. Then express per-tenant retention as "drop partitions where the tenant column is T and the day is older than the retention configured for T." The existing partition-drop machinery is repurposed; no schema migration is required for existing tenants who keep the default policy.

Canonical instance: Cloudflare Ready-Analytics

Cloudflare's Ready-Analytics table held 2 PiB of analytics data across hundreds of namespaces (tenants). The original PARTITION BY (day) applied a uniform 31-day retention to every tenant — the critical operational flaw that motivated the migration.

Two architectural alternatives:

  1. Table-per-namespace: clean isolation, but requires "thousands of tables on demand" and significant new automation. Rejected on operational grounds.
  2. PARTITION BY (namespace, day): existing retention machinery extends naturally. Chosen.

The migration ran from January 2025 onward. Each tenant's data now lives in its own per-day partitions; per- tenant retention is a per-namespace iteration over the partition list. The existing Merge table feature combined old and new tables during the migration, with new data going to the new partitioned table while old data aged out under the original schema.

(Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)

Trade-offs vs. the alternatives

Axis (day) uniform (tenant, day) partition-key extension Table-per-tenant
Per-tenant retention granularity None Full (any retention per tenant) Full
Schema migration None One ALTER (or rebuild via Merge) Per-tenant table create
Per-query routing Single namespace filter Same single namespace filter Routing layer per query
Partition-pruning efficiency High (single partition per day) High (single partition per (tenant, day)) N/A (one table per tenant)
Cluster part count O(days) O(tenants × days) O(tenants × days) (across tables)
Cluster table count O(1) O(1) O(tenants)
Operational substrate Existing retention job Existing retention job (works as-is) Per-tenant table lifecycle automation
Hidden cost None Per-cluster planner / metadata work scales with part count Per-cluster table-list / catalog work scales with table count

Both non-default options pay an O(tenants) cost. The partition-key extension pushes that cost into part count (planner / metadata bottleneck — see concepts/lock-contention-in-query-planning); the table-per-tenant path pushes that cost into table count (catalog / control-plane / automation bottleneck — see concepts/catalog-bloat-multi-tenant).

The pattern wins when:

  • Per-tenant table-list automation is heavy and infrastructure-team bandwidth is limited.
  • Per-cluster part-count cost can be capped or mitigated.
  • Tenant count × retention-days is high enough to want per-tenant retention but not so high that part-count cost is structurally unbounded.
  • The storage engine supports partition-key prefix pruning efficiently (a sorted-prefix binary search on the partition key — see patterns/binary-search-on-sorted-partition-prefix — is a hard requirement at scale).

Implementation steps

  1. Confirm every read query filters by the tenant column. If some queries don't, partition pruning fails and the new scheme is strictly worse than the old.
  2. Decide the new partition-key column order(tenant, day) (tenant prefix, time suffix) is the canonical Cloudflare order; (day, tenant) would sacrifice tenant-prefix pruning for time-prefix pruning. Pick the order that matches the dominant query shape.
  3. Migrate data. ClickHouse's Merge table feature is the textbook approach: create the new partitioned table, write all new data there, leave the old table to age out under its original retention. Reads union the two via a Merge view.
  4. Express per-tenant retention in the retention job: per-namespace lookup of retention-days, then iterate matching partitions, drop those older than retention.
  5. Monitor part-count growth. Plot total parts per replica over time; correlate with query-duration percentiles. The point at which the correlation becomes visible is when planner-side mitigations become urgent.
  6. Land planner-side mitigations as part-count grows. Cloudflare's stack: patterns/shared-lock-for-read-only-metadata + patterns/deferred-copy-cached-collection + patterns/binary-search-on-sorted-partition-prefix. The first two ship upstream as ClickHouse PR #85535.

Hard problems

  • Partition-count growth is unbounded in tenant × retention-days product. At Cloudflare scale (hundreds of namespaces × years for some tenants) this product is large. Mitigations buy headroom but don't change the asymptotic shape.
  • Tenant-bursty insert rate can create many small parts per tenant per day, multiplying part count further. Tuning merge thresholds to consolidate small parts is part of the operational picture.
  • Cross-tenant aggregation queries (queries that don't filter to a specific tenant) lose partition-pruning benefit and revert to scanning all parts. Such queries also lose the binary-search-on-prefix optimisation.
  • The fundamental architectural question"Was this partitioning scheme the right long-term choice? Or will we eventually need to bite the bullet and move to a different architecture?" — is left open by Cloudflare's post. The pattern works at 160k parts/replica with all three optimisations; whether it works at 1M parts/replica is unanswered.
  • Cluster-metadata-store pressure: the same part-count growth that pressured the planner also pressured ZooKeeper (Cloudflare hints at a "100 GiB ZooKeeper cluster"). Operators adopting this pattern should monitor the metadata-store size as a separate cost axis.

Adjacent patterns

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance. Cloudflare's Ready-Analytics migration from (day) to (namespace, day) partitioning, the rejected table-per-namespace alternative, the post-migration hidden-cost realisation in query planning, and the three-patch upstream-the-fix mitigation stack. Names the pattern's load-bearing assumption ("per-query parts read shouldn't change") as correct-but-insufficient and the architectural-level question as still open.
Last updated · 542 distilled / 1,571 read