Skip to content

CONCEPT Cited by 1 source

Dynamic partition splitting

Definition

Dynamic partition splitting is the runtime remediation pattern of detecting individual partitions that have grown beyond the storage engine's healthy operating envelope and splitting them asynchronously into multiple smaller partitions while the system continues to serve reads — without rewriting the table, without scheduling downtime, and without deleting the original partition (which remains as a fallback). It sits one altitude up from bucketed event-time partitioning (which prevents wide partitions at write-time via fixed schema-level bucketing) and contrasts with the table-level remediation of "rewrite the whole table with a new partition strategy."

The pattern was canonicalised on the wiki by Netflix's TimeSeries team in the 2026-06-03 disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads) — the first wiki canonical example of partition splitting at the granularity of an individual logical key (a TimeSeries ID), not at the table or schema level.

Why it's needed beyond schema-level bucketing

Bucketed partitioning is the right answer at provisioning time when workload characteristics are known. Three failure modes break it:

  1. Workload is unknown or inaccurately estimated — early-project users do not have a reliable picture of production traffic; the per-namespace partitioning configuration picked at namespace creation can be wrong.
  2. Workload evolves over time — a "good" partitioning strategy on day one becomes inefficient months later.
  3. Data outliers exist"a small percentage of IDs can receive a vastly higher volume of events than the rest." This is not a table-level problem; it is a per-key problem.

The first two are addressable by the broader auto-tuning control loop that re-partitions future time slices. The third — outlier IDs that genuinely need to hold a lot of events — requires per-key remediation, which is what dynamic splitting provides.

Mechanism (Netflix TimeSeries instantiation)

The mechanism is an asynchronous three-stage pipeline (patterns/dynamic-partition-split-async-pipeline):

Detection (read-path) → Planning + Splitting (async worker) → Serving (read-path divert)
Stage Where What happens
Detection TimeSeries server, on each read Server tracks bytes-read per partition; when threshold exceeded, emits Kafka event with time_slice, time_series_id, time_bucket, event_bucket, immutable flag
Planning Async worker Reads the wide partition entirely (resumable from checkpoint), records pre-split checksum, writes plan to wide_row metadata table
Splitting Async worker Writes split partitions to a separately-named time-slice table (e.g. wide_data_20260328_0); records post-split checksum; flips status to COMPLETED only on checksum match
Serving TimeSeries server, on each read Bloom-filter gate; on hit, look up wide_row metadata; route read to split partitions; original partition remains as fallback

Three load-bearing design choices distinguish this from naive rewrite:

  • Detection on reads, not writes. "The majority of the data in the wild doesn't need this treatment." The cost: some reads on wide partitions suffer sub-optimal latency for a few seconds while the pipeline catches up. The benefit: write-path overhead is zero, and only partitions that actually get hit by reads get split. See concepts/read-side-detection-of-storage-pathology.
  • Splits target only immutable partitions in v1. "Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change…" — explicit deferral.
  • Original is never deleted. "This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency." The cost is duplicated storage; the benefit is operational safety during partial failures and the freedom to re-process previously failed splits.

Read-path routing

A Bloom filter (concepts/bloom-filter) gates the lookup: every TimeSeries server periodically loads partition keys of completed splits into in-memory Bloom filters. Each read consults the filter; on a hit, the server reads the wide_row metadata table (read-through-cached) which encodes both what to read (pre_split_data) and where to read it from (post_split_data — pointer to the split-table name plus a strategy block specifying e.g. target_event_buckets: 2, start_event_bucket: 32).

Bloom-filter check cost: single-digit microseconds.

The split table reuses the same schema as the original time-slice table, allowing the existing PartitionReader to be delegated to with no code duplication. See patterns/bloom-filter-redirect-to-split-partition.

Read amplification cap

For ultra-wide partitions, splits are not unbounded. "If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas." The cap protects read amplification — splitting 1 partition into 100 increases per-query fan-out by 100×, which is sometimes worse than tolerating a wide partition.

Trade-offs

Pro Con
Granular: only IDs that need treatment get treated Detection lag — first few reads after a partition crosses the threshold still see wide-partition latency
Original preserved as fallback → safe rollback under any partial failure Storage overhead — wide-row partitions are kept indefinitely
Existing PartitionReader reused for split table → minimal code blast radius Bloom-filter memory grows with completed-split count (must be monitored)
Read amplification capped on ultra-wide Cap means some partitions still wide post-split (better than timeout but not perfect)
Mutable partitions can be deferred to future work Mutable partitions deferred to future work means a class of wide partitions is untreated

Sibling remediations

Remediation Granularity When to use
Dynamic partition splitting (this concept) Per-ID, asynchronous, runtime Outlier IDs that legitimately need lots of events
patterns/auto-tuning-control-loop-on-storage-histograms Per-table-future-slice Workload-wide drift (table-level over- or under-partitioning)
patterns/bucketed-event-time-partitioning Provisioning-time, fixed New namespaces with predictable workload characteristics
patterns/partial-return-on-slo-breach Per-request, runtime Clients prefer latency over data completeness
Block adversarial IDs (manual) Per-ID, manual Spam / test / adversarial IDs (data-quality issue, not a partitioning problem)
Do nothing Wide partitions exist but app-level metrics are unaffected

The Netflix post explicitly enumerates all of these as a hierarchy of options, with dynamic splitting reserved for "valid and important TimeSeries IDs accumulate a high enough volume of events" where partial returns and block-list aren't acceptable.

Seen in

  • sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloadsFirst wiki canonicalisation as a named runtime remediation pattern. Netflix TimeSeries Abstraction's three-stage Detection → Planning + Splitting → Serving pipeline; per-ID granularity; immutable-partition-only v1 with mutable-partition splitting deferred; original-as-fallback preservation; Bloom-filter-gated read divert with wide_row metadata table; EventBucketPartitionSplitStrategy with read-amplification cap on ultra-wide partitions; pre/post checksum validation; Spark-via-Data Bridge offline correctness verification; outcome numbers (seconds → low-double-digit-ms average; seconds → ~200ms tail; 500 MB+ partitions paginated successfully).
Last updated · 542 distilled / 1,571 read