CONCEPT Cited by 1 source
Dynamic partition splitting¶
Definition¶
Dynamic partition splitting is the runtime remediation pattern of detecting individual partitions that have grown beyond the storage engine's healthy operating envelope and splitting them asynchronously into multiple smaller partitions while the system continues to serve reads — without rewriting the table, without scheduling downtime, and without deleting the original partition (which remains as a fallback). It sits one altitude up from bucketed event-time partitioning (which prevents wide partitions at write-time via fixed schema-level bucketing) and contrasts with the table-level remediation of "rewrite the whole table with a new partition strategy."
The pattern was canonicalised on the wiki by Netflix's TimeSeries team in the 2026-06-03 disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads) — the first wiki canonical example of partition splitting at the granularity of an individual logical key (a TimeSeries ID), not at the table or schema level.
Why it's needed beyond schema-level bucketing¶
Bucketed partitioning is the right answer at provisioning time when workload characteristics are known. Three failure modes break it:
- Workload is unknown or inaccurately estimated — early-project users do not have a reliable picture of production traffic; the per-namespace partitioning configuration picked at namespace creation can be wrong.
- Workload evolves over time — a "good" partitioning strategy on day one becomes inefficient months later.
- Data outliers exist — "a small percentage of IDs can receive a vastly higher volume of events than the rest." This is not a table-level problem; it is a per-key problem.
The first two are addressable by the broader auto-tuning control loop that re-partitions future time slices. The third — outlier IDs that genuinely need to hold a lot of events — requires per-key remediation, which is what dynamic splitting provides.
Mechanism (Netflix TimeSeries instantiation)¶
The mechanism is an asynchronous three-stage pipeline (patterns/dynamic-partition-split-async-pipeline):
| Stage | Where | What happens |
|---|---|---|
| Detection | TimeSeries server, on each read | Server tracks bytes-read per partition; when threshold exceeded, emits Kafka event with time_slice, time_series_id, time_bucket, event_bucket, immutable flag |
| Planning | Async worker | Reads the wide partition entirely (resumable from checkpoint), records pre-split checksum, writes plan to wide_row metadata table |
| Splitting | Async worker | Writes split partitions to a separately-named time-slice table (e.g. wide_data_20260328_0); records post-split checksum; flips status to COMPLETED only on checksum match |
| Serving | TimeSeries server, on each read | Bloom-filter gate; on hit, look up wide_row metadata; route read to split partitions; original partition remains as fallback |
Three load-bearing design choices distinguish this from naive rewrite:
- Detection on reads, not writes. "The majority of the data in the wild doesn't need this treatment." The cost: some reads on wide partitions suffer sub-optimal latency for a few seconds while the pipeline catches up. The benefit: write-path overhead is zero, and only partitions that actually get hit by reads get split. See concepts/read-side-detection-of-storage-pathology.
- Splits target only immutable partitions in v1. "Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change…" — explicit deferral.
- Original is never deleted. "This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency." The cost is duplicated storage; the benefit is operational safety during partial failures and the freedom to re-process previously failed splits.
Read-path routing¶
A Bloom filter (concepts/bloom-filter) gates the lookup: every TimeSeries server periodically loads partition keys of completed splits into in-memory Bloom filters. Each read consults the filter; on a hit, the server reads the wide_row metadata table (read-through-cached) which encodes both what to read (pre_split_data) and where to read it from (post_split_data — pointer to the split-table name plus a strategy block specifying e.g. target_event_buckets: 2, start_event_bucket: 32).
Bloom-filter check cost: single-digit microseconds.
The split table reuses the same schema as the original time-slice table, allowing the existing PartitionReader to be delegated to with no code duplication. See patterns/bloom-filter-redirect-to-split-partition.
Read amplification cap¶
For ultra-wide partitions, splits are not unbounded. "If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas." The cap protects read amplification — splitting 1 partition into 100 increases per-query fan-out by 100×, which is sometimes worse than tolerating a wide partition.
Trade-offs¶
| Pro | Con |
|---|---|
| Granular: only IDs that need treatment get treated | Detection lag — first few reads after a partition crosses the threshold still see wide-partition latency |
| Original preserved as fallback → safe rollback under any partial failure | Storage overhead — wide-row partitions are kept indefinitely |
Existing PartitionReader reused for split table → minimal code blast radius |
Bloom-filter memory grows with completed-split count (must be monitored) |
| Read amplification capped on ultra-wide | Cap means some partitions still wide post-split (better than timeout but not perfect) |
| Mutable partitions can be deferred to future work | Mutable partitions deferred to future work means a class of wide partitions is untreated |
Sibling remediations¶
| Remediation | Granularity | When to use |
|---|---|---|
| Dynamic partition splitting (this concept) | Per-ID, asynchronous, runtime | Outlier IDs that legitimately need lots of events |
| patterns/auto-tuning-control-loop-on-storage-histograms | Per-table-future-slice | Workload-wide drift (table-level over- or under-partitioning) |
| patterns/bucketed-event-time-partitioning | Provisioning-time, fixed | New namespaces with predictable workload characteristics |
| patterns/partial-return-on-slo-breach | Per-request, runtime | Clients prefer latency over data completeness |
| Block adversarial IDs (manual) | Per-ID, manual | Spam / test / adversarial IDs (data-quality issue, not a partitioning problem) |
| Do nothing | — | Wide partitions exist but app-level metrics are unaffected |
The Netflix post explicitly enumerates all of these as a hierarchy of options, with dynamic splitting reserved for "valid and important TimeSeries IDs accumulate a high enough volume of events" where partial returns and block-list aren't acceptable.
Seen in¶
- sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads —
First wiki canonicalisation as a named runtime remediation pattern. Netflix TimeSeries Abstraction's
three-stage Detection → Planning + Splitting → Serving pipeline; per-ID granularity;
immutable-partition-only v1 with mutable-partition splitting deferred; original-as-fallback
preservation; Bloom-filter-gated read divert with
wide_rowmetadata table;EventBucketPartitionSplitStrategywith read-amplification cap on ultra-wide partitions; pre/post checksum validation; Spark-via-Data Bridge offline correctness verification; outcome numbers (seconds → low-double-digit-ms average; seconds → ~200ms tail; 500 MB+ partitions paginated successfully).
Related¶
- concepts/wide-partition-problem — the failure this remediates.
- concepts/over-partitioning — the opposite failure mode (table-level), addressed by patterns/auto-tuning-control-loop-on-storage-histograms not by dynamic splitting.
- concepts/immutable-partition — load-bearing precondition for safe in-place splitting.
- concepts/read-side-detection-of-storage-pathology — the design choice to detect on reads instead of writes.
- concepts/checksum-validated-data-migration — the correctness gate.
- concepts/bloom-filter — the read-path probabilistic gate that makes routing free for non-split partitions.
- concepts/read-amplification — the trade-off the splitting cap protects.
- concepts/partition-strategy — the broader concept space.
- systems/netflix-timeseries-abstraction — the canonical instance.
- systems/apache-cassandra — the substrate where this fight lives.
- patterns/dynamic-partition-split-async-pipeline — the canonical pipeline shape.
- patterns/auto-tuning-control-loop-on-storage-histograms — sibling table-level remediation.
- patterns/keep-original-partition-as-fallback-during-split — the operational-safety pattern.
- patterns/bloom-filter-redirect-to-split-partition — the read-path divert pattern.
- patterns/bucketed-event-time-partitioning — the provisioning-time prevention pattern.