PATTERN Cited by 1 source

Dynamic partition split async pipeline¶

Definition¶

A four-stage asynchronous pipeline for detecting wide partitions on the read path, planning a split, executing the split into a separately-named target table validated by checksums, and routing future reads transparently while keeping the original partition as a permanent fallback. The pipeline composes:

Detection on the read path → Kafka event
Planning by an async worker → reads source partition once + writes plan + pre-split checksum to a wide_row metadata table
Splitting by an async worker → writes split partitions to a separately-named target table + writes post-split checksum + flips status to COMPLETED only on checksum match
Serving on the read path → Bloom-filter gate + metadata-table lookup + delegate to existing reader against split table; original partition never deleted

Canonicalised by Netflix's TimeSeries Abstraction in 2026-06-03 (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).

Pipeline diagram¶

                ┌──────────────────────────────────────────────────────────────────┐
                │                                                                  │
                │  TimeSeries server (read path)                                   │
                │    1. Read partition (track bytes_read_per_partition)            │
                │    2. If bytes > threshold AND immutable(partition):             │
                │         emit Kafka event                                         │
                │                                                                  │
                └────────────┬─────────────────────────────────────────────────────┘
                             │
                             ▼
                    ┌─────────────────────────────────┐
                    │  Kafka topic: wide_partition_   │
                    │  detection_events               │
                    └────────────┬────────────────────┘
                                 │
                ┌────────────────▼─────────────────┐
                │  Planner worker                  │
                │   - Reads entire partition once  │
                │   - Resumable from checkpoint    │
                │   - Computes pre-split checksum  │
                │   - Writes plan to wide_row meta │
                └────────────┬─────────────────────┘
                             ▼
                ┌──────────────────────────────────┐
                │  Splitter worker                 │
                │   (EventBucketPartitionSplit-    │
                │    Strategy by default)          │
                │   - Writes split partitions to   │
                │     a separate time-slice table  │
                │     (e.g. wide_data_20260328_0)  │
                │   - Caps event-bucket count for  │
                │     ultra-wide partitions        │
                │     (read-amplification cap)     │
                │   - Computes post-split checksum │
                │   - On match: status COMPLETED   │
                │   - On mismatch: stays in-flight │
                └────────────┬─────────────────────┘
                             ▼
                ┌──────────────────────────────────┐
                │  TimeSeries server periodically  │
                │  loads completed-split partition │
                │  keys into in-memory             │
                │  Bloom filters                   │
                └────────────┬─────────────────────┘
                             ▼
                ┌──────────────────────────────────┐
                │  Read path                       │
                │   1. Bloom filter check          │
                │   2. On hit: read wide_row meta  │
                │      (read-through cached)       │
                │   3. Delegate to existing        │
                │      PartitionReader against     │
                │      split table                 │
                │   4. On miss / fallback:         │
                │      read original partition     │
                └──────────────────────────────────┘

Why each stage exists¶

Stage	Could it be replaced?	Why this design
Read-side detection	Could detect on writes	Most data is cold; read-side avoids paying detection cost on partitions nobody reads — see concepts/read-side-detection-of-storage-pathology
Kafka event	Could call worker synchronously	Sync would block reads on remediation; Kafka decouples detection from remediation throughput
Single full read for planning	Could plan from sampling	Splits must be exact; sampling-based planning could miss row clusters
Checkpointed planner reads	Could re-read from start on failure	Wide partitions take a long time to read; checkpoints amortise cost across partial-read failures
Separate target table	Could split in place	In-place splits would fight Cassandra's compaction; separate table cleanly isolates split data
Pre/post checksum gate	Could trust the splitter	Splitter bugs would silently lose / duplicate data; checksum gate makes correctness bugs fail-loud — see concepts/checksum-validated-data-migration
Read-amplification cap	Could split unboundedly	1 wide partition → 100 split partitions = 100× read fan-out; cap controls this
Bloom-filter gate	Could metadata-lookup on every read	Bloom filter check is single-digit microseconds, metadata lookup is O(disk); skip metadata when no split exists
wide_row metadata table	Could compute split mapping from convention	Splits aren't deterministic from inputs (depend on planner output); metadata table records the actual mapping
Read-through cache on metadata	Could read from disk every time	Metadata reads happen on every Bloom-hit; cache is the difference between fast and slow
Same schema for split table	Could use a different schema	Same schema lets Netflix reuse `PartitionReader`; minimises code duplication
Original never deleted	Could delete after COMPLETED	Fallback safety is high-value; storage cost is acceptable — see patterns/keep-original-partition-as-fallback-during-split

When to use¶

Wide-partition outliers exist on a partitioned wide-column store (Cassandra, ScyllaDB, HBase) where most partitions are healthy but a small fraction need splitting.
The partitions to split are immutable (or there's an acceptLimit-style time bound after which partitions can be treated as immutable).
The downstream reader can be parameterised by partition mapping (vs hardcoded to a single table).
Operational safety matters more than storage cost — original partition retained as fallback.

When NOT to use¶

All partitions are mutable indefinitely — checksum gate cannot run; pipeline becomes much more complex.
The pathology is table-wide, not per-key — use the table-level auto-tuning control loop instead.
Storage cost dominates — the "never delete original" property doubles storage for split partitions; if storage is the constraint, schedule eventual deletion via TTL after a multi-week safety window.
Reads are uniform over partitions — read-side detection won't fire often enough to amortise wide-partition cost; use schema-level bucketing from day 1.
Strong consistency required across split / non-split — the asynchronous split → read-divert flow has a time window where the read path can land on either side.

Trade-offs¶

Pro	Con
Per-ID granularity — only outlier IDs treated	Detection lag — first reads after threshold breach still slow
Original retained as fallback	Storage overhead permanent
Bloom-filter gate makes routing free for non-split partitions	Bloom-filter memory grows with COMPLETED-split count
Same-schema split table → minimal code blast radius	Tied to one storage engine's partition schema
Pre/post checksum + offline Spark verification	Hash collisions theoretically possible (mitigated by offline Spark)
Read-amplification cap on ultra-wide	Cap means some partitions still wide post-split
Decoupled stages → each can scale independently	More moving parts (Kafka, planner, splitter, metadata table, Bloom filter, cache)
Phased rollout possible per dataset	Phased rollout requires extra plumbing per phase — see patterns/phased-rollout-of-read-mode

Composes with¶

patterns/keep-original-partition-as-fallback-during-split — the operational-safety pattern.
patterns/bloom-filter-redirect-to-split-partition — the read-path divert.
patterns/shadow-mode-bytes-comparison — additional correctness gate during rollout.
patterns/phased-rollout-of-read-mode — the rollout discipline.
patterns/auto-tuning-control-loop-on-storage-histograms — sibling table-level remediation.

Seen in¶

sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads — Canonical wiki home. Netflix TimeSeries Abstraction's full four-stage pipeline against Apache Cassandra 4.x. Detection emits to Kafka; Planner writes to wide_row metadata table; default split strategy EventBucketPartitionSplitStrategy assigns more event buckets to the same time bucket with a cap on ultra-wide partitions; checksum gate must pass before status COMPLETED; Bloom-filter routing on the read path with read-through-cached metadata lookup; original partition never deleted; same schema for split table reuses existing PartitionReader; outcomes ms-scale average and tail (vs seconds before) with 500 MB+ partitions paginable while available.

concepts/dynamic-partition-splitting — the broader concept this pattern instantiates.
concepts/wide-partition-problem — the failure this pipeline remediates.
concepts/read-side-detection-of-storage-pathology — the detection-stage design choice.
concepts/immutable-partition — the precondition that lets the checksum gate be meaningful.
concepts/checksum-validated-data-migration — the correctness gate.
concepts/bloom-filter — the read-path probabilistic gate.
systems/netflix-timeseries-abstraction — the canonical instance.
systems/apache-cassandra — the substrate.
systems/kafka — the detection-event transport.