PATTERN Cited by 1 source
Dynamic partition split async pipeline¶
Definition¶
A four-stage asynchronous pipeline for detecting wide partitions on the read path, planning a split, executing the split into a separately-named target table validated by checksums, and routing future reads transparently while keeping the original partition as a permanent fallback. The pipeline composes:
- Detection on the read path → Kafka event
- Planning by an async worker → reads source partition once + writes plan + pre-split checksum to a
wide_rowmetadata table - Splitting by an async worker → writes split partitions to a separately-named target table + writes post-split checksum + flips status to
COMPLETEDonly on checksum match - Serving on the read path → Bloom-filter gate + metadata-table lookup + delegate to existing reader against split table; original partition never deleted
Canonicalised by Netflix's TimeSeries Abstraction in 2026-06-03 (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).
Pipeline diagram¶
┌──────────────────────────────────────────────────────────────────┐
│ │
│ TimeSeries server (read path) │
│ 1. Read partition (track bytes_read_per_partition) │
│ 2. If bytes > threshold AND immutable(partition): │
│ emit Kafka event │
│ │
└────────────┬─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Kafka topic: wide_partition_ │
│ detection_events │
└────────────┬────────────────────┘
│
┌────────────────▼─────────────────┐
│ Planner worker │
│ - Reads entire partition once │
│ - Resumable from checkpoint │
│ - Computes pre-split checksum │
│ - Writes plan to wide_row meta │
└────────────┬─────────────────────┘
▼
┌──────────────────────────────────┐
│ Splitter worker │
│ (EventBucketPartitionSplit- │
│ Strategy by default) │
│ - Writes split partitions to │
│ a separate time-slice table │
│ (e.g. wide_data_20260328_0) │
│ - Caps event-bucket count for │
│ ultra-wide partitions │
│ (read-amplification cap) │
│ - Computes post-split checksum │
│ - On match: status COMPLETED │
│ - On mismatch: stays in-flight │
└────────────┬─────────────────────┘
▼
┌──────────────────────────────────┐
│ TimeSeries server periodically │
│ loads completed-split partition │
│ keys into in-memory │
│ Bloom filters │
└────────────┬─────────────────────┘
▼
┌──────────────────────────────────┐
│ Read path │
│ 1. Bloom filter check │
│ 2. On hit: read wide_row meta │
│ (read-through cached) │
│ 3. Delegate to existing │
│ PartitionReader against │
│ split table │
│ 4. On miss / fallback: │
│ read original partition │
└──────────────────────────────────┘
Why each stage exists¶
| Stage | Could it be replaced? | Why this design |
|---|---|---|
| Read-side detection | Could detect on writes | Most data is cold; read-side avoids paying detection cost on partitions nobody reads — see concepts/read-side-detection-of-storage-pathology |
| Kafka event | Could call worker synchronously | Sync would block reads on remediation; Kafka decouples detection from remediation throughput |
| Single full read for planning | Could plan from sampling | Splits must be exact; sampling-based planning could miss row clusters |
| Checkpointed planner reads | Could re-read from start on failure | Wide partitions take a long time to read; checkpoints amortise cost across partial-read failures |
| Separate target table | Could split in place | In-place splits would fight Cassandra's compaction; separate table cleanly isolates split data |
| Pre/post checksum gate | Could trust the splitter | Splitter bugs would silently lose / duplicate data; checksum gate makes correctness bugs fail-loud — see concepts/checksum-validated-data-migration |
| Read-amplification cap | Could split unboundedly | 1 wide partition → 100 split partitions = 100× read fan-out; cap controls this |
| Bloom-filter gate | Could metadata-lookup on every read | Bloom filter check is single-digit microseconds, metadata lookup is O(disk); skip metadata when no split exists |
| wide_row metadata table | Could compute split mapping from convention | Splits aren't deterministic from inputs (depend on planner output); metadata table records the actual mapping |
| Read-through cache on metadata | Could read from disk every time | Metadata reads happen on every Bloom-hit; cache is the difference between fast and slow |
| Same schema for split table | Could use a different schema | Same schema lets Netflix reuse PartitionReader; minimises code duplication |
| Original never deleted | Could delete after COMPLETED | Fallback safety is high-value; storage cost is acceptable — see patterns/keep-original-partition-as-fallback-during-split |
When to use¶
- Wide-partition outliers exist on a partitioned wide-column store (Cassandra, ScyllaDB, HBase) where most partitions are healthy but a small fraction need splitting.
- The partitions to split are immutable (or there's an
acceptLimit-style time bound after which partitions can be treated as immutable). - The downstream reader can be parameterised by partition mapping (vs hardcoded to a single table).
- Operational safety matters more than storage cost — original partition retained as fallback.
When NOT to use¶
- All partitions are mutable indefinitely — checksum gate cannot run; pipeline becomes much more complex.
- The pathology is table-wide, not per-key — use the table-level auto-tuning control loop instead.
- Storage cost dominates — the "never delete original" property doubles storage for split partitions; if storage is the constraint, schedule eventual deletion via TTL after a multi-week safety window.
- Reads are uniform over partitions — read-side detection won't fire often enough to amortise wide-partition cost; use schema-level bucketing from day 1.
- Strong consistency required across split / non-split — the asynchronous split → read-divert flow has a time window where the read path can land on either side.
Trade-offs¶
| Pro | Con |
|---|---|
| Per-ID granularity — only outlier IDs treated | Detection lag — first reads after threshold breach still slow |
| Original retained as fallback | Storage overhead permanent |
| Bloom-filter gate makes routing free for non-split partitions | Bloom-filter memory grows with COMPLETED-split count |
| Same-schema split table → minimal code blast radius | Tied to one storage engine's partition schema |
| Pre/post checksum + offline Spark verification | Hash collisions theoretically possible (mitigated by offline Spark) |
| Read-amplification cap on ultra-wide | Cap means some partitions still wide post-split |
| Decoupled stages → each can scale independently | More moving parts (Kafka, planner, splitter, metadata table, Bloom filter, cache) |
| Phased rollout possible per dataset | Phased rollout requires extra plumbing per phase — see patterns/phased-rollout-of-read-mode |
Composes with¶
- patterns/keep-original-partition-as-fallback-during-split — the operational-safety pattern.
- patterns/bloom-filter-redirect-to-split-partition — the read-path divert.
- patterns/shadow-mode-bytes-comparison — additional correctness gate during rollout.
- patterns/phased-rollout-of-read-mode — the rollout discipline.
- patterns/auto-tuning-control-loop-on-storage-histograms — sibling table-level remediation.
Seen in¶
- sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads —
Canonical wiki home. Netflix TimeSeries Abstraction's full four-stage pipeline against
Apache Cassandra 4.x. Detection emits to Kafka;
Planner writes to
wide_rowmetadata table; default split strategyEventBucketPartitionSplitStrategyassigns more event buckets to the same time bucket with a cap on ultra-wide partitions; checksum gate must pass before status COMPLETED; Bloom-filter routing on the read path with read-through-cached metadata lookup; original partition never deleted; same schema for split table reuses existingPartitionReader; outcomes ms-scale average and tail (vs seconds before) with 500 MB+ partitions paginable while available.
Related¶
- concepts/dynamic-partition-splitting — the broader concept this pattern instantiates.
- concepts/wide-partition-problem — the failure this pipeline remediates.
- concepts/read-side-detection-of-storage-pathology — the detection-stage design choice.
- concepts/immutable-partition — the precondition that lets the checksum gate be meaningful.
- concepts/checksum-validated-data-migration — the correctness gate.
- concepts/bloom-filter — the read-path probabilistic gate.
- systems/netflix-timeseries-abstraction — the canonical instance.
- systems/apache-cassandra — the substrate.
- systems/kafka — the detection-event transport.