PATTERN Cited by 1 source

Dynamic partition split async pipeline

Definition

A four-stage asynchronous pipeline for detecting wide partitions on the read path, planning a split, executing the split into a separately-named target table validated by checksums, and routing future reads transparently while keeping the original partition as a permanent fallback. The pipeline composes:

  • Detection on the read path → Kafka event
  • Planning by an async worker → reads source partition once + writes plan + pre-split checksum to a wide_row metadata table
  • Splitting by an async worker → writes split partitions to a separately-named target table + writes post-split checksum + flips status to COMPLETED only on checksum match
  • Serving on the read path → Bloom-filter gate + metadata-table lookup + delegate to existing reader against split table; original partition never deleted
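The detection stage can be sketched as a small guard on the read path. This is a minimal illustration, not Netflix's code: the threshold value, the `WidePartitionEvent` shape, and the `publish` callback (standing in for a Kafka producer) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold; the source does not state the real value.
WIDE_PARTITION_THRESHOLD_BYTES = 100 * 1024 * 1024


@dataclass(frozen=True)
class WidePartitionEvent:
    """Hypothetical event payload sent to the detection topic."""
    partition_key: str
    bytes_read: int


def maybe_emit_detection(
    partition_key: str,
    bytes_read: int,
    is_immutable: bool,
    publish: Callable[[WidePartitionEvent], None],
    threshold: int = WIDE_PARTITION_THRESHOLD_BYTES,
) -> Optional[WidePartitionEvent]:
    """Emit an event only for partitions that are both wide and immutable."""
    if bytes_read <= threshold or not is_immutable:
        return None
    event = WidePartitionEvent(partition_key, bytes_read)
    publish(event)  # stand-in for an asynchronous Kafka producer send
    return event


# Usage: collect events in a list instead of sending them to Kafka.
events = []
maybe_emit_detection("id-42", 200 * 1024 * 1024, True, events.append)
```

The read itself has already completed by the time the guard runs, so the only extra work on the read path is handing the event to the publisher.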

Canonicalised by Netflix's TimeSeries Abstraction on 2026-06-03 (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).

Pipeline diagram

                ┌──────────────────────────────────────────────────────────────────┐
                │                                                                  │
                │  TimeSeries server (read path)                                   │
                │    1. Read partition (track bytes_read_per_partition)            │
                │    2. If bytes > threshold AND immutable(partition):             │
                │         emit Kafka event                                         │
                │                                                                  │
                └────────────┬─────────────────────────────────────────────────────┘
                    ┌────────▼────────────────────────┐
                    │  Kafka topic: wide_partition_   │
                    │  detection_events               │
                    └────────────┬────────────────────┘
                ┌────────────────▼─────────────────┐
                │  Planner worker                  │
                │   - Reads entire partition once  │
                │   - Resumable from checkpoint    │
                │   - Computes pre-split checksum  │
                │   - Writes plan to wide_row meta │
                └────────────┬─────────────────────┘
                ┌────────────▼─────────────────────┐
                │  Splitter worker                 │
                │   (EventBucketPartitionSplit-    │
                │    Strategy by default)          │
                │   - Writes split partitions to   │
                │     a separate time-slice table  │
                │     (e.g. wide_data_20260328_0)  │
                │   - Caps event-bucket count for  │
                │     ultra-wide partitions        │
                │     (read-amplification cap)     │
                │   - Computes post-split checksum │
                │   - On match: status COMPLETED   │
                │   - On mismatch: stays in-flight │
                └────────────┬─────────────────────┘
                ┌────────────▼─────────────────────┐
                │  TimeSeries server periodically  │
                │  loads completed-split partition │
                │  keys into in-memory             │
                │  Bloom filters                   │
                └────────────┬─────────────────────┘
                ┌────────────▼─────────────────────┐
                │  Read path                       │
                │   1. Bloom filter check          │
                │   2. On hit: read wide_row meta  │
                │      (read-through cached)       │
                │   3. Delegate to existing        │
                │      PartitionReader against     │
                │      split table                 │
                │   4. On miss / fallback:         │
                │      read original partition     │
                └──────────────────────────────────┘
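The serving stage in the diagram can be sketched as follows. This is an illustrative sketch only: the tiny `BloomFilter` class, the dictionary-shaped metadata record, the `IN_PROGRESS` status name, and the `load_split_meta` / `read_split` / `read_original` callbacks are assumptions for the example, not the TimeSeries server's actual API. `load_split_meta` is assumed to be backed by a read-through cache.

```python
import hashlib
from typing import Any, Callable, Dict, List, Optional


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> List[int]:
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def read_partition(
    key: str,
    split_filter: BloomFilter,
    load_split_meta: Callable[[str], Optional[Dict[str, Any]]],
    read_split: Callable[[Dict[str, Any]], Any],
    read_original: Callable[[str], Any],
) -> Any:
    # 1. Bloom miss: the key has definitely not been split, skip the metadata lookup.
    if not split_filter.might_contain(key):
        return read_original(key)
    # 2. Bloom hit: confirm against metadata (false positives land here too).
    meta = load_split_meta(key)
    if meta is None or meta.get("status") != "COMPLETED":
        return read_original(key)  # fallback: original partition is still intact
    # 3. Delegate to the existing reader, pointed at the split table.
    return read_split(meta)


# Usage with in-memory stand-ins for the metadata table and the readers.
split_keys = BloomFilter()
split_keys.add("wide-1")
metadata = {"wide-1": {"status": "COMPLETED", "table": "wide_data_20260328_0"}}
result = read_partition(
    "wide-1", split_keys, metadata.get,
    lambda m: "split:" + m["table"], lambda k: "orig:" + k,
)
```

Because the original partition is never deleted, every branch that does not reach step 3 still returns correct data; the Bloom filter only decides whether the metadata lookup is worth paying for.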

Why each stage exists

Stage | Could it be replaced? | Why this design
Read-side detection | Could detect on writes | Most data is cold; read-side avoids paying detection cost on partitions nobody reads (see concepts/read-side-detection-of-storage-pathology)
Kafka event | Could call worker synchronously | Sync would block reads on remediation; Kafka decouples detection from remediation throughput
Single full read for planning | Could plan from sampling | Splits must be exact; sampling-based planning could miss row clusters
Checkpointed planner reads | Could re-read from start on failure | Wide partitions take a long time to read; checkpoints amortise cost across partial-read failures
Separate target table | Could split in place | In-place splits would fight Cassandra's compaction; separate table cleanly isolates split data
Pre/post checksum gate | Could trust the splitter | Splitter bugs would silently lose / duplicate data; checksum gate makes correctness bugs fail-loud (see concepts/checksum-validated-data-migration)
Read-amplification cap | Could split unboundedly | 1 wide partition → 100 split partitions = 100× read fan-out; cap controls this
Bloom-filter gate | Could metadata-lookup on every read | Bloom filter check is single-digit microseconds, metadata lookup is O(disk); skip metadata when no split exists
wide_row metadata table | Could compute split mapping from convention | Splits aren't deterministic from inputs (depend on planner output); metadata table records the actual mapping
Read-through cache on metadata | Could read from disk every time | Metadata reads happen on every Bloom-hit; cache is the difference between fast and slow
Same schema for split table | Could use a different schema | Same schema lets Netflix reuse PartitionReader; minimises code duplication
Original never deleted | Could delete after COMPLETED | Fallback safety is high-value; storage cost is acceptable (see patterns/keep-original-partition-as-fallback-during-split)
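The checksum gate and the read-amplification cap can be illustrated together. The source does not specify the checksum algorithm or the internals of EventBucketPartitionSplitStrategy, so everything below is a stand-in: the `(event_time_ms, payload)` row shape, the count-plus-sum-of-SHA-256 digest, the contiguous bucket assignment, and the `IN_PROGRESS` status name are assumptions chosen to keep the sketch short.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, bytes]  # hypothetical row shape: (event_time_ms, payload)


def partition_checksum(rows: Iterable[Row]) -> Tuple[int, int]:
    """Order-independent digest: (row count, sum of per-row SHA-256 mod 2**256)."""
    count, total = 0, 0
    for ts, payload in rows:
        digest = hashlib.sha256(ts.to_bytes(8, "big") + payload).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, total


def split_rows(rows: List[Row], rows_per_bucket: int, max_buckets: int) -> Dict[int, List[Row]]:
    """Spread rows over event buckets, never exceeding max_buckets (the fan-out cap)."""
    wanted = -(-len(rows) // rows_per_bucket)  # ceiling division
    num_buckets = max(1, min(wanted, max_buckets))
    buckets: Dict[int, List[Row]] = {b: [] for b in range(num_buckets)}
    for i, row in enumerate(sorted(rows)):
        buckets[i * num_buckets // len(rows)].append(row)
    return buckets


def finalize_status(pre_split: Tuple[int, int], buckets: Dict[int, List[Row]]) -> str:
    """Flip to COMPLETED only when the post-split digest matches the pre-split one."""
    post_split = partition_checksum(row for b in buckets.values() for row in b)
    return "COMPLETED" if post_split == pre_split else "IN_PROGRESS"


# Usage: 10 rows, 3 rows per bucket, at most 100 buckets.
rows = [(i, b"payload-%d" % i) for i in range(10)]
pre = partition_checksum(rows)          # planner stage
buckets = split_rows(rows, 3, 100)      # splitter stage
status = finalize_status(pre, buckets)  # gate
```

The digest is order-independent because split rows are read back bucket by bucket rather than in source order, and including the row count lets a lost or duplicated row fail the gate instead of flipping the status.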

When to use

  • Wide-partition outliers exist on a partitioned wide-column store (Cassandra, ScyllaDB, HBase) where most partitions are healthy but a small fraction need splitting.
  • The partitions to split are immutable (or there's an acceptLimit-style time bound after which partitions can be treated as immutable).
  • The downstream reader can be parameterised by partition mapping (vs hardcoded to a single table).
  • Operational safety matters more than storage cost — original partition retained as fallback.
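One way to read the immutability precondition is as a time-window check. The sketch below assumes, without confirmation from the source, that a partition covers a time bucket and that writes older than an accept limit are rejected, so the partition can be treated as immutable once that window has closed. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_effectively_immutable(bucket_end: datetime, accept_limit: timedelta, now: datetime) -> bool:
    """True once the write-acceptance window for the partition's time bucket has closed."""
    return now >= bucket_end + accept_limit


# Usage: a bucket that ended on 2026-03-28 with a hypothetical 4-hour accept limit.
bucket_end = datetime(2026, 3, 28, tzinfo=timezone.utc)
limit = timedelta(hours=4)
safe_to_split = is_effectively_immutable(bucket_end, limit, bucket_end + timedelta(days=1))
```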

When NOT to use

  • All partitions are mutable indefinitely — checksum gate cannot run; pipeline becomes much more complex.
  • The pathology is table-wide, not per-key — use the table-level auto-tuning control loop instead.
  • Storage cost dominates — the "never delete original" property doubles storage for split partitions; if storage is the constraint, schedule eventual deletion via TTL after a multi-week safety window.
  • Reads are uniform over partitions — read-side detection won't fire often enough to amortise wide-partition cost; use schema-level bucketing from day 1.
  • Strong consistency required across split / non-split — the asynchronous split → read-divert flow has a time window where the read path can land on either side.

Trade-offs

Pro | Con
Per-ID granularity: only outlier IDs treated | Detection lag: first reads after threshold breach still slow
Original retained as fallback | Storage overhead permanent
Bloom-filter gate makes routing free for non-split partitions | Bloom-filter memory grows with COMPLETED-split count
Same-schema split table → minimal code blast radius | Tied to one storage engine's partition schema
Pre/post checksum + offline Spark verification | Hash collisions theoretically possible (mitigated by offline Spark)
Read-amplification cap on ultra-wide | Cap means some partitions still wide post-split
Decoupled stages → each can scale independently | More moving parts (Kafka, planner, splitter, metadata table, Bloom filter, cache)
Phased rollout possible per dataset | Phased rollout requires extra plumbing per phase (see patterns/phased-rollout-of-read-mode)
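The Bloom-filter memory cost can be estimated with the standard sizing formulas for an optimally configured filter, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The key count and false-positive rate below are illustrative inputs, not figures from the source.

```python
import math
from typing import Tuple


def bloom_filter_size(num_keys: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(-num_keys * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_keys * math.log(2)))
    return bits, hashes


# Usage: one million COMPLETED splits at a 1% false-positive rate (illustrative).
bits, hashes = bloom_filter_size(1_000_000, 0.01)
megabytes = bits / 8 / 1_000_000
```

At these inputs the filter needs about 9.6 bits per key, or roughly 1.2 MB per server, and the size grows linearly with the number of completed splits.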

Seen in

  • sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads: canonical wiki home. Netflix TimeSeries Abstraction's full four-stage pipeline against Apache Cassandra 4.x. Detection emits to Kafka. The planner writes to the wide_row metadata table. The default split strategy, EventBucketPartitionSplitStrategy, assigns more event buckets to the same time bucket, with a cap on ultra-wide partitions. The checksum gate must pass before status becomes COMPLETED. The read path routes through a Bloom filter and a read-through-cached metadata lookup. The original partition is never deleted, and the split table uses the same schema so the existing PartitionReader is reused. Outcomes: millisecond-scale average and tail latency (versus seconds before), with 500 MB+ partitions paginable while available.