
Netflix — Dynamically Splitting Wide Partitions in Cassandra for Time Series Workloads

Summary

Netflix's TimeSeries Abstraction team — Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk, and Kartik Sathyanarayanan — discloses how they fight the wide-partition problem at the granularity of individual TimeSeries IDs on top of Apache Cassandra 4.x. Two complementary mechanisms ship: (1) a table-level auto-tuning control loop (patterns/auto-tuning-control-loop-on-storage-histograms) that watches nodetool tablehistograms via a Cassandra virtual table and rewrites future Time-Slice partition strategy when the p99 partition size deviates from a configured 2–10 MiB density target — fixing concepts/over-partitioning as well as under-partitioning; and (2) per-ID dynamic partition splitting — an asynchronous pipeline that detects wide partitions on reads rather than writes (concepts/read-side-detection-of-storage-pathology), splits immutable partitions only (concepts/immutable-partition) into multiple event-bucket partitions in a separately-named time-slice table, validates the split with pre/post checksums (concepts/checksum-validated-data-migration), and serves reads transparently via a Bloom-filter-gated read-path divert that consults a wide_row metadata table to route the query to the post-split partitions. The original wide partition is never deleted — it remains as a fallback (patterns/keep-original-partition-as-fallback-during-split). Two mid-stack remedies ship alongside for cases where dynamic split isn't appropriate: partial-return on SLO breach (patterns/partial-return-on-slo-breach) for clients that prefer latency over completeness, and manual ID block-listing for adversarial / spam IDs.

Key takeaways

  1. Wide partitions on Cassandra cause seconds-scale tail latency, GC storms, high CPU, thread queueing, and read timeouts — not just slowness. "In extreme cases, if most of the reads target wide partitions, we can see Garbage Collection pauses, high CPU utilization and thread queueing." The article shows median reads at single-digit ms compared to seconds-scale tail on wide-partition tables. (Source: this post.)

  2. Provisioning-time partition strategy is necessary but insufficient. Netflix's TimeSeries provisioning pipeline runs Monte-Carlo simulations over user-provided workload characteristics to pick optimal partitioning configuration per namespace, but the strategy fails when "workload is unknown or inaccurately estimated," "workload evolves over time," or "data outliers exist." Discrete Time Slices give an escape hatch because each new slice can use a different partitioning strategy without rewriting old data.

  3. Auto-tuning at the table level: future Time Slices get re-partitioned when histograms drift. A background worker exposes nodetool tablehistograms percentile distributions through a Cassandra virtual table, watches for partitions outside the configured 2–10 MiB density window, and proposes a new partitioning config for the next Time Slice. The post shows a live example of fixing over-partitioning (60-second time buckets producing < 10 KB partitions) by widening the time bucket to 7 days (time_bucket interval: 60s -> 604800s). Past slices are not rewritten. (Source: this post — "DynamicTimeSliceConfigWorker" example.)

  4. Some IDs need per-ID treatment, not table-wide treatment. Even after table-level tuning, "valid and important TimeSeries IDs accumulate a high enough volume of events" to be wide. Three mid-stack options: do nothing (if app metrics aren't impacted); partial-return (return whatever data was collected before the SLO deadline + a continuation token — works for latency-prioritising clients); block IDs (remove adversarial IDs entirely — "dgwts.config.<dataset>.block.Ids: \"<tsid-1>, <tsid-2>, <tsid-3>\""). (Source: this post.)

  5. The dynamic-partition-split pipeline has three stages: Detection, Planning & Splitting, Serving Reads. Detection happens on the read path (intentionally, not the write path — "the majority of the data in the wild doesn't need this treatment"). Each read tracks bytes-read-per-partition; if the byte threshold is exceeded, the server emits a Kafka event whose payload includes time_slice, time_series_id, time_bucket, event_bucket, an immutable flag computed by the server (TimeSeries servers can determine when a partition is no longer receiving writes), and a reserved version for future invalidation. (Source: this post — verbatim Kafka event JSON.)

  6. Only immutable partitions are split, on purpose. "Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change by focusing on immutable partitions, while still meaningfully reducing caller timeouts." This is canonical reduce-surface-area discipline applied to a high-blast-radius migration — the design choice to defer mutable-partition splits to future work is named explicitly.

  7. Splitting goes through a checkpointed planner. The Planner reads the entire partition once to compute an accurate split plan; partial reads can be resumed from the last saved checkpoint stored in a wide_row metadata table that doubles as the runtime route table. The default split strategy is EventBucketPartitionSplitStrategy which assigns more event buckets to the same time bucket. "If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas." (Source: this post — read-amplification cap is the load-bearing engineering choice.)

  8. Splits are validated with pre/post checksums; mismatch ⇒ not COMPLETED. "The Planner stores a pre-split checksum of a given partition during the planning phase, while the Splitter computes and stores the post-split checksum. The split status is marked as completed only if the two checksums match." Beyond that, Netflix uses Data Bridge Spark jobs offline to verify split data is an exact match against the original — defence-in-depth correctness validation. (Source: this post.)

  9. Reads transparently route to split partitions via a Bloom-filter gate + metadata lookup. Every TimeSeries server periodically loads partition keys of completed splits into in-memory Bloom filters. Each read consults the filter; on a hit, the server reads the wide_row metadata table (read-through-cached for performance) which encodes both what to read (pre_split_data block: time_slice, time_series_id, time_bucket, event_bucket) and where to read it from (post_split_data block: target time_slice table name like wide_data_20260328_0, plus a strategy block — event_bucket_partition_strategy: target_event_buckets: 2, start_event_bucket: 32). The Bloom filter check costs single-digit microseconds, "making this diversion practically invisible to the callers." (Source: this post — verbatim metadata JSON shown.)

  10. The pre-split (original) partition is never deleted. "The existing wide partition from the original time slice is never deleted. This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency. The slightly larger storage space we use as a result is worth the operational safety we gain." The split table reuses the same schema as the original time-slice table, allowing Netflix to delegate split-table reads to the existing PartitionReader with no code duplication. (Source: this post.)

  11. Phased rollout staged the read-path cutover with a shadow-comparison phase. Three named phases per dataset progress only when the prior mode passes checks; the Comparison phase runs the new and old read paths in parallel and compares bytes served — "a chart of bytes match vs bytes differ in a given shadow period." Only after Comparison stays clean does the dataset advance. This is canonical phased read-mode rollout with byte-level shadow comparison gating advancement. (Source: this post — "Advance through Read modes once previous mode passes checks.")

  12. Headline outcomes. Average read latency on wide partitions dropped "from seconds … to low double-digit milliseconds." Tail latency dropped "from several seconds … to around 200 ms or better." Read timeouts dropped to near zero. Cluster CPU dropped, with "little to no thread queuing." The team paginated and queried 500 MB+ partitions while remaining available — the gRPC SearchEventRecords example trades elevated latency ("time_taken: 41.072410142s") for being available rather than timing out, reproducing SLO-aware partial response semantics on extreme partitions. (Source: this post.)
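The partial-return option from takeaway 4 can be illustrated with a small sketch. This is not Netflix's implementation: the function name `read_with_deadline`, the page shape `(events, next_token)`, and the injectable `clock` are all assumptions made for illustration. It shows only the general idea of returning whatever was collected before the deadline, plus a continuation token.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (events, token to resume after this page)


def read_with_deadline(
    pages: Iterable[Page],
    deadline_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[dict], Optional[str]]:
    """Collect pages until the deadline passes.

    Returns (events, continuation_token). A token of None means the read
    finished in full; otherwise the caller can resume from the token.
    """
    start = clock()
    collected: List[dict] = []
    for events, next_token in pages:
        collected.extend(events)
        if next_token is not None and clock() - start >= deadline_s:
            # Deadline breached with data still unread: return a partial result.
            return collected, next_token
    return collected, None
```

A latency-prioritising client would treat a non-None token as "more data exists" and decide for itself whether to issue a follow-up read.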

Architecture: dynamic-partition-split async pipeline

Detection (on the read path)

Every TimeSeries read tracks bytes-read for the partition. If the threshold is exceeded, the server publishes an event like the following to Kafka:

{
  "time_slice": "data_20260328",     // the Cassandra table this event was detected in
  "time_series_id": "profileId:123",
  "time_bucket": 7,
  "event_bucket": 2,
  "immutable": true,                 // computed by the server
  "version": "0"                     // reserved for future invalidation
}

The detection happens on reads because "the majority of the data in the wild doesn't need this treatment." The cost is that "some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up."
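A minimal sketch of read-side detection, under stated assumptions: the threshold value, the `read_partition` function, and the `publish` callback (a stand-in for a Kafka producer) are hypothetical, since the post gives neither the threshold nor the server code. Only the event fields mirror the payload described in the post.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable

# Hypothetical threshold; the post does not disclose the real value.
WIDE_PARTITION_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class WidePartitionEvent:
    time_slice: str
    time_series_id: str
    time_bucket: int
    event_bucket: int
    immutable: bool
    version: str = "0"  # reserved for future invalidation


def read_partition(
    rows: Iterable[bytes],
    key: dict,
    immutable: bool,
    publish: Callable[[str], None],
    threshold: int = WIDE_PARTITION_BYTES,
) -> int:
    """Consume rows, count bytes read, and report the partition at most once."""
    bytes_read = 0
    reported = False
    for row in rows:
        bytes_read += len(row)
        if not reported and bytes_read > threshold:
            event = WidePartitionEvent(**key, immutable=immutable)
            publish(json.dumps(asdict(event)))  # stand-in for a Kafka send
            reported = True
    return bytes_read
```

Because the check piggybacks on bytes the read already fetched, partitions that are never read, or never grow wide, add no detection cost.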

Planning (single full read of the partition + checkpointing)

A planner consumes the Kafka events, reads the entire wide partition once to compute an accurate split plan (planner reads can be resumed from the last saved checkpoint), and writes plan + split metadata into the wide_row table. Pre-split checksum is stored at this stage.
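The resumable planning scan might look roughly like the sketch below. Everything here is an assumption for illustration: the `Checkpoint` fields, the page-indexed `fetch_page` callback, the dict standing in for the wide_row table, and the additive SHA-256-based checksum (the post does not disclose the checksum algorithm).

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass
class Checkpoint:
    next_page: int = 0
    rows: int = 0
    total_bytes: int = 0
    checksum: int = 0  # running pre-split checksum


def row_digest(row: bytes) -> int:
    return int.from_bytes(hashlib.sha256(row).digest()[:8], "big")


def plan(
    fetch_page: Callable[[int], Optional[List[bytes]]],
    store: Dict[str, Checkpoint],
    key: str,
) -> Checkpoint:
    """Scan a partition page by page, saving progress after every page.

    If fetch_page raises midway, calling plan() again resumes from the
    last saved checkpoint instead of re-reading the whole partition.
    """
    cp = replace(store.get(key, Checkpoint()))  # work on a copy
    while True:
        page = fetch_page(cp.next_page)
        if page is None:  # end of partition
            return cp
        for row in page:
            cp.rows += 1
            cp.total_bytes += len(row)
            # Addition mod 2**64 is order-independent, so the Splitter can
            # recompute the same value over rows in any order.
            cp.checksum = (cp.checksum + row_digest(row)) % 2**64
        cp.next_page += 1
        store[key] = replace(cp)  # persist a snapshot
```

The totals gathered during the scan (row count, byte size) are the inputs a split plan would need; the checksum is what the Splitter later compares against.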

Splitting (write to a separately-named time-slice table)

The default split strategy is EventBucketPartitionSplitStrategy: "we split the partition by assigning more event buckets to the same time bucket." For ultra-wide partitions, the number of event buckets is capped to control the resulting read amplification: splits still spread the load across multiple Cassandra replicas, but a single read won't fan out to an unbounded number of partitions. The post-split checksum is computed and matched against the pre-split checksum; status flips to COMPLETED only on match. The new partitions live in a new table (e.g. wide_data_20260328_0), distinct from the original data_20260328. The original partition is never deleted.
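The capped split and the checksum gate can be sketched as follows. The target size, the cap of 8, the hash-based assignment of rows to event buckets, and the "FAILED" status name are all assumptions; the post discloses neither the cap nor how events are assigned, and names only the COMPLETED status.

```python
import hashlib
import math
from typing import Dict, Iterable, List

TARGET_PARTITION_BYTES = 10 * 1024 * 1024  # assumed per-partition target
MAX_EVENT_BUCKETS = 8                      # assumed cap on read fan-out


def target_event_buckets(partition_bytes: int) -> int:
    """Buckets needed to reach the target size, capped to bound amplification."""
    wanted = math.ceil(partition_bytes / TARGET_PARTITION_BYTES)
    return max(1, min(wanted, MAX_EVENT_BUCKETS))


def digest(row: bytes) -> int:
    return int.from_bytes(hashlib.sha256(row).digest()[:8], "big")


def checksum(rows: Iterable[bytes]) -> int:
    """Order-independent checksum: sum of row digests mod 2**64."""
    return sum(digest(r) for r in rows) % 2**64


def split(rows: List[bytes], n_buckets: int, start_event_bucket: int) -> Dict[int, List[bytes]]:
    """Spread rows over n_buckets consecutive event buckets."""
    out: Dict[int, List[bytes]] = {start_event_bucket + i: [] for i in range(n_buckets)}
    for row in rows:
        out[start_event_bucket + digest(row) % n_buckets].append(row)
    return out


def split_and_validate(rows: List[bytes], partition_bytes: int, start_event_bucket: int) -> str:
    pre = checksum(rows)  # in the post, the Planner stores this value
    buckets = split(rows, target_event_buckets(partition_bytes), start_event_bucket)
    post = checksum(r for bucket in buckets.values() for r in bucket)
    return "COMPLETED" if pre == post else "FAILED"
```

With these assumed numbers, a 500 MiB partition would want 50 buckets but gets 8, so each resulting partition is still large, yet the read load is spread without unbounded fan-out.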

Serving reads (Bloom-filter gate + metadata table + fallback)

read(ts_id, time_range) →
  Bloom-filter check on (ts_id, time_bucket, event_bucket)
  ├── miss ──→ original PartitionReader on data_20260328
  └── hit ──→ wide_row metadata read-through-cache lookup
              ↓ returns pre_split + post_split blocks
              dispatch to existing PartitionReader on wide_data_20260328_0
                        (same schema as original)

wide_row metadata payload (verbatim from the post):

{
  "pre_split_data": {
    "time_slice": "data_20260328",
    "time_series_id": "6313825",     // What to read
    "time_bucket": 0,
    "event_bucket": 2
  },
  "post_split_data": {
    "time_slice": "wide_data_20260328_0",   // Where to read it from
    "event_bucket_partition_strategy": {    // Strategy to delegate to
      "target_event_buckets": 2,
      "start_event_bucket": 32              // How should the strategy read it
    }
  }
}
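A minimal sketch of the read-path divert, assuming a hand-rolled Bloom filter and a dict standing in for the cached wide_row table. The `BloomFilter` class, the `route` function, and the reading of start_event_bucket plus target_event_buckets as a contiguous bucket range are all assumptions; the post does not specify filter parameters or how the strategy block is interpreted.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple

# (time_slice, time_series_id, time_bucket, event_bucket)
PartitionKey = Tuple[str, str, int, int]


class BloomFilter:
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key: PartitionKey) -> Iterator[int]:
        data = repr(key).encode()
        for i in range(self.n_hashes):
            h = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.n_bits

    def add(self, key: PartitionKey) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: PartitionKey) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def route(key: PartitionKey, completed: BloomFilter,
          wide_row: Dict[PartitionKey, dict]) -> List[Tuple[str, int]]:
    """Return the (table, event_bucket) pairs a read should fetch."""
    time_slice, _, _, event_bucket = key
    original = [(time_slice, event_bucket)]
    if not completed.might_contain(key):
        return original  # common case: partition was never split
    meta = wide_row.get(key)
    if meta is None:
        return original  # Bloom false positive: fall back to the original
    post = meta["post_split_data"]
    strategy = post["event_bucket_partition_strategy"]
    start = strategy["start_event_bucket"]
    return [(post["time_slice"], start + i)
            for i in range(strategy["target_event_buckets"])]
```

Falling back to the original table on a false positive or a missing metadata row is safe only because the original wide partition is never deleted.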

Architecture: table-level auto-tuning control loop

Background worker (DynamicTimeSliceConfigWorker):

  1. Polls Cassandra virtual tables that expose the nodetool tablehistograms percentile distribution per table.
  2. Computes a partition density adjustment factor when the observed distribution deviates from the configured target (typically 2–10 MiB).
  3. Proposes a new partitioning config for the next Time Slice:
    DynamicTimeSliceConfigWorker:
      namespace: my_dataset_1
      Observed: TimeSlices have p99 partitions below configured target of 10MB.
      Proposed: time_bucket interval: 60s -> 604800s
    
  4. Writes the config so future slices use it. Past slices are unaffected.

This fixes concepts/over-partitioning (the 60s → < 10 KB partition example) as well as under-partitioning. Pairs with patterns/bucketed-event-time-partitioning — the auto-tuner is the control loop that picks the bucket parameters dynamically.
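One way to picture the density adjustment is the sketch below. It rests on assumptions the post does not confirm: that partition size scales roughly linearly with the time-bucket interval, that the tuner aims for the midpoint of the 2–10 MiB band, and that proposals are clamped between 60 seconds and 7 days. The function name `propose_time_bucket` is hypothetical, and the sketch does not attempt to reproduce the worker's actual output.

```python
from typing import Optional

MIB = 1024 * 1024
TARGET_MIN_BYTES = 2 * MIB
TARGET_MAX_BYTES = 10 * MIB
TARGET_MID_BYTES = (TARGET_MIN_BYTES + TARGET_MAX_BYTES) // 2


def propose_time_bucket(
    current_interval_s: int,
    p99_partition_bytes: int,
    min_interval_s: int = 60,
    max_interval_s: int = 7 * 24 * 3600,
) -> Optional[int]:
    """Propose a time_bucket interval for the next Time Slice.

    Returns None when the observed p99 is already inside the target band.
    """
    if TARGET_MIN_BYTES <= p99_partition_bytes <= TARGET_MAX_BYTES:
        return None
    # Assumed linear model: scale the interval by target / observed size.
    # Integer arithmetic avoids float rounding in the proposal.
    proposed = current_interval_s * TARGET_MID_BYTES // max(p99_partition_bytes, 1)
    return max(min_interval_s, min(proposed, max_interval_s))
```

The same function handles both directions: tiny partitions widen the bucket (up to the 7-day clamp), while oversized partitions shrink it. Only the next slice is affected, which matches the post's statement that past slices are not rewritten.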

Operational numbers

Metric                              | Before                                   | After dynamic split
Wide-partition average read latency | Seconds                                  | Low double-digit ms
Wide-partition tail latency         | Several seconds                          | ~200 ms or better
Read timeouts                       | Steady stream                            | Near zero
Cluster CPU                         | High                                     | Low
Thread queuing                      | Present                                  | Little to none
Maximum paginated partition size    | Constant timeouts / unavailability blips | 500 MB+ partition queryable in 41 seconds while available

Configured partition density target: 2 MiB to 10 MiB depending on workload (table-level auto-tuner target band).

Bloom-filter check latency for read diversion: single-digit microseconds or better, "making this diversion practically invisible to the callers."

Caveats

  • Mutable partitions are not handled in this iteration. "There is more work planned around this feature, like splitting mutable wide partitions, or re-processing previously failed splits." The post explicitly defers mutable-partition splits as future work. Splitting mutable partitions is "inherently more complex" — the design choice is to reduce surface area first.
  • Detection lag is real. Detection happens on reads, so "some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up." Worst-case readers eat the wide-partition latency a few times before the split catches up.
  • Storage is duplicated for safety. Original wide partition is preserved as fallback indefinitely. "The slightly larger storage space we use as a result is worth the operational safety we gain." Quantitative storage overhead is not disclosed.
  • Read amplification is the load-bearing trade-off. Splitting one partition into N event-bucket partitions increases per-read fan-out by N×. The post caps N for ultra-wide partitions to prevent runaway amplification, but does not disclose the exact cap, the heuristic, or the workload characteristics that drove the choice.
  • Checksum strategy details are not disclosed: the algorithm, the sample size, and whether the checksum covers row data only or also tombstones / timestamps / TTLs are not specified. Defence-in-depth is provided by Spark verification jobs run via Data Bridge, which this post also does not describe in depth (a 2026 "Data Bridge: How Netflix Simplifies Data Movement" post is linked but not summarised here).
  • Bloom-filter sizing is monitored but not parameterised. The post says "the size of the Bloom filters is monitored to ensure we have enough memory per server" and asserts "the filters fit comfortably in each server instance" but does not disclose target false-positive rate, hash count, or bit-array size.
  • No discussion of cross-region / replication implications. TimeSeries runs across multiple regions on Cassandra; the split's relationship to multi-region replication, repair (nodetool repair against split tables), and disaster-recovery semantics is not covered.
  • No public capacity-planning model link. The Monte-Carlo provisioning code is referenced via a github.com link but the modelling assumptions and accuracy disclosure are deferred to a linked AWS re:Invent talk by an unnamed colleague.
  • Block-IDs is described but not the operator UX around it. Manual list management implies an operator workflow that's not described — how do you find the right TSIDs to block? How are they un-blocked?
