PATTERN Cited by 1 source

Keep original partition as fallback during split

Definition

The discipline of never deleting the source partition after a successful split — leaving the original wide partition as a passive fallback that the read path can divert to when:

  • The split metadata is unavailable (e.g. metadata-table outage).
  • A split has been marked COMPLETED but that state has not yet propagated to every reader (eventual consistency).
  • A bug is discovered in the split / divert pipeline and an operator wants to roll back to original-only reads.
  • A completed split is later invalidated (e.g. the partition becomes mutable again; see concepts/immutable-partition).

The trade is storage cost for operational safety. The pattern is named explicitly in Netflix TimeSeries' 2026-06-03 dynamic-partition-splitting disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).

Verbatim from the source

"The existing wide partition from the original time slice is never deleted. This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency. The slightly larger storage space we use as a result is worth the operational safety we gain."

The architectural property is load-bearing, not incidental:

  • The split table has the same schema as the original — both can be read by the same PartitionReader class.
  • The Bloom-filter gate on the read path is a divert mechanism, not a hard cutover — the original is still present and reachable.
  • The wide_row metadata table records both pre_split_data (original location) and post_split_data (split location) — by retaining both, the system can fall back at read time without losing track of the source.
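The three properties above can be sketched as a small read router. This is a minimal illustration, not the source's implementation: the ReadRouter class, its method names, and the location strings are hypothetical, a plain set stands in for the Bloom filter, and a dict stands in for the wide_row metadata table. Only the pre_split_data and post_split_data field names come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class SplitMetadata:
    pre_split_data: str   # location of the original wide partition
    post_split_data: str  # location of the split data


class ReadRouter:
    """Hypothetical read path: divert to the split only when it is known."""

    def __init__(
        self,
        completed_keys: Set[str],                     # stand-in for the Bloom filter
        metadata: Dict[str, SplitMetadata],           # stand-in for the wide_row table
        read_partition: Callable[[str], List[tuple]]  # one reader for both locations
    ) -> None:
        self.completed_keys = completed_keys
        self.metadata = metadata
        self.read_partition = read_partition

    def locate(self, partition_key: str, original_location: str) -> str:
        # Gate: keys the filter has not seen always go to the original.
        if partition_key not in self.completed_keys:
            return original_location
        meta = self.metadata.get(partition_key)
        # Filter hit without visible metadata: fall back to the original.
        if meta is None:
            return original_location
        return meta.post_split_data

    def read(self, partition_key: str, original_location: str) -> List[tuple]:
        return self.read_partition(self.locate(partition_key, original_location))


storage = {
    "slice_1/p42": [(1, "a"), (2, "b")],        # original, never deleted
    "slice_1_split/p42": [(1, "a"), (2, "b")],  # same rows, same schema
}
router = ReadRouter(
    completed_keys={"p42"},
    metadata={"p42": SplitMetadata("slice_1/p42", "slice_1_split/p42")},
    read_partition=storage.__getitem__,
)
```

Because the original is still present, every branch of locate returns a location that can actually be read; deleting the original would turn the two fallback branches into read failures.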

Why fallback matters here

Three classes of failure that this pattern guards against:

1. Eventual-consistency window

The Bloom filter loads partition keys of completed splits periodically — there is a window where a split is COMPLETED but the Bloom filter on a particular server hasn't reloaded yet. During this window, reads on that server route to the original partition. The original must exist for those reads to return correct data.

Even after the Bloom filter has reloaded, the metadata-table lookup goes through a read-through cache, so there is also a brief window in which the cache has not yet seen the metadata. The protection is the same: the original must exist.
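The reload window can be made concrete with a toy model. This is a sketch under stated assumptions: PeriodicKeyFilter and route are hypothetical names, and a frozen set snapshot stands in for a per-server Bloom filter (so, unlike a real Bloom filter, it has no false positives).

```python
from typing import Set


class PeriodicKeyFilter:
    """Per-server snapshot of completed-split keys, refreshed only on reload()."""

    def __init__(self, source: Set[str]) -> None:
        self._source = source              # shared, authoritative set of keys
        self._snapshot = frozenset(source)

    def reload(self) -> None:
        # In a real server this would run on a timer; here it is called by hand.
        self._snapshot = frozenset(self._source)

    def might_contain(self, key: str) -> bool:
        return key in self._snapshot


def route(key_filter: PeriodicKeyFilter, key: str) -> str:
    """Return which copy a read for `key` is sent to on this server."""
    return "split" if key_filter.might_contain(key) else "original"


completed: Set[str] = set()
server_filter = PeriodicKeyFilter(completed)

completed.add("p42")                        # split is marked COMPLETED
before_reload = route(server_filter, "p42")  # server has not reloaded yet
server_filter.reload()
after_reload = route(server_filter, "p42")
```

Between the split completing and the reload, the server still routes p42 to the original, which is only correct if the original still holds the data.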

2. Partial-failure scenarios

The split pipeline writes to a separate time-slice table. That table's writes can be:

  • Partially written if the splitter crashes mid-write (recoverable from checkpoint, but during recovery reads should fall back).
  • Lost in a region failover (depending on Cassandra topology) — the split table may be lagging in a region.
  • Affected by a consistency-level mismatch (e.g. split table writes done at LOCAL_QUORUM, original at QUORUM).

In all these cases, falling back to the original partition is a strictly safer choice than failing the read.
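A minimal sketch of that choice, assuming hypothetical names throughout: only the COMPLETED state appears in the source, while IN_PROGRESS, SplitReadError, and read_with_fallback are invented for illustration.

```python
from enum import Enum
from typing import Callable, List, Tuple


class SplitState(Enum):
    IN_PROGRESS = "IN_PROGRESS"  # hypothetical state, e.g. splitter recovering
    COMPLETED = "COMPLETED"


class SplitReadError(Exception):
    """Hypothetical error raised when the split table cannot serve a read."""


def read_with_fallback(
    state: SplitState,
    read_split: Callable[[], List[tuple]],
    read_original: Callable[[], List[tuple]],
) -> Tuple[List[tuple], str]:
    """Return (rows, source), preferring the split but never failing on it."""
    if state is not SplitState.COMPLETED:
        # Split may be partially written; do not even attempt it.
        return read_original(), "original"
    try:
        return read_split(), "split"
    except SplitReadError:
        # Split table lagging or unreachable: the original is still there.
        return read_original(), "original"
```

The fallback branch is only as safe as the original is complete, which is why the pattern pairs it with never deleting the source partition.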

3. Bug-tolerant rollback

If a bug is discovered in the splitter (e.g. a clustering-column ordering mismatch that produces wrong data while the pre- and post-split checksums still match, defeating the checksum gate), the operator can disable Bloom-filter loading and the system reverts to original-partition-only reads. Without the original retained, this rollback is impossible.

Composition with checksum validation

The fallback property and the pre/post checksum gate are complementary defences:

| Defence | What it catches | When it fires |
| --- | --- | --- |
| Pre/post checksum | Splitter logic bugs (lost / duplicated rows) | At splitting time, before split is COMPLETED |
| Original-as-fallback | Eventual-consistency windows, partial-failure, post-COMPLETED bugs | At read time, on every read |
| Spark offline verify | Subtle correctness bugs that a checksum hash collision could mask | Hours / days after split |
| Shadow comparison | Read-path implementation bugs | During phased rollout per dataset |

The four together provide defence-in-depth against what the post explicitly calls a "disastrous" failure: serving incorrect reads.
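To illustrate why the checksum gate alone is not enough, here is a sketch of one possible pre/post checksum. The source does not describe its checksum algorithm; the order-insensitive sum of per-row SHA-256 digests below, and the names row_digest, partition_checksum, and split_passes_gate, are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, List

MOD = 1 << 64


def row_digest(row: tuple) -> int:
    """64-bit digest of one row, taken from the front of its SHA-256 hash."""
    data = repr(row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def partition_checksum(rows: Iterable[tuple]) -> int:
    """Order-insensitive checksum: sum of row digests modulo 2**64."""
    total = 0
    for row in rows:
        total = (total + row_digest(row)) % MOD
    return total


def split_passes_gate(original: List[tuple], parts: List[List[tuple]]) -> bool:
    """Compare the original's checksum with the combined checksum of the splits."""
    post = 0
    for part in parts:
        post = (post + partition_checksum(part)) % MOD
    return partition_checksum(original) == post


original = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
good = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
lost = [[(1, "a"), (2, "b")], [(3, "c")]]
duplicated = [[(1, "a"), (2, "b")], [(2, "b"), (3, "c"), (4, "d")]]
misordered = [[(2, "b"), (1, "a")], [(4, "d"), (3, "c")]]
```

With this particular checksum, the lost and duplicated splits fail the gate, while the misordered split passes it because the sum ignores row order. A bug of that kind is only recoverable through the later defences, and rollback is only possible because the original is retained.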

Trade-offs

| Pro | Con |
| --- | --- |
| Bug-tolerant rollback at any time | Permanent storage overhead: original never reclaimed |
| Eventual-consistency-tolerant reads | Storage overhead grows with split count |
| Same-schema split table → fallback uses existing reader code | Long-term, the fallback grows stale (the original is the pre-split shape) |
| Compatible with mutable-partition splits when those ship | Requires read path to know about both pre and post locations |
| Compatible with re-processing failed splits | Fallback reads are correct but hit the original wide partition, so the slow reads return |

When to retire the original

The post does not specify a retirement policy. Reasonable retirement-window candidates:

  • Time-bounded retention. If the original lives in a Time Slice with a retention TTL, it ages out naturally. Until then, fallback is available.
  • Confidence-window retention. Keep the original for N weeks after the split is COMPLETED, then run a repair and drop the original. The window gives operators time to catch silent bugs.
  • Never retire. Storage is cheap; operational safety is not.

The Netflix TimeSeries shape implies the time-bounded retention path: each Time Slice has a retention TTL, so the original eventually ages out without explicit deletion.
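The first two candidates can be compared with a small calculation. This is a sketch only: the source specifies no retirement policy, and the function names, parameters, and the choice to take the earlier of the two deadlines are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retirement_time(
    ttl_expiry: datetime,
    completed_at: Optional[datetime] = None,
    confidence_window: Optional[timedelta] = None,
) -> datetime:
    """Earliest moment the original stops being available as a fallback."""
    if completed_at is None or confidence_window is None:
        # Time-bounded retention only: the slice's TTL decides.
        return ttl_expiry
    # Confidence window, capped by the TTL that would expire the data anyway.
    return min(ttl_expiry, completed_at + confidence_window)


def fallback_available(now: datetime, retire_at: datetime) -> bool:
    return now < retire_at


utc = timezone.utc
ttl_expiry = datetime(2026, 4, 1, tzinfo=utc)
completed_at = datetime(2026, 1, 10, tzinfo=utc)

ttl_only = retirement_time(ttl_expiry)
with_window = retirement_time(ttl_expiry, completed_at, timedelta(weeks=4))
```

Here the TTL-only policy keeps the fallback until 2026-04-01, while a four-week confidence window would retire it on 2026-02-07; the "never retire" option corresponds to having no retirement time at all.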

Sibling patterns

  • patterns/shadow-table-online-schema-change — same shape applied to schema migration: the new and old shapes both exist and the read path can fall back.
  • Blue/green deployment — the deprecated colour is retained for fast-rollback during a migration window.
  • patterns/shadow-then-reverse-shadow-migration — the source is retained as the canonical store while the target is shadowed and validated.
  • CRDT tombstones (concepts/tombstone): same shape applied at the row level. The deletion marker is retained until it is safe to garbage-collect, rather than discarded as soon as the delete is applied.

The shared discipline: the cheapest correctness mechanism in distributed systems is to not throw away the source until you're certain you don't need it.

Seen in