PATTERN Cited by 1 source

Phased rollout of read mode

Definition

A read-path migration discipline that introduces multiple named read modes (e.g. OFF / SHADOW / COMPARISON / EXEC / ON), advances one dataset (or namespace) at a time through these modes, and only allows advancement when the previous mode passes its checks. Each mode is a defined configuration of which read path executes, which path is shadowed, and what comparison metrics gate the transition.

Distinct from generic feature-flag rollout in two ways:

  1. Multiple intermediate modes, each with a specific validation purpose (the SHADOW mode validates correctness; COMPARISON sustains the validation; EXEC tests latency under real serving; etc.).
  2. Gated advancement, where the metrics from the prior mode must be clean before the next mode is enabled.
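
The gating rule can be sketched as a tiny state machine. This is a minimal illustration, not the mechanism from the source: the `ReadMode` enum and the `next_mode` helper are hypothetical names, and `checks_passed` stands in for whatever metrics the current mode is judged on.

```python
from enum import IntEnum


class ReadMode(IntEnum):
    """Hypothetical ordered read modes; the integer value is the rollout position."""
    OFF = 0
    SHADOW = 1
    COMPARISON = 2
    EXEC = 3
    ON = 4


def next_mode(current: ReadMode, checks_passed: bool) -> ReadMode:
    """Advance by exactly one mode, and only when the current mode's checks are clean."""
    if not checks_passed or current is ReadMode.ON:
        return current  # stay put: either the gate failed or ON is terminal
    return ReadMode(current + 1)
```

Because advancement is always a single step, a dataset cannot jump from SHADOW straight to ON; every intermediate mode has to pass its own gate first.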

Canonicalised on the wiki by Netflix's TimeSeries Abstraction in the 2026-06-03 dynamic-partition-splitting disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).

The Netflix instantiation

"Implementing a phased rollout strategy to safely advance through stages as our confidence in the system grew."

The post explicitly highlights the Comparison phase as load-bearing: "a chart of bytes match vs bytes differ in a given shadow period" is the gate that determines whether a dataset advances. The full mode progression is implicit in the architecture (OFF → SHADOW with byte comparison → EXEC, where the new path serves with the old path still available as fallback → ON, when the fallback is no longer wired in), with each transition requiring sustained green metrics.
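
A byte-comparison gate of this kind might be tracked roughly as follows. This is a sketch under stated assumptions: `ShadowStats`, its thresholds, and the way differing bytes are counted (the larger of the two payload sizes) are all hypothetical, since the post only describes a chart of bytes match vs bytes differ.

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    """Hypothetical per-dataset counters for one shadow period."""
    bytes_match: int = 0
    bytes_differ: int = 0

    def record(self, old_payload: bytes, new_payload: bytes) -> None:
        # Assumption: a response either matches byte-for-byte or counts as differing.
        if old_payload == new_payload:
            self.bytes_match += len(old_payload)
        else:
            self.bytes_differ += max(len(old_payload), len(new_payload))

    def gate_passes(self, min_bytes: int = 1, max_differ_ratio: float = 0.0) -> bool:
        """Pass only with enough compared traffic and a differ ratio within the limit."""
        total = self.bytes_match + self.bytes_differ
        if total == 0 or total < min_bytes:
            return False  # no evidence yet is not the same as a clean result
        return self.bytes_differ / total <= max_differ_ratio
```

Requiring a minimum volume of compared bytes keeps an idle dataset from advancing on the strength of an empty chart.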

The rollout proceeds per dataset / per namespace rather than fleet-wide all at once — confidence is established on lower-risk datasets first, then propagated to higher-risk ones.

Why phase the rollout at all

The dynamic-partition-splitting feature has three properties that make a phased rollout structurally necessary:

  1. High blast radius — incorrect reads on TimeSeries data could affect downstream Counter aggregations, multi-region replicated state, etc.
  2. Per-dataset variability — different datasets have different access patterns, partition shapes, and failure modes; one might pass in shadow mode while another stresses an unhandled corner case.
  3. Hard to test exhaustively offline — partition-splitting outcomes depend on production read patterns + Cassandra cluster state + replication topology.

Phased rollout converts the question "will this work in production?" from a single bet into a sequence of progressively more aggressive bets, each gated by metrics from the prior one.

Mode definitions

A typical instantiation:

| Mode | What runs | What's compared | What advances |
| --- | --- | --- | --- |
| OFF | Old read path only | nothing | manual after testing |
| SHADOW | Both paths run; old returned to caller | bytes A vs bytes B | sustained match → COMPARISON |
| COMPARISON | Both paths run; old returned to caller | sustained match across full traffic profile | matches across analytics + peak + interactive → EXEC |
| EXEC | New path returned to caller; old retained as fallback | old path also runs as fallback for failures | clean SLO + fallback-rate metrics → ON |
| ON | New path only | nothing | (terminal; fallback could be re-enabled if needed) |

The post does not enumerate this exact set of modes by name (it only mentions Shadow / Comparison / Read modes), but the structural progression is implicit in the description.
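
The "what runs" behaviour of each mode can be sketched as a single dispatch function. Everything here is illustrative: `ReadMode`, `read`, the `Reader` signature, and the metric names are assumptions, not the TimeSeries Abstraction's API.

```python
from collections import Counter
from enum import Enum
from typing import Callable


class ReadMode(Enum):
    OFF = "OFF"
    SHADOW = "SHADOW"
    COMPARISON = "COMPARISON"
    EXEC = "EXEC"
    ON = "ON"


Reader = Callable[[str], bytes]  # hypothetical read-path signature: key -> payload


def read(mode: ReadMode, key: str, old_read: Reader, new_read: Reader,
         metrics: Counter) -> bytes:
    """Serve one read according to the dataset's current mode."""
    if mode is ReadMode.OFF:
        return old_read(key)
    if mode in (ReadMode.SHADOW, ReadMode.COMPARISON):
        old = old_read(key)
        try:
            matched = new_read(key) == old
        except Exception:
            matched = False  # a failing shadow read must never reach the caller
        metrics["match" if matched else "differ"] += 1
        return old  # the caller always gets the old path's answer
    if mode is ReadMode.EXEC:
        try:
            return new_read(key)
        except Exception:
            metrics["fallback"] += 1  # feeds the fallback-rate gate for ON
            return old_read(key)
    return new_read(key)  # ON: new path only, no fallback wired in
```

In this sketch SHADOW and COMPARISON share one code path and differ only in how long, and across which traffic, the match counters must stay clean before the dataset advances.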

Why per-dataset rather than fleet-wide

Each dataset has a different:

  • Workload profile (read-heavy, write-heavy, range-query-heavy).
  • Wide-partition rate (some datasets have many wide partitions, others have none).
  • Tolerance for incorrect reads (some downstreams aggregate, others audit).
  • Operational bandwidth (some teams have on-call coverage, others don't).

Per-dataset rollout lets the team:

  • Start with low-risk datasets (small reader population, clear correctness requirements).
  • Build confidence, and operational experience, dataset by dataset.
  • Roll back per-dataset on any anomaly, without affecting other datasets.

This is canonical phased migration with soak times applied at the namespace level.
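
Per-dataset isolation can be sketched as one mode entry per dataset, so advancing or rolling back one entry leaves the rest untouched. `DatasetRollout`, the dataset names in the usage lines, and the choice to roll back all the way to OFF are hypothetical; the source does not say which mode a rollback returns to.

```python
MODES = ["OFF", "SHADOW", "COMPARISON", "EXEC", "ON"]


class DatasetRollout:
    """Hypothetical registry holding one independent read mode per dataset."""

    def __init__(self, datasets_by_risk: list[str]) -> None:
        # Listed lowest-risk first; every dataset starts on the old path only.
        self.modes = {name: "OFF" for name in datasets_by_risk}

    def advance(self, dataset: str) -> str:
        """Move one dataset a single mode forward (ON is terminal)."""
        idx = MODES.index(self.modes[dataset])
        self.modes[dataset] = MODES[min(idx + 1, len(MODES) - 1)]
        return self.modes[dataset]

    def roll_back(self, dataset: str) -> None:
        """Assumption: an anomaly sends only this dataset back to the old path."""
        self.modes[dataset] = "OFF"


rollout = DatasetRollout(["low_risk_ds", "high_risk_ds"])
rollout.advance("low_risk_ds")    # low_risk_ds -> SHADOW
rollout.roll_back("low_risk_ds")  # high_risk_ds is unaffected
```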

Trade-offs

| Pro | Con |
| --- | --- |
| Bug-tolerant: failures in one phase don't propagate fleet-wide | Slower fleet-wide deployment than a fleet-wide feature-flag flip |
| Composable with byte comparison for correctness gating | Mode plumbing must be threaded through read API and config |
| Per-dataset cadence matches per-dataset risk profile | Operator overhead per advancement decision |
| Shadow / EXEC modes dual-run both paths, so the new path is validated against real traffic | Cost of dual-path execution during phases |
| Fallback-on-EXEC keeps safety even after cutover | More moving parts in production |
| Confidence builds across datasets | Earliest-rolled-out datasets get longer baking; latest get shorter |

When NOT to use

  • Pure config-only changes that can be flipped instantly with no correctness implications.
  • Datasets with no fallback path — phased rollout requires a working old path during the phase window.
  • Operations with low blast radius — the ceremony of mode plumbing isn't worth it for small-impact changes.
