PATTERN

Subscriber switchover (per-consumer migration cutover)

Subscriber switchover is the cutover pattern where consumers of a dual-running pipeline are moved from the old engine's output to the new engine's output one at a time, with a per-consumer rollback lever. It is the endgame of a shadow migration (patterns/shadow-migration): shadow buys you confidence; switchover cashes it in.

Shape

  1. Keep old and new engines running in parallel ("shadow").
  2. For each consumer (subscriber, table subscriber, downstream service, etc.), provide a routing switch that picks the old or new engine's output.
  3. Move consumers one at a time:
     • Pick a consumer.
     • Flip its switch to the new engine's output.
     • Monitor.
     • If anything regresses, flip the switch back.
  4. Only once a consumer has been stable on the new output for a defined window, lock it in.
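The shape above can be sketched in a few lines. Everything here is illustrative — `Router`, `switch_over`, and the health-check callback are hypothetical names, not from any real migration tooling; the point is the per-consumer switch plus rollback lever:

```python
class Router:
    """Per-consumer indirection layer: each consumer reads whichever
    engine's output its switch currently points at."""

    def __init__(self, consumers):
        # Every consumer starts on the old engine's output.
        self.route = {c: "old" for c in consumers}
        self.locked = set()

    def flip(self, consumer, engine):
        assert consumer not in self.locked, f"{consumer} is already locked in"
        self.route[consumer] = engine

    def lock_in(self, consumer):
        # Only lock in a consumer that is actually on the new output.
        assert self.route[consumer] == "new"
        self.locked.add(consumer)


def switch_over(router, consumer, is_healthy, bake_checks=3):
    """Flip one consumer to the new output; roll back on the first
    unhealthy check, lock in after `bake_checks` healthy ones."""
    router.flip(consumer, "new")
    for _ in range(bake_checks):
        if not is_healthy(consumer):
            router.flip(consumer, "old")  # the per-consumer rollback lever
            return False
    router.lock_in(consumer)
    return True
```

A failed `switch_over` leaves only that one consumer back on the old output; every other consumer's state is untouched, which is the whole point of the pattern.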

Why per-consumer, not in-place overwrite

The naive cutover is to have the new engine overwrite the old engine's output in place. The risk: if the new engine has a subtle bug, all consumers see corrupt data simultaneously — and the old output is gone. There is no rollback path except re-running the old engine.

Per-consumer switchover decouples correctness for consumer N from correctness for consumer M. Each consumer's migration is an independent bet with an independent rollback.

From Amazon BDT's Spark → Ray migration:

"To help avoid catastrophic failure, BDT built another service that let them dynamically switch individual table subscribers over from consuming the Apache Spark compactor's output to Ray's output. So, instead of just having Ray overwrite the Apache Spark compactor's output in-place and hoping all goes well for thousands of table subscribers, they can move subscribers over from consuming Apache Spark's output to Ray's output one at a time while maintaining the freedom to reverse course at the first sign of trouble." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Ordering heuristics

  • Low-blast-radius first. Start with a consumer whose failure is visible but not catastrophic — an internal dashboard, a low-SLA analytics job.
  • Representative coverage. Move a few consumers of each workload shape so you discover per-shape issues before scaling out.
  • High-confidence consumers that "look like" many others. Pick a small subset whose shape mirrors a large class; a successful migration of that small subset predicts the whole class.
  • Leave high-value consumers until last. Business-critical, user-facing, regulated consumers move after the shape is well-characterised.
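The heuristics above compose naturally: sort by blast radius within each workload shape, then round-robin across shapes so every shape gets early coverage and high-value consumers land last. A hedged sketch — the consumer records, shape labels, and `blast_radius` scores are invented for illustration:

```python
from itertools import chain, zip_longest

consumers = [
    {"name": "internal-dashboard", "shape": "batch",     "blast_radius": 1},
    {"name": "analytics-job",      "shape": "streaming", "blast_radius": 2},
    {"name": "billing-feed",       "shape": "batch",     "blast_radius": 9},
    {"name": "user-facing-api",    "shape": "streaming", "blast_radius": 10},
]

def migration_order(consumers):
    # Group by workload shape, cheapest-to-break first within each shape.
    by_shape = {}
    for c in sorted(consumers, key=lambda c: c["blast_radius"]):
        by_shape.setdefault(c["shape"], []).append(c)
    # Round-robin across shapes: representative coverage early, so
    # per-shape issues surface before the high-value consumers move.
    rounds = zip_longest(*by_shape.values())
    return [c["name"] for c in chain.from_iterable(rounds) if c is not None]
```

In practice the scoring is judgment, not a number in a table; the sketch just shows that "low blast radius first" and "representative coverage" pull in different directions and have to be interleaved deliberately.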

Prerequisites

  • Dual output must exist. Without shadow migration having produced both outputs in parallel, there is nothing to switch between.
  • Routing indirection. Consumers must read through an indirection layer that can be flipped without consumer-side code changes.
  • Observability per consumer. You need to know a consumer regressed; global metrics don't catch per-subscriber bugs.
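One way to satisfy the routing-indirection prerequisite is to make the route data, not code: consumers resolve their input location through an operator-controlled pointer record. All paths and names below are illustrative placeholders:

```python
# Two parallel outputs produced by the shadow phase (hypothetical paths).
OUTPUTS = {
    "old": "warehouse/table_x/compacted-old/",
    "new": "warehouse/table_x/compacted-new/",
}

# Operator-controlled routing record; flipping a consumer is a one-field
# change here, with no consumer-side code change or redeploy.
routing = {"dashboard": "old", "billing": "old"}

def input_path(consumer):
    """Consumers call this instead of hard-coding a path."""
    return OUTPUTS[routing[consumer]]
```

Because the flip and the rollback are both single-field writes, the rollback lever is exactly as cheap as the cutover — a property the in-place-overwrite approach cannot offer.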

Seen in
