
Dual-write branch migration

Dual-write branch migration is the technique of migrating a sharded database to a new cluster (usually with a different shard count) by provisioning the destination cluster empty, dual-writing into it alongside the source for the retention window, and cutting reads over once the destination has accumulated a full retention window of data and been independently validated.

The property it exploits: if the system only retains data for N days, the destination cluster catches up to equivalence with the source after exactly N days of dual-writes — no historical backfill step required.
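The aging-in property can be checked with a toy simulation — a minimal sketch, assuming a store that drops events older than `RETENTION_DAYS`; all names and day numbers are illustrative, not Insights' actual schema:

```python
RETENTION_DAYS = 8      # Insights retains eight days of data
DUAL_WRITE_START = 10   # day the destination starts receiving writes

def retained(store, today):
    """Events still inside the retention window as of `today`."""
    return [e for e in store if e > today - RETENTION_DAYS]

source, dest = [], []
for day in range(30):
    source.append(day)               # source always receives writes
    if day >= DUAL_WRITE_START:
        dest.append(day)             # dual-writes begin; no backfill
    if day >= DUAL_WRITE_START + RETENTION_DAYS - 1:
        # after N days of dual-writes the live windows are identical
        assert retained(source, day) == retained(dest, day)
```

Everything the source held before dual-writes began simply ages out of its own retention window, which is why no backfill step is needed.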

(Source: sources/2026-04-21-planetscale-storing-time-series-data-in-sharded-mysql-to-power-query-insights.)

Canonical PlanetScale Insights application

Rafer Hazen, 2023-08-10: "Insights originally shipped with four shards, but we increased this to eight earlier this year to keep up with increased write volume and to build operation experience resharding. Vitess can re-shard an actively used database, but we opted to provision a new, larger, PlanetScale database when we needed to increase the number of shards. Since Insights currently stores eight days of data, we provisioned a new set of consumers, let the new branch receive duplicate writes for eight days, and then cut the application over to read from the new database."

Timeline:

  • Day 0: provision new 8-shard cluster. Start a new consumer fleet that reads the same Kafka topics as the existing 4-shard consumers but writes to the new cluster. Both clusters now receive writes; the old one is still authoritative for reads.
  • Day 0 → Day 8: new cluster accumulates data. Operators run load tests + resource utilisation checks on the new cluster without risk because reads still go to the old one.
  • Day 8: new cluster has 8 days of data = the full retention window. Cut reads over.
  • Day 8 → Day 9+: old cluster is retired.
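The Day-0 topology above can be sketched in-memory — a hedged stand-in where a list plays the Kafka topic and a modulo hash plays shard routing (Vitess's real vindex logic is more involved):

```python
# Toy events standing in for the Insights query stream (illustrative).
topic = [{"query_id": i, "latency_ms": 3 * i} for i in range(100)]

def shard_for(key, shard_count):
    return key % shard_count  # toy stand-in for Vitess vindex routing

old_cluster = {s: [] for s in range(4)}  # 4-shard cluster, authoritative
new_cluster = {s: [] for s in range(8)}  # empty 8-shard destination

# Both consumer fleets read the same topic; each writes to its own cluster.
for event in topic:
    old_cluster[shard_for(event["query_id"], 4)].append(event)
    new_cluster[shard_for(event["query_id"], 8)].append(event)

reads_from = old_cluster  # reads cut over only after Day 8 validation
```

Because the write path is duplicated upstream of either cluster, neither cluster ever replicates from the other; each independently consumes the stream.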

Why this beats Vitess live reshard

Vitess is capable of live reshard (systems/vitess-movetables + VReplication + VDiff + routing-rule swap cutover). PlanetScale operates this mechanism for customer databases. Yet for Insights' own storage they deliberately chose the dual-write-new-cluster path:

  1. Operator experience. "build operation experience resharding" — the canonical framing. Running two distinct reshard mechanisms on different parts of the fleet gives the team confidence in both.
  2. Clean validation. The destination cluster is independently validated via load tests / prod metrics "before placing it in the critical path." Live reshard shares the source cluster's production traffic throughout the migration.
  3. Rollback simplicity. If the new cluster exhibits unexpected behaviour during the dual-write window, rollback is "don't cut over." No need to tear down a VReplication stream.
  4. Short-retention amplification. The retention-window pattern is only viable when retention is short. For a store retaining weeks / months / years, the "new cluster accumulates the whole history" path is prohibitively long, and live reshard with historical backfill becomes the correct choice.

Relationship to blue-green database deployments

Dual-write branch migration is structurally similar to Aurora blue/green deployments and PlanetScale's own database branching — all three share the "provision isolated destination, sync data from source, cut over after validation" shape. The differences:

  • Blue/green (Aurora): source-side binlog replication streams data to the clone.
  • Database branching (PlanetScale): copy-on-write storage fork.
  • Dual-write branch migration (this concept): both clusters receive writes from the same upstream Kafka topic; no source-side replication involved — the upstream stream is the shared source of truth.

The dual-write variant is only available when there's a durable upstream stream that can be tapped twice. For Insights the stream is Kafka; in other domains it might be a CDC feed, an event bus, or an application-level dual-write commit path.
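One way to picture "tapped twice": each consumer of a durable log tracks its own read position, so adding a second consumer fleet never disturbs the first. A minimal sketch, where the `Cursor` class is a hypothetical stand-in for a Kafka consumer group:

```python
log = [f"event-{i}" for i in range(5)]  # stand-in for a durable Kafka topic

class Cursor:
    """Independent read position over the shared log (illustrative)."""
    def __init__(self):
        self.offset = 0
    def poll(self):
        batch = log[self.offset:]
        self.offset = len(log)
        return batch

old_fleet, new_fleet = Cursor(), Cursor()
# Both fleets observe the identical event sequence; reads never mutate the log.
assert old_fleet.poll() == new_fleet.poll() == log
```

This is the property an application-level dual-write commit path lacks by default: there, the "tap" must be built into the writer itself rather than falling out of the log's read semantics.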

Scope — retention-bounded store

This mechanism only works cleanly for retention-bounded stores: telemetry, caches, materialised views, denormalised projections. For authoritative primary stores (user data, billing data) you cannot rely on the "wait-for-retention-to-age-in" property because there is no retention — the source data is unbounded.
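The failure mode for an unbounded store shows up in the same toy model — a hedged sketch, not production logic: history written before dual-writes began never ages out, so the destination carries a permanent gap absent an explicit backfill.

```python
source = list(range(100))        # history written before dual-writes began
dest = []                        # destination provisioned empty

for row in range(100, 200):      # dual-writes from provisioning onward
    source.append(row)
    dest.append(row)

# With no retention window nothing ever ages out of the source,
# so the pre-existing history is a permanent gap in the destination.
assert set(source) - set(dest) == set(range(100))
```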
