Skip to content

PATTERN Cited by 1 source

Notion four-step sharding migration

Notion's four-step framework for migrating from an unsharded database to a sharded one is Justin Gage's canonical pedagogy framing of a dual-write-plus-verify cutover, with each step named as a distinct phase. Source: Notion's engineering post "Sharding Postgres at Notion" as surfaced in (sources/2026-04-21-planetscale-what-is-database-sharding-and-how-does-it-work):

"1. Double-write: Incoming writes get applied to both the old and new databases. 2. Backfill: Once double-writing has begun, migrate the old data to the new database. 3. Verification: Ensure the integrity of data in the new database. 4. Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads."

The four phases

1. Double-write

New writes land in both the old (unsharded) and new (sharded) databases simultaneously. Old reads continue to be authoritative. The goal: the new database starts accumulating the tail of new writes from the double-write start moment onward, without the team having to pre-reconcile the historical data.

Double-write is the first phase not because it's the safest, but because it sets a fixed point in time from which data-level parity becomes possible: everything from T_double_write_on onward is dual-written; everything before is what backfill needs to copy.

2. Backfill

While double-write continues, a background process migrates historical data (everything written before T_double_write_on) from the old database to the new. The backfill reads from the old DB and writes to the new; it runs at a rate chosen to avoid impacting live traffic. On completion, the new DB holds both the historical rows and every write since double-write was enabled.

The ordering matters: double-write first, backfill second. If the order were reversed, writes that happened during the backfill window would be missing in the new database.

3. Verification

Before cutting traffic over, the team runs row-level integrity checks comparing old-DB and new-DB contents — row counts, checksums, sampled row comparisons, targeted query equivalence. Any divergence is investigated and closed before the next phase starts.

The verification phase is the operational gate — the moment the team decides whether the new database is trustworthy enough to become authoritative.

4. Switch-over (incremental)

Writes have been dual for a while; reads still go to the old DB. Switch-over flips reads over to the new DB, usually incrementally:

  • Step 4a. Enable double-reads (new DB reads served but compared against old; discrepancies logged).
  • Step 4b. Flip a fraction of read traffic (e.g. 1% → 10% → 50% → 100%) to the new DB as primary.
  • Step 4c. Stop double-writing once new-DB reads are 100% and trusted.
  • Step 4d. Decommission the old DB.

The double-reads → shift reads → stop double-write order is what makes the cutover non-destructive: at every point along the progression, the team can reverse the flip by sending reads back to the still-dual-written old DB.

Why these four steps in this order

Each phase establishes a precondition for the next:

Phase Establishes Enables
Double-write Fresh writes are in both DBs Backfill has a bounded window
Backfill Historical writes are in both DBs Verification has complete data
Verification Data parity is confirmed Switch-over is safe to start
Switch-over Reads move to new DB Old DB can be decommissioned

Violating the order leaves gaps (backfill before double-write → missing writes in the gap window) or unverified cutovers (switch-over before verification → trusting data you haven't checked).

Sharding vs single-DB migration

Gage's framing: a shard migration has more failure modes than a migration to a single new DB provider. Each phase applies to each shard in the new layout — the mental model is per-shard × 4-phase-cadence, not global-4-phase. The team can progress shards through phases at different rates, bounded by the weakest-shard's verification outcome.

Relationship to adjacent patterns

  • Dual-write migration is the generic pattern; Notion's four-step is the specific composition with explicit verification-before-switch and incremental-switch-over steps. Dual-write-migration's Airbnb OTLP instance used the same "dual-emit → soak → flip" shape without the explicit verify step named.
  • Expand-migrate-contract is the same-shape pattern at the schema-column altitude; Notion's four-step is at the database-cluster altitude. The expand = double-write, migrate = backfill, contract = switch-over-and-decomm correspondence is exact.
  • Reshard-online-via-VReplication is the substrate-layer implementation of the four-step pattern at the Vitess level — VReplication performs backfill-plus-tail; VDiff is verification; SwitchTraffic is switch-over. Teams building on Vitess get the four-step pattern for free from the substrate.
  • Unsharded-to-sharded migration — the migration axis this pattern addresses. Notion hand-built the pattern in 2021; by 2024-era Vitess it's a UI workflow.

When hand-rolling the pattern is still justified

Hand-rolling the four-step is warranted when:

  • The substrate doesn't offer it. Teams on Postgres without Citus, or on custom sharded substrates, still need to hand-build the pattern.
  • The data layer crosses substrates. Migrating from MySQL to a sharded Postgres deployment (or vice versa) can't use either substrate's built-in reshard.
  • The cutover has external dependencies. Search indexes, CDC consumers, analytics pipelines need their own cutovers coordinated with the primary DB cutover.

Caveat: downtime remains a possibility

Gage's closing framing: "Each of these steps still introduces the possibility of downtime; it's just a risk you're going to have to take for changes at this scale." The four-step reduces expected downtime; it does not eliminate the possibility. Teams should have a per-phase rollback plan.

Seen in

  • sources/2026-04-21-planetscale-what-is-database-sharding-and-how-does-it-work — Justin Gage (guest post for PlanetScale, 2023-04-06) surfaces the pattern from Notion's 2021 Sharding Postgres at Notion post and canonicalises the four-step naming + the incremental-switch-over elaboration ("This can be done incrementally, e.g. double-reads, then migrate all reads."). Gage's framing is the load-bearing contribution — the Notion post itself is one layer of indirection from the wiki.
Last updated · 550 distilled / 1,221 read