
Canary-shard substrate migration

Pattern

When migrating a horizontally-sharded database fleet to a new storage substrate (e.g. EBS → direct-attached NVMe, gp3 → io2, rotating disk → SSD), migrate one shard first — specifically, the busiest one — let it soak for a few days, then roll the remaining shards if the canary validates. The pattern inverts the naive cautious-rollout instinct (start with the quietest shard) in favour of maximum-signal-first.

The PlanetScale Insights instance

"To do this, we picked 1 of our 8 MySQL shards, the busiest one, to upgrade first. … Upgrading a test shard to Metal causes a substantial decrease in latency across all the measured percentiles. After the Metal upgrade, our busiest shard with the highest latencies started executing queries faster than the other shards by a significant margin. After letting the first upgrade soak for a few days, we upgraded the remaining shards and saw nearly identical improvement in performance."

(Source: sources/2026-04-21-planetscale-upgrading-query-insights-to-metal.)

PlanetScale's Insights pipeline was an 8-shard MySQL cluster driven by 800 concurrent writer threads; the busiest shard was the one with the worst latencies on EBS. Picking it first:

  1. Maximised signal-to-noise — the shard most likely to visibly improve from the substrate swap.
  2. Kept a clean baseline — 7 un-migrated shards remain as a live control group on the same real workload.
  3. Bounded blast radius — each shard is already an independent failure domain, so the migration puts at most 1/N of customers at risk at a time.
  4. Validated the new substrate under real load — canary soak under the worst workload is a better test than canary under the easiest workload.
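The control-group idea in point 2 can be sketched as a per-shard percentile comparison: the canary validates only if it beats every un-migrated shard at every measured percentile. The shard names, sample counts, and latency distributions below are illustrative, not from the post; NumPy's `percentile` does the aggregation:

```python
import numpy as np

# Hypothetical per-shard query-latency samples (ms), one array per shard.
# Shard "s3" is the migrated canary; the other seven are the live control group.
rng = np.random.default_rng(0)
latencies = {f"s{i}": rng.lognormal(mean=3.0, sigma=0.6, size=10_000) for i in range(8)}
latencies["s3"] = rng.lognormal(mean=2.2, sigma=0.5, size=10_000)  # post-migration canary

def shard_percentiles(samples, ps=(50, 90, 95, 99)):
    """p50/p90/p95/p99 for one shard's latency samples."""
    return {p: float(np.percentile(samples, p)) for p in ps}

canary = shard_percentiles(latencies["s3"])
control = {name: shard_percentiles(s) for name, s in latencies.items() if name != "s3"}

# The canary validates if it beats even the best control shard at every percentile.
for p, v in canary.items():
    best_control = min(c[p] for c in control.values())
    print(f"p{p}: canary={v:.1f}ms best-control={best_control:.1f}ms better={v < best_control}")
```

With real data the `latencies` dict would come from the metrics pipeline, one entry per shard, which is exactly the per-shard visibility the pattern depends on.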

Mechanics

  1. Pick the busiest shard — highest per-shard load by the metric the substrate swap is supposed to improve (write latency, read QPS, IOPS saturation). The noisiest shard produces the clearest before/after signal.
  2. Migrate the substrate only — no application, schema, or sharding-configuration changes. Single-variable test.
  3. Observe per-shard percentiles side-by-side — p50 / p90 / p95 / p99 graphs with one line per shard make the substrate swap visible as a single shard diverging from the pack. The Insights post's graphs ("the purple line corresponds to our busiest shard") are the canonical visualisation.
  4. Soak for a few days — validate the upgrade under:
     • Peak daily-cycle load
     • Any weekly / business-hour load patterns
     • Backup windows
     • Replication topology events (failover, reparent)
     • Any substrate-specific failure modes that only manifest over time (noisy-neighbour, correlated storage events, drive-wear artefacts)
  5. Roll out to remaining shards — if the canary soak is clean, migrate the other N−1 shards. The Insights post's outcome: "we upgraded the remaining shards and saw nearly identical improvement in performance."
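Steps 1 and 5 reduce to a selection rule and a rollout gate. A minimal sketch, assuming per-shard metrics are available as plain dicts; the shard names, p99 values, and the 20% improvement threshold are all illustrative assumptions, not from the post:

```python
# Illustrative per-shard write-latency p99s (ms) sampled before the migration.
pre = {"s0": 41.0, "s1": 38.5, "s2": 55.2, "s3": 72.9,
       "s4": 44.1, "s5": 39.8, "s6": 47.3, "s7": 50.6}

def pick_canary(per_shard_metric):
    """Step 1: the busiest/worst shard gives the clearest before/after signal."""
    return max(per_shard_metric, key=per_shard_metric.get)

def soak_is_clean(pre_p99, soak_p99s, min_improvement=0.2):
    """Step 5 gate: every daily post-migration p99 observed during the soak
    must improve on the canary's own pre-migration baseline."""
    return all(p <= pre_p99 * (1 - min_improvement) for p in soak_p99s)

canary = pick_canary(pre)                # the shard with the worst p99
soak = [31.0, 29.4, 33.2, 30.8]          # hypothetical daily p99s during the soak
rollout = [s for s in pre if s != canary] if soak_is_clean(pre[canary], soak) else []
print(canary, rollout)
```

The single-variable discipline of step 2 lives outside this sketch: between the `pre` samples and the soak samples, only the storage substrate changes.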

Why busiest-first

The common alternative (quietest-first) has two flaws:

  • Low signal: a quiet shard won't show substrate-swap benefits because it wasn't being stressed on the old substrate. Operators can't distinguish "the swap works" from "the load was never high enough to matter."
  • Worst-case blind spot: if the new substrate has a high-concurrency failure mode, a quiet shard will never trigger it. The canary has to exercise the new substrate to validate it.

Busiest-first is the empirical equivalent of load-testing under production traffic — the canary shard runs the worst-case shape of the workload on the new substrate, so the soak period is a production stress test of the migration.

Composition with sharding topology

This pattern is only possible because horizontal sharding already creates N independent failure domains: the migration operation is scoped to a single shard, and the rest of the fleet is unaffected. The same per-shard isolation underlies the alternative rollout strategies compared below.

When to use

  • Substrate migration (storage type, instance family, region, AZ layout) on a horizontally-sharded fleet.
  • Any change where per-shard metrics differ significantly — the canary signal requires per-shard visibility.
  • New substrate has real uncertainty — enough that a days-long soak is valuable vs. an immediate full rollout.

When not to use

  • Un-sharded system. No per-shard decomposition, no canary-shard surface.
  • Migration cannot be scoped to one shard. If the new substrate requires cluster-wide topology changes, this pattern doesn't apply; use [[patterns/progressive-delivery-per-database|progressive delivery per database]] instead.
  • Uneven shards. If shard sizes differ wildly (1 huge shard + 7 tiny shards), the "busiest" shard may also be the only shard where migration is risky; operators may prefer a medium-load shard first.
  • Blue/green is available at the shard level. Per-shard blue/green (dual-write + cutover) eliminates the rollback pain of an in-place substrate swap.

Trade-offs vs alternatives

| Approach | When it's better |
| --- | --- |
| Canary-shard busiest-first (this pattern) | Sharded fleet + per-shard independent substrate + real substrate uncertainty |
| Canary-shard quietest-first | Substrate is known-stable but the migration process is novel |
| All-shard simultaneous migration | Substrate + migration both known-stable; speed > caution |
| Blue/green per shard | In-place rollback is impossible; parallel old+new clusters acceptable |
| [[patterns/progressive-delivery-per-database]] | Un-sharded; need customer-level granularity instead |

Seen in

  • sources/2026-04-21-planetscale-upgrading-query-insights-to-metal — canonical instance. PlanetScale's Insights Kafka → sharded-MySQL pipeline migrated from EBS (with provisioned IOPS) to Metal direct-attached NVMe. Busiest of 8 MySQL shards upgraded first; "purple line" in the per-shard percentile plots went from worst to best across p50/p90/p95/p99. Soaked for "a few days", then the remaining 7 shards were upgraded to "nearly identical improvement in performance."