PATTERN
Shard-parallel backup and restore¶
Problem¶
At multi-TB scale, both backup and restore become wall-clock-bound by the ratio of database size to per-server bandwidth. A 20 TB backup at 100 MB/s takes roughly 56 hours; the restore takes just as long. Neither fits within a typical 12-hour backup cadence, and a multi-day restore is an unacceptable outage duration.
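The back-of-envelope arithmetic, assuming decimal units (1 TB = 10¹² bytes, 1 MB = 10⁶ bytes), can be sketched as:

```python
def wall_clock_hours(size_tb: float, bandwidth_mb_s: float) -> float:
    """Wall-clock time to move size_tb through a single pipe at bandwidth_mb_s."""
    seconds = size_tb * 1e12 / (bandwidth_mb_s * 1e6)
    return seconds / 3600

print(wall_clock_hours(20, 100))  # 20 TB at 100 MB/s -> ~55.6 hours
```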
Solution¶
Spread the database across N shards. Run backup and restore N-way in parallel. Each shard backs up (and restores) independently on its own compute + network pipe. Aggregate throughput = N × per-shard bandwidth. Wall-clock scales down proportionally.
Both operations are embarrassingly parallel under horizontal sharding: no coordination between shards is needed, and no shared resource becomes a bottleneck. The same shard count N that scales writes also scales backup and restore.
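The fan-out can be sketched as follows. The `backup_shard` callable here is a hypothetical placeholder; the real per-shard worker at PlanetScale is an ephemeral VTBackup instance, not a Python function:

```python
from concurrent.futures import ThreadPoolExecutor

def backup_shard(shard_id: int) -> str:
    # Placeholder for the per-shard worker: in practice this launches a
    # backup process with its own compute and network pipe.
    return f"shard-{shard_id}: backup complete"

def backup_all(num_shards: int) -> list[str]:
    # No inter-shard coordination: every shard is submitted independently,
    # so wall clock is roughly the slowest single shard, not the sum.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(backup_shard, range(num_shards)))

results = backup_all(4)
```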
Why both directions matter¶
Backup parallelism is an operational nicety; restore parallelism is a disaster-recovery requirement:
- A slow backup is a background concern: it might overlap with the next backup cycle, but it doesn't affect customer-visible availability.
- A slow restore is an extended outage. A ~56-hour full-database restore means more than two days of downtime during disaster recovery. A 2-hour shard-parallel restore is survivable.
Per the PlanetScale post (Source: sources/2026-04-21-planetscale-faster-backups-with-sharding): "The speed benefits that one gains from backing up in parallel with sharding also apply in reverse when performing a full database restore. All of the same parallelization can be used, each shard individually restoring the data it is responsible for. This allows the restoration of a massive database to take mere hours rather than days or weeks."
Worked numbers¶
From measured PlanetScale production backups:
- 1 shard × 161 GB: 30 min 40 s (unsharded baseline).
- 32 shards × ~625 GB each (20 TB total): 1 h 39 min 4 s (~6.7 GB/s aggregate).
- 256 shards × ~900 GB each (~230 TB total): 3 h 37 min 11 s (~35 GB/s aggregate).
Per-shard throughput stays roughly constant (~137–210 MB/s across these cases). Aggregate throughput scales with shard count.
Restore durations follow the same scaling (not individually measured in the post but structurally guaranteed by the pattern's per-shard-independent property).
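The per-shard figures quoted above fall out of dividing aggregate throughput by shard count. A quick check, taking the ~6.7 GB/s and ~35 GB/s aggregates as given:

```python
cases = {32: 6.7, 256: 35.0}  # shard count -> aggregate GB/s (decimal units)

for shards, aggregate_gb_s in cases.items():
    # Per-shard throughput = aggregate throughput / shard count.
    per_shard_mb_s = aggregate_gb_s * 1000 / shards
    print(f"{shards} shards: ~{per_shard_mb_s:.0f} MB/s per shard")
# 32 shards -> ~209 MB/s, 256 shards -> ~137 MB/s
```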
Composes with¶
- patterns/dedicated-backup-instance-with-catchup-replication — the per-shard backup implementation at PlanetScale; each shard spins up an ephemeral VTBackup instance.
- patterns/snapshot-plus-catchup-replication — the underlying data-motion primitive.
- patterns/per-shard-replica-set — the per-shard HA primitive that exists independently of backups.
When to use¶
- Multi-TB sharded databases where monolithic backup/restore would exceed operational windows.
- Disaster-recovery SLAs that require hours-not-days RTO.
- Object storage that can absorb high aggregate ingest (S3 / GCS auto-scale but have prefix-level throttling).
When not to use¶
- Non-sharded databases (the pattern doesn't apply).
- Architectures where backup is serialised through a central coordinator (single-dump-server designs lose the parallelism).
- Deployments where cross-shard consistency is required at backup-time (per-shard-independent backups are consistent per-shard but not globally point-in-time).
Caveats¶
- Cross-shard consistency is lost. Each shard's backup is consistent as of its own GTID / snapshot. Backing up shard 1 at T1 and shard 2 at T2 does not yield a whole-database point-in-time. A cross-shard transaction that commits between T1 and T2 can surface in the restore on one shard but not the other.
- Storage-side throttling. At 35 GB/s aggregate into S3 / GCS, prefix-level throttling becomes a concern if all shards write similar-prefix keys.
- Compute provisioning. N ephemeral backup workers need to be launchable in parallel. PlanetScale uses Singularity; self-managed deployments need their own primitive.
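One common mitigation for the prefix-throttling caveat is to push a distinguishing component, such as a short hash of the shard ID, to the front of the object key so each shard lands on a distinct prefix. This is a sketch, not PlanetScale's disclosed layout; the key scheme is hypothetical:

```python
import hashlib

def backup_key(shard_id: int, backup_ts: str) -> str:
    # A short hash prefix spreads shards across distinct key prefixes;
    # object stores like S3 scale request limits per prefix, so this avoids
    # funneling all N shards' writes through one throttled prefix.
    h = hashlib.sha256(str(shard_id).encode()).hexdigest()[:4]
    return f"{h}/shard-{shard_id}/{backup_ts}/backup.tar"

print(backup_key(7, "2026-04-21T00-00"))
```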
Canonical wiki instance¶
PlanetScale's production backup architecture. The 256-shard backup at ~35 GB/s aggregate is the largest disclosed example; the 32-shard, 20 TB backup at 1 h 39 min is the load-bearing teaching example from the Dicken primer.
Seen in¶
- sources/2026-04-21-planetscale-faster-backups-with-sharding — canonical wiki disclosure of both backup and restore parallelism.
- sources/2026-04-21-planetscale-database-sharding — teaching-altitude framing of the property: 4 TB / 100 MB/s = 11 h unsharded, 4 × 1 TB shards in parallel = 2.7 h.