PATTERN Cited by 2 sources

Shard-parallel backup and restore

Problem

At multi-TB scale, both backup and restore become wall-clock-bound by the ratio of database size to per-server bandwidth. A 20 TB backup through a single 100 MB/s pipe takes roughly 56 hours; the restore takes just as long. Neither fits within a typical 12-hour backup cadence, and a multi-day restore is an unacceptable outage duration.
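
The bound is simple arithmetic; a throwaway helper (not from the post) makes the serial cost concrete:

```python
def wall_clock_hours(total_bytes: float, bytes_per_sec: float) -> float:
    """Wall-clock time for a serial backup or restore, in hours."""
    return total_bytes / bytes_per_sec / 3600

# 20 TB through a single 100 MB/s pipe, each direction:
print(round(wall_clock_hours(20e12, 100e6)))  # → 56
```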

Solution

Spread the database across N shards. Run backup and restore N-way in parallel. Each shard backs up (and restores) independently on its own compute + network pipe. Aggregate throughput = N × per-shard bandwidth. Wall-clock scales down proportionally.

Both operations are embarrassingly parallel under horizontal sharding: no coordination between shards is needed, and no shared resource becomes a bottleneck. The same shard count N that scales writes also scales backup and restore.
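
A minimal sketch of the fan-out, assuming a hypothetical per-shard `backup_shard` primitive (in production each call would stream one shard's snapshot to object storage from its own worker):

```python
from concurrent.futures import ThreadPoolExecutor

def backup_shard(shard_id: int) -> str:
    """Hypothetical per-shard backup; returns the manifest location.
    A real implementation streams the shard's snapshot over its own pipe."""
    return f"s3://backups/db/shard-{shard_id:04d}/snapshot"

def backup_all(num_shards: int) -> list[str]:
    # No cross-shard coordination: every shard backs up independently,
    # so wall-clock time is max(per-shard time), not the sum.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(backup_shard, range(num_shards)))

manifests = backup_all(32)
print(len(manifests))  # → 32
```

Restore is the same fan-out in reverse: map a `restore_shard` over the same shard ids.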

Why both directions matter

Backup parallelism is an operational nicety; restore parallelism is a disaster-recovery requirement:

  • A slow backup is a background concern: it might overlap with the next backup cycle, but it doesn't affect customer-visible availability.
  • A slow restore is an extended outage. A ~56-hour full-database restore means well over two days of downtime during disaster recovery. A 2-hour shard-parallel restore is survivable.

Per the PlanetScale post (Source: sources/2026-04-21-planetscale-faster-backups-with-sharding): "The speed benefits that one gains from backing up in parallel with sharding also apply in reverse when performing a full database restore. All of the same parallelization can be used, each shard individually restoring the data it is responsible for. This allows the restoration of a massive database to take mere hours rather than days or weeks."

Worked numbers

From measured PlanetScale production backups:

  • 1 shard × 161 GB: 30 min 40 s (unsharded baseline).
  • 32 shards × ~625 GB each (20 TB total): 1 h 39 min 4 s (~6.7 GB/s aggregate).
  • 256 shards × ~900 GB each (~230 TB total): 3 h 37 min 11 s (~35 GB/s aggregate).

Per-shard throughput stays roughly constant (~137–210 MB/s across these cases). Aggregate throughput scales with shard count.
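
The aggregate figures are just shard count times per-shard throughput; a quick check against the numbers above (decimal units assumed):

```python
def aggregate_gbps(shards: int, per_shard_mbps: float) -> float:
    """Aggregate throughput in GB/s from N shards at a given MB/s each."""
    return shards * per_shard_mbps / 1000

print(aggregate_gbps(32, 210))   # → 6.72   (~6.7 GB/s)
print(aggregate_gbps(256, 137))  # → 35.072 (~35 GB/s)
```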

Restore durations follow the same scaling: the post does not measure them individually, but the pattern's per-shard independence makes the parallelism structural rather than incidental.
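
Under the same model, parallel restore wall-clock is one shard's serial time (illustrative arithmetic only, using the 100 MB/s per-pipe figure from the Problem section):

```python
def restore_hours(total_bytes: float, shards: int, per_shard_bps: float) -> float:
    # Shards restore concurrently, so wall-clock is one shard's time,
    # not the whole database's.
    return (total_bytes / shards) / per_shard_bps / 3600

# 20 TB over 32 shards at 100 MB/s per shard: hours instead of days.
print(round(restore_hours(20e12, 32, 100e6), 1))  # → 1.7
```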

When to use

  • Multi-TB sharded databases where monolithic backup/restore would exceed operational windows.
  • Disaster-recovery SLAs that require hours-not-days RTO.
  • Object storage that can absorb high aggregate ingest (S3 / GCS auto-scale but have prefix-level throttling).

When not to use

  • Non-sharded databases (the pattern doesn't apply).
  • Architectures where backup is serialised through a central coordinator (single-dump-server designs lose the parallelism).
  • Deployments where cross-shard consistency is required at backup-time (per-shard-independent backups are consistent per-shard but not globally point-in-time).

Caveats

  • Cross-shard consistency is lost. Each shard's backup is consistent as of its own GTID / snapshot position. A backup of shard 1 at T1 and shard 2 at T2 is not a whole-database point-in-time: a cross-shard transaction committed between T1 and T2 can appear in one shard's backup but not the other's, so restoring both does not reproduce any state the database actually passed through.
  • Storage-side throttling. At 35 GB/s aggregate into S3 / GCS, prefix-level throttling becomes a concern if all shards write similar-prefix keys.
  • Compute provisioning. N ephemeral backup workers need to be launchable in parallel. PlanetScale uses Singularity; self-managed deployments need their own primitive.
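
One common mitigation for the prefix-throttling caveat is to spread shard uploads across distinct key prefixes, e.g. by hashing the shard identity into the front of the object key so per-prefix rate limits apply per shard rather than to the whole fleet. A sketch with illustrative naming (not PlanetScale's scheme):

```python
import hashlib

def backup_key(db: str, shard_id: int, filename: str) -> str:
    # A short hash prefix distributes shards across S3 / GCS key prefixes;
    # keys sharing a long common prefix would all hit the same rate limit.
    h = hashlib.sha256(f"{db}/{shard_id}".encode()).hexdigest()[:4]
    return f"{h}/{db}/shard-{shard_id:04d}/{filename}"

print(backup_key("orders", 17, "snapshot.tar"))
```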

Canonical wiki instance

PlanetScale's production backup architecture. The 256-shard backup at 35 GB/s aggregate is the largest disclosed example; the 32-shard, 20 TB run at 1 h 39 min is the load-bearing teaching example from the Dicken primer.
