CONCEPT Cited by 2 sources
Shard-parallel backup¶
Definition¶
Shard-parallel backup is the architectural property of a horizontally sharded database where aggregate backup throughput scales as N × per-server bandwidth (N = shard count), so backup wall-clock time for a fixed dataset scales down as 1/N. Each shard's backup runs independently on its own compute path — its own primary/replica set, its own ephemeral backup worker, its own network pipe to object storage. Aggregate throughput is the sum of per-shard throughput; per-shard throughput stays roughly constant as N grows. The same property applies in reverse to restore.
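A back-of-envelope model of the property (a sketch only: it assumes even shard sizes and no shared bottleneck, and uses the table's ~176 MB/s per-shard figure; the measured wall-clocks below run longer than raw copy time, since a production backup does more than copy bytes):

```python
GB, TB = 1e9, 1e12

def backup_wall_clock_hours(total_bytes, n_shards, per_shard_bw):
    """Simplified model: shards back up in parallel, so wall-clock is one
    shard's size over one shard's bandwidth. Assumes even shard sizes and
    no shared bottleneck (e.g. object-storage throttling)."""
    return (total_bytes / n_shards) / per_shard_bw / 3600

# Same dataset, more shards: wall-clock drops as 1/N while aggregate
# throughput (N * per_shard_bw) rises linearly.
print(backup_wall_clock_hours(20 * TB, 1, 176e6))   # ~31.6 h on one pipe
print(backup_wall_clock_hours(20 * TB, 32, 176e6))  # ~1.0 h on 32 pipes
```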
The key insight: backup is embarrassingly parallel under horizontal sharding. Unlike queries (where cross-shard fan-out carries a latency tax and coordination cost), backups read rows on disjoint shards in any order. No shard needs to wait for any other.
Worked numbers (PlanetScale, 2024)¶
From Ben Dicken's production data (Source: sources/2026-04-21-planetscale-faster-backups-with-sharding):
| Shape | Size | Shards | Per-shard size | Wall-clock | Aggregate throughput | Per-shard throughput |
|---|---|---|---|---|---|---|
| Unsharded | 161 GB | 1 | 161 GB | 30m 40s | ~176 MB/s | ~176 MB/s |
| Sharded | 20 TB | 32 | ~625 GB | 1h 39m 4s | ~6.7 GB/s | ~210 MB/s |
| Sharded | ~230 TB | 256 | ~900 GB | 3h 37m 11s | ~35 GB/s | ~137 MB/s |
The 20 TB case is the load-bearing illustration: naive extrapolation of the unsharded run (161 GB in 30 m 40 s) predicts ~63 hours for a 20 TB backup, which would overlap with the next scheduled backup or force days of backup delay. The sharded result of 1 h 39 min is ~38× faster — not because individual servers got faster, but because 32 of them ran in parallel.
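The arithmetic behind those two figures, using the source's numbers:

```python
# Naive extrapolation: scale the unsharded wall-clock linearly with size.
unsharded_gb, unsharded_s = 161, 30 * 60 + 40        # 161 GB in 30m 40s
naive_20tb_h = (20_000 / unsharded_gb) * unsharded_s / 3600
print(round(naive_20tb_h))                           # -> 63 (hours)

sharded_s = 1 * 3600 + 39 * 60 + 4                   # 1h 39m 4s at 32 shards
print(round(naive_20tb_h * 3600 / sharded_s))        # -> 38 (x speedup)
```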
Why it's load-bearing¶
Backup cadence is a production-availability constraint. A 12-hour backup cadence requires backups to complete in well under 12 h; a 63-hour backup on a 12-hour cadence means overlapping backups, which either (a) queue up, producing 2–3 day gaps where no backup is taken, or (b) run simultaneously, saturating server and network resources. Both failure modes are structural, not tunable.
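Concretely, with the 12-hour cadence and the durations above:

```python
cadence_h = 12
for duration_h in (1.65, 63):        # sharded vs extrapolated unsharded
    overlapped = int(duration_h // cadence_h)
    print(f"{duration_h} h backup overlaps {overlapped} scheduled starts")
# 1.65 h overlaps 0 starts; 63 h overlaps 5, which must queue or run
# concurrently -- the two failure modes named above.
```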
Shard-parallel backup makes backup wall-clock a property limited by per-shard size and bandwidth rather than by total database size. Adding shards to scale writes automatically scales backup too, with no separate backup infrastructure to grow.
Restore inherits the property¶
The same parallelism applies to full database restore: each shard restores independently. "This allows the restoration of a massive database to take mere hours rather than days or weeks" (Source: sources/2026-04-21-planetscale-faster-backups-with-sharding).
Restore parallelism is strictly more operationally important than backup parallelism: a slow backup is a background issue; a slow restore is an extended outage. Shard-parallel restore converts a days-long disaster-recovery window into an hours-long one.
Composition¶
The property composes on top of snapshot-plus-catchup-replication applied per-shard. Each shard runs the same consistent-snapshot + GTID-capture + row-copy + binlog-catchup protocol the 2026-02-16 VReplication post canonicalises — just N instances of it in parallel.
The production instance at PlanetScale is the dedicated-backup-instance pattern per shard: each shard spins up a fresh VTBackup instance, restores the previous backup, catches up from the primary's binlog, and writes the new full backup to object storage.
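A schematic of that per-shard cycle (a sketch only; the stub functions are hypothetical stand-ins for the VTBackup stages, not its actual API, and retries/error handling are omitted):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the per-shard stages; a real deployment runs
# VTBackup, not these stubs.
def provision_ephemeral_worker(shard: str) -> str:
    return f"vtbackup-{shard}"

def restore_previous_backup(worker: str, shard: str) -> None:
    print(f"{worker}: restoring {shard}'s previous backup")

def catch_up_from_binlog(worker: str, shard: str) -> None:
    print(f"{worker}: replaying {shard}'s primary binlog until caught up")

def upload_full_backup(worker: str, shard: str) -> str:
    return f"s3://backups/{shard}/full"  # each shard gets its own pipe out

def backup_shard(shard: str) -> str:
    worker = provision_ephemeral_worker(shard)
    restore_previous_backup(worker, shard)
    catch_up_from_binlog(worker, shard)
    return upload_full_backup(worker, shard)

# Shards share nothing, so the fleet backs up concurrently; wall-clock is
# set by the slowest shard, not by the sum over shards.
shards = [f"shard-{i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    backup_paths = list(pool.map(backup_shard, shards))
```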
Per-shard throughput is approximately constant¶
Measured per-shard throughput stays within a narrow band: ~176 MB/s (unsharded, 1 shard) → ~210 MB/s (32 shards) → ~137 MB/s (256 shards). The variation reflects hardware differences between shards, not parallelism overhead; the scaling factor is N itself, not something sub-linear.
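Dividing each quoted aggregate by its shard count recovers the per-shard column:

```python
# (label, aggregate GB/s, shard count) from the table above
rows = [("1 shard", 0.176, 1), ("32 shards", 6.7, 32), ("256 shards", 35.0, 256)]
for label, aggregate_gb_s, n in rows:
    print(f"{label}: {aggregate_gb_s / n * 1000:.0f} MB/s per shard")
# -> 176, 209, 137 MB/s: a ~1.5x band across a 256x change in N
```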
Caveats¶
- Not all backup architectures parallelise automatically. Systems that centralise backup through a single coordinator (e.g., a central dump server reading from all shards) serialise at the coordinator and lose the property. Shard-parallel backup requires per-shard backup workers with independent network pipes to storage.
- Cross-shard consistency is lost. Each shard's backup is consistent as of that shard's own GTID / snapshot. A backup of shard 1 at time T1 and shard 2 at time T2 is not a point-in-time snapshot of the whole database. For consistent cross-shard backup, a freeze-point-across-shards mechanism is needed — which PlanetScale does not appear to use at this altitude; each shard is independently point-in-time consistent.
- First backup is more expensive. Steady-state backups only send the catchup delta (~12–24 h of changes); the first-ever backup must send the full shard contents from the primary.
- Object storage is the aggregate-throughput ceiling. At 35 GB/s aggregate (the 230 TB / 256-shard case), the storage bucket's ingestion rate becomes the next bottleneck. Amazon S3 and Google Cloud Storage both auto-scale, but per-prefix request-rate limits can bite if all shard backups write keys under one shared prefix (a mitigation sketch follows this list).
- Per-shard throughput depends on hardware + network. The 137–210 MB/s range reflects PlanetScale's specific instance types; other deployments will measure differently.
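A minimal sketch of the usual prefix-throttling mitigation, spreading backups across distinct key prefixes (the key layout is illustrative, not PlanetScale's actual scheme):

```python
import hashlib

def backup_key(shard_id: str, timestamp: str) -> str:
    """Hash-prefix the key so the fleet's writes land on many storage
    partitions instead of one hot prefix. Illustrative layout only."""
    spread = hashlib.sha256(shard_id.encode()).hexdigest()[:4]
    return f"{spread}/{shard_id}/{timestamp}/full-backup.tar"

print(backup_key("shard-007", "2024-07-30T00:00Z"))
# -> e.g. "3f9a/shard-007/2024-07-30T00:00Z/full-backup.tar"
```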
Seen in¶
- sources/2026-04-21-planetscale-faster-backups-with-sharding — canonical wiki disclosure. Three production data points (161 GB / 20 TB / 230 TB) with measured wall-clock + aggregate + per-shard throughput. Seven-step dedicated-backup-instance pattern naming VTBackup + Singularity as the per-shard components. Restore inherits the same property. 20 TB / 32 shards / 1h 39m is the canonical worked instance.
- sources/2026-04-21-planetscale-database-sharding — Ben Dicken's 2025-01-09 interactive primer frames the property numerically at a teaching altitude: 4 TB at 100 MB/s is 11 h unsharded, 4 × 1 TB shards in parallel is ~2.7 h. Named as one of the "surprising benefits of sharding". The 2024-07-30 faster-backups post (canonicalised here) is the production-measured companion.