PlanetScale — Faster backups with sharding¶
Summary¶
Ben Dicken (PlanetScale, 2024-07-30) canonicalises PlanetScale's production backup architecture for sharded MySQL databases and the shard-parallel backup property — aggregate backup throughput scales as N × per-server bandwidth (N = shard count), so backup wall-clock time drops roughly as 1/N, turning a multi-day backup of a monolithic 20 TB database into a ~1.6 h backup of the same data spread across 32 shards. The post is the canonical architectural disclosure of the seven-step dedicated-backup-instance-with-catchup-replication pattern that PlanetScale uses internally, and names two previously undisclosed Vitess/PlanetScale components: VTBackup (the Vitess program that orchestrates the per-shard backup) and PlanetScale Singularity (PlanetScale's internal infrastructure-management service that spins up ephemeral compute for each backup).
The load-bearing framing, verbatim:
"Each shard can complete steps 2-7 in parallel. This parallelization allows backups to be taken quickly, even for extremely large databases."
The seven-step per-shard backup choreography is:
- Internal PlanetScale API initiates a backup request for the database (production branch + all dev branches).
- PlanetScale Singularity spins up a new compute instance in the same cluster as the primaries and replicas. This instance will run VTBackup, which manages the backup process.
- If a previous backup exists (the common case), it's restored to the dedicated VTBackup instance via the Vitess `builtin` backup policy. Retrieved from Amazon S3 or Google GCS depending on cloud; encrypted at rest and decrypted on retrieval.
- VTBackup spins up a new MySQL instance running atop the fetched backup.
- VTBackup instructs the new MySQL to connect to the primary VTGate, request a checkpoint-in-time, and replicate all changes between the last backup and the checkpoint (typically a small % of total DB size).
- After catch-up completes, the MySQL instance that managed catch-up replication is stopped.
- Regular Vitess backup workflow starts, storing the new full backup to S3/GCS.
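The seven steps above can be sketched as a plain orchestration function. This is a minimal illustrative sketch only: the function names and step strings are hypothetical stand-ins, not real Vitess or PlanetScale APIs.

```python
def run_shard_backup(shard, previous_backup=None):
    """One shard's backup on a dedicated ephemeral instance (steps 1-7)."""
    steps = [
        f"api: backup requested for {shard}",            # 1. internal API initiates
        f"singularity: ephemeral instance for {shard}",  # 2. spin up fresh compute
    ]
    if previous_backup is not None:                      # 3. common case: restore prior backup
        steps.append(f"vtbackup: restore {previous_backup} from object storage")
    steps += [
        "vtbackup: start MySQL atop restored data",      # 4
        "vtbackup: replicate from primary to checkpoint",  # 5. catchup delta only
        "vtbackup: stop catchup MySQL",                  # 6
        "vtbackup: upload new full backup to S3/GCS",    # 7
    ]
    return steps

def run_cluster_backup(shards, previous_backups):
    # Steps 2-7 run in parallel per shard in production; sequential here for clarity.
    return {s: run_shard_backup(s, previous_backups.get(s)) for s in shards}
```

The first-ever backup takes the shorter path (no step 3), which is exactly the case where the full database contents must come from the primary.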
The structural insight: the backup is produced on a dedicated ephemeral instance that never served production traffic — the production primary is read from only for the catchup delta, not the full backup contents. This is why the architecture uses a restore-then-catchup-then-snapshot flow rather than a direct-from-primary dump: every backup after the first avoids sending the "full database contents" from the primary, sending only what changed in the last 12 or 24 hours.
The shard-parallel-backup property is documented with three measured production instances:
- 161 GB unsharded (8 vCPU + 32 GB RAM primary + 2 replicas): 30 min 40 s, ~176 MB/s aggregate throughput (approximated via `(prev_backup_size + new_backup_size) / duration`, accepted as rough).
- 20 TB sharded across 32 shards (each shard ~625 GB, with resources comparable to the 161 GB example): 1 h 39 min 4 s at ~6.7 GB/s aggregate = ~210 MB/s per shard. Naïve extrapolation would predict ~63 h; the real result is ~38× faster, entirely from parallelisation.
- Large sharded database (implied ~230 TB) across 256 shards (each shard ~900 GB): 3 h 37 min 11 s at ~35 GB/s aggregate = ~137 MB/s per shard.
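A back-of-envelope model reproduces the scaling claim from the post's numbers. It assumes per-shard throughput stays roughly constant as shards are added, and counts ~2× the database size moved per backup (previous backup restored in, new full backup written out), matching the post's accounting:

```python
TB, MB = 10**12, 10**6

def backup_hours(db_bytes, shard_count, per_shard_bytes_per_sec):
    # Previous backup restored + new full backup written: ~2x the DB size moves.
    bytes_moved = 2 * db_bytes
    aggregate = shard_count * per_shard_bytes_per_sec
    return bytes_moved / aggregate / 3600

# Naive unsharded extrapolation at the measured ~176 MB/s single-server rate.
unsharded = backup_hours(20 * TB, 1, 176 * MB)   # ~63 h
# 32-way parallel at the measured ~210 MB/s per-shard rate.
sharded = backup_hours(20 * TB, 32, 210 * MB)    # ~1.65 h, i.e. ~1 h 39 min
```

With the post's inputs this yields ~63 h unsharded versus ~1.65 h at 32 shards, the ~38× gap quoted above.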
Naming a core production design choice explicitly: VTBackup catches up by replicating from the primary VTGate, not from a replica. The trade-off (canonicalised as the new concepts/primary-vs-replica-as-replication-source page) is explicit in the post:
"If taken from a primary, it will have the most up to date information. If taken from the replica, we avoid sending additional compute, I/O, and bandwidth demand to the primary server. However, in our case, the primary is already performing replication to two other nodes. Also, unless it is the first backup, the primary does not need to send the full database contents to the backup server. It only needs to send what has changed since the last backup, ideally only 12 or 24 hours prior. Thus, having the backup server replicate from the primary is typically acceptable from a performance perspective."
Restore inherits the same parallelism: a full restore of a 20 TB / 32-shard database also runs ~32-way parallel, turning a multi-day restore into a few hours.
The post also canonicalises backup's non-obvious load-bearing roles beyond disaster recovery, all of which depend on the sharded-parallel property to be operationally viable:
- New replica creation: when a primary fails and a replica is promoted, the new replica is seeded from a backup (restored onto a fresh empty server), then caught up from the new primary via replication. Without backups, the full DB would have to be replicated from the primary — "a long time and a negative impact on performance."
- Snapshot-in-time recovery for accidental deletes: customer case study (Dub) — end-user accidentally deleted rows; no soft-delete in the app; data restored from backup. Canonical production instance of the soft-delete-vs-hard-delete trade-off made at the backup layer: hard-delete is fine if backups give you a time-travel escape hatch.
- Point-in-time recovery: Vitess PITR requires backups as the base image + binlog tail.
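The backup-plus-binlog-tail shape of point-in-time recovery can be sketched as a toy replay loop. Purely illustrative: the event tuples and function below are hypothetical, not the Vitess PITR interface.

```python
def point_in_time_restore(base_backup, binlog, target_ts):
    """Restore the base image, then replay binlog events up to target_ts."""
    state = dict(base_backup["rows"])    # 1. restore the full backup image
    for ts, key, value in binlog:        # 2. replay the binlog tail in order
        if ts > target_ts:
            break                        # stop at the requested moment in time
        if value is None:
            state.pop(key, None)         # a delete event
        else:
            state[key] = value           # an insert/update event
    return state
```

Picking a `target_ts` just before an accidental hard delete is the time-travel escape hatch described in the Dub case above.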
Key takeaways¶
- Shard-parallel backup is the canonical wiki property: aggregate backup throughput scales as N × per-server bandwidth (N = shard count), so backup wall-clock time drops roughly as 1/N. 20 TB across 32 shards backs up in 1 h 39 min (6.7 GB/s aggregate); the same 20 TB unsharded would take ~63 h by extrapolation. Per-shard throughput is nearly constant (~210 MB/s at 32 shards, ~137 MB/s at 256 shards); the aggregate-throughput gain is entirely from parallelisation. Canonical page: concepts/shard-parallel-backup.
- The seven-step dedicated-backup-instance pattern is the canonical architectural shape. PlanetScale's per-shard backup runs on a fresh ephemeral compute instance (not the primary, not an existing replica) that: (a) restores the previous backup, (b) spins up a MySQL on it, (c) catches up via replication from the primary VTGate, (d) takes a new full backup. Every step after the first avoids sending the full DB from any production node. Canonical page: patterns/dedicated-backup-instance-with-catchup-replication.
- Primary-as-replication-source is a deliberate PlanetScale design choice with a named trade-off. The alternative (replicate from a replica) saves primary CPU/IO/bandwidth but gives slightly staler catchup checkpoints. PlanetScale accepts the primary load because, post-first-backup, the catchup delta is small (12–24 h of changes), and the primary is already replicating to two other replicas anyway. Canonical page: concepts/primary-vs-replica-as-replication-source. Verbatim mitigation: "If this performance hit becomes an issue, backups can be scheduled to happen during lower traffic hours."
- VTBackup is the canonical Vitess program managing the per-shard backup. Dedicated wiki page: systems/vtbackup. Uses the Vitess `builtin` backup engine (physical, not `mysqlshell` logical). Stores to Amazon S3 or Google GCS. Backups are encrypted at rest. `builtin` is PlanetScale's production default; the `mysqlshell` engine (Slack-contributed, Vitess 21 experimental) is the logical-backup alternative.
- PlanetScale Singularity is the internal infra service that spins up ephemeral backup compute. New canonical wiki page: systems/planetscale-singularity. The post is the first public architectural disclosure naming Singularity as the PlanetScale-internal compute-orchestration layer under the backup workflow. It's the primitive that makes "spin up a fresh instance per backup" a cheap operation rather than a per-backup capacity-planning exercise.
- Backups are multi-role production infrastructure, not just disaster recovery. Three roles canonicalised: (a) new-replica seeding after primary failure (the replica is restored from backup, then caught up via replication, avoiding re-replicating the full DB from the primary); (b) point-in-time recovery via Vitess PITR (backup + binlog tail); (c) accidental-deletion recovery via backup restoration to a development branch for cherry-picking — the hard-delete escape hatch that makes app-layer soft-delete optional.
- Restore inherits the same shard-parallel property as backup. Each shard restores independently. Turns a multi-day full-database restore into hours. "Though full database restores should be a less common operation, it's good to know that it can be accomplished quickly in case of an emergency." The symmetry of backup and restore parallelism is the load-bearing availability property — an unsharded 20 TB restore that takes days is a measurably worse outage than a sharded 20 TB restore that takes ~2 h.
- 12-hour default backup cadence is configurable. PlanetScale's internal API routinely checks for pending scheduled backups and starts them when necessary; the user-visible Backups page in the PlanetScale UI exposes both the schedule and on-demand trigger. Self-managed Vitess clusters configure the backup engine + storage service directly — the canonical docs pointer is https://vitess.io/docs/user-guides/operating-vitess/backup-and-restore/overview/.
- PlanetScale primary-replica architecture is re-canonicalised as the default unsharded-Base-plan shape: "By default, all query traffic is handled by the primary MySQL instance. The replicas primarily exist for high availability, but can also be used for handling read queries if desired." — reinforces the per-shard replica-set pattern from the 2025-01-09 sharding primer.
Architectural numbers¶
- 30 min 40 s — unsharded 161 GB backup (8 vCPU + 32 GB RAM primary + 2 replicas); previous backup was 163 GB.
- ~176 MB/s — aggregate throughput for the unsharded case, computed as `(prev_backup_size + new_backup_size) / duration = (163 GB + 161 GB) / 30 min 40 s`. Dicken flags this as a rough approximation that doesn't account for compression, catchup replication, schema, or throttling.
- 1 h 39 min 4 s — 20 TB sharded across 32 shards (each ~625 GB); per-shard resources comparable to the 161 GB single-shard example.
- ~6.7 GB/s aggregate / ~210 MB/s per shard — 20 TB/32 shard throughput.
- 3 h 37 min 11 s — ~230 TB sharded across 256 shards (each ~900 GB).
- ~35 GB/s aggregate / ~137 MB/s per shard — large-shard-count throughput.
- 12 h — default backup cadence; configurable.
- 63 h — naïve extrapolation of what the 20 TB backup would take unsharded (the "completely unacceptable" counterfactual the sharded case avoids).
- 2 — default replica count per shard on PlanetScale's unsharded Base plan.
- 1 primary + N replicas + 1 ephemeral VTBackup instance — per-shard compute shape during backup.
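The rough-approximation formula reproduces all three measured cases (decimal units, 1 GB = 10⁹ B; the 256-shard database size is back-computed here as ~228 TB from the quoted ~35 GB/s, consistent with the post's implied ~230 TB figure, and is an assumption):

```python
GB, TB = 10**9, 10**12

def agg_throughput(prev_bytes, new_bytes, seconds):
    # The post's rough formula: total bytes moved over wall-clock duration.
    return (prev_bytes + new_bytes) / seconds

unsharded = agg_throughput(163 * GB, 161 * GB, 30 * 60 + 40)        # ~176 MB/s
s32 = agg_throughput(20 * TB, 20 * TB, 1 * 3600 + 39 * 60 + 4)      # ~6.7 GB/s
s256 = agg_throughput(228 * TB, 228 * TB, 3 * 3600 + 37 * 60 + 11)  # ~35 GB/s

per_shard_32 = s32 / 32      # ~210 MB/s
per_shard_256 = s256 / 256   # ~137 MB/s
```

As Dicken stresses, these are rough numbers: the formula ignores compression, catchup replication, schema structure, and throttling.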
Caveats¶
- Pedagogical voice, not architectural deep-dive. The post is built around worked numerical examples and a screen-reader-style walkthrough of the PlanetScale backup UI + seven-step architecture. Missing: (a) internal mechanics of VTBackup (how it coordinates with VTGate, how checkpoint-in-time is expressed — is it a GTID? a timestamp? both?); (b) how Singularity chooses where to spin up the backup instance (same AZ as primary? cross-AZ? uses spot?); (c) what happens if the backup-instance restore fails partway; (d) what happens if catchup falls behind binlog retention; (e) per-shard crash-recovery semantics if one shard's backup fails in the middle of a cluster-wide backup.
- "Aggregate throughput" formula is a rough approximation. Dicken explicitly flags: "The formula above is only to be used for rough approximations. It is not a precise calculation. It does not take into account data compression, catch-up replication, schema structure, throttling, and other factors that affect data size and backup speed." The 6.7 GB/s and 35 GB/s numbers are illustrative, not a rigorously-measured production benchmark.
- Primary-as-replication-source trade-off is not quantified. Dicken says having the backup server replicate from the primary is "typically acceptable from a performance perspective" and gives schedule-during-low-traffic as the mitigation, but provides no measured data on primary CPU/IO/bandwidth impact during backups. Canonical framing is present; canonical numbers are not.
- Storage-side failure modes not covered. S3/GCS retrieval during restore can fail; backup upload can fail; encryption-key rotation / compromise-recovery is not discussed. These are real production concerns for any managed-database backup pipeline and are deferred.
- No backup-verification discipline disclosed. The post doesn't describe whether PlanetScale periodically restores backups to verify integrity, what checksum / hash validation is performed, or how stale/corrupt backups are detected. Canonical absence — PlanetScale may do this, it's just not disclosed.
- Restore parallelism claim is asserted, not numerically demonstrated. "The speed benefits that one gains from backing up in parallel with sharding also apply in reverse when performing a full database restore" is structurally true (each shard restores independently) but no restore duration is given for the 20 TB or 230 TB cases, unlike the backup durations.
- "Restore to a development branch, then browse and cherry-pick" — the UX for accidental-delete recovery is mentioned but not architecturally disclosed. How branches consume a backup, how row-level cherry-picking works at the SQL layer, what cut-over semantics exist — all deferred.
- Vitess's non-`builtin` engines not discussed. Vitess 21's Slack-contributed `mysqlshell` engine supports logical backups, incremental backups, and PITR. The PlanetScale production choice (`builtin` = physical) is stated but its trade-offs vs logical backups are not analysed here.
- First-backup cost is not broken out. The seven-step flow assumes a previous backup exists (the common case). The first-ever backup has to send the full DB contents from the primary to VTBackup, which is quantitatively different from subsequent backups. Post doesn't quantify the first-backup-vs-subsequent-backup gap.
- The post's backup-scheduling story is production-voice, not algorithm. "Our internal PlanetScale API routinely checks for pending scheduled backups and starts them when necessary" — no cron granularity, no fairness discipline across customers, no queue shape disclosed.
Source¶
- Original: https://planetscale.com/blog/faster-backups-with-sharding
- Raw markdown: raw/planetscale/2026-04-21-faster-backups-with-sharding-13b8db79.md
Related¶
- systems/vitess — the substrate; VTGate as the replication endpoint for catchup; `builtin` as the backup engine.
- systems/mysql — the storage engine; backup is a MySQL-on-ephemeral-instance snapshot.
- systems/planetscale — the vendor voice and product context.
- systems/planetscale-metal — Metal clusters inherit the same backup architecture (direct-attached NVMe per shard).
- systems/vtbackup — the Vitess program orchestrating the per-shard backup (new canonical page).
- systems/planetscale-singularity — PlanetScale's internal infra-management service (new canonical page).
- systems/aws-s3 / systems/google-cloud-storage — backup destinations depending on cloud.
- systems/vitess-vreplication — the replication primitive used by VTBackup for catchup.
- systems/vitess-mysqlshell-backup — the alternative (logical, experimental, Slack-contributed) engine.
- concepts/shard-parallel-backup — the canonical property; this post is its canonical wiki disclosure.
- concepts/horizontal-sharding — the parent architecture.
- concepts/gtid-position / concepts/binlog-replication — the catchup primitives.
- concepts/logical-vs-physical-backup — PlanetScale uses physical (`builtin`); `mysqlshell` is logical.
- concepts/point-in-time-recovery — the PITR-via-backup-plus-binlog feature.
- concepts/replica-creation-from-backup — the non-obvious load-bearing use of backups.
- concepts/backup-encryption-at-rest — disclosed invariant.
- concepts/soft-delete-vs-hard-delete — backup-as-hard-delete-escape-hatch framing.
- concepts/primary-vs-replica-as-replication-source — PlanetScale's named trade-off.
- patterns/dedicated-backup-instance-with-catchup-replication — new canonical pattern (this post).
- patterns/shard-parallel-backup-and-restore — the companion pattern (backup + restore both parallelise).
- patterns/snapshot-plus-catchup-replication — the generic primitive this backup architecture composes on.
- patterns/per-shard-replica-set — the pre-existing per-shard HA primitive.
- companies/planetscale — the vendor.