
PATTERN Cited by 1 source

Dedicated backup instance with catchup replication

Problem

A production database's backup has two conflicting requirements:

  • Consistent — the backup must be a point-in-time snapshot (or equivalent) so restore produces a usable database.
  • Low production impact — reading N TB of data off the production primary for each backup is expensive in CPU, I/O, and bandwidth, and scales badly with DB size.

Two obvious strategies fail:

  • Dump from primary — consistent (with careful locking or MVCC snapshots) but expensive every cycle: the primary must read and send the full DB contents to the backup host.
  • Dump from an existing replica — spares the primary but consumes replica capacity during every backup, and the replica still reads and sends the full contents each cycle. Replicas that also serve reads can degrade under backup load.

Solution

Spin up a fresh ephemeral compute instance per shard per backup cycle. Seed it from the previous backup in object storage. Catch it up from the primary's binlog. Take the new backup off the ephemeral instance. Destroy the instance when the backup finishes.

The key insight: after the first-ever backup exists in object storage, every subsequent backup only needs the catchup delta (typically 12–24 h of binlog events) to be sent from the primary — not the full DB. The full backup contents are re-sent only once.
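The bandwidth saving can be illustrated with a rough cost model. This is a sketch: the 12–24 h delta window comes from the post, but the shard size and churn numbers below are made-up assumptions.

```python
def gb_sent_from_primary(db_size_tb: float, daily_churn_gb: float,
                         hours_since_last_backup: float,
                         is_first_backup: bool) -> float:
    """Rough model (in GB) of data the primary sends for one backup cycle.

    First-ever backup: full database contents. Every subsequent backup:
    only the binlog delta accumulated since the previous backup.
    """
    if is_first_backup:
        return db_size_tb * 1024  # full contents, paid exactly once
    return daily_churn_gb * hours_since_last_backup / 24  # catchup delta

# Hypothetical 10 TB shard with 50 GB/day of binlog churn:
first = gb_sent_from_primary(10, 50, 24, is_first_backup=True)   # 10240 GB
daily = gb_sent_from_primary(10, 50, 24, is_first_backup=False)  # 50 GB
```

The asymmetry is the whole pattern: the per-cycle cost tracks churn, not database size.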

Mechanism (PlanetScale production instance)

Per the 2024-07-30 PlanetScale disclosure (Source: sources/2026-04-21-planetscale-faster-backups-with-sharding):

  1. Internal API initiates the backup request (scheduled or user-triggered).
  2. PlanetScale Singularity spins up a fresh compute instance in the same cluster as the shard's primaries and replicas.
  3. The instance runs VTBackup, which fetches the previous backup from S3/GCS via the Vitess builtin backup engine and restores it locally. (Backups are encrypted at rest and decrypted on arrival.)
  4. VTBackup spins up a MySQL on the restored data.
  5. That MySQL connects to the primary VTGate and requests a checkpoint-in-time. All binlog events from the previous-backup position up to the checkpoint replicate over.
  6. Catchup completes; the catchup-MySQL is stopped.
  7. Regular Vitess backup workflow runs on the local data, writing the new full backup to S3/GCS.

The new backup is consistent as of the checkpoint-in-time — which is arbitrarily close to the wall-clock time of the backup operation, bounded only by the catchup duration (which is proportional to the binlog delta, not the DB size).
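The seven steps above can be sketched as pseudocode. Every function name here is a hypothetical stand-in for the internal APIs the post describes, not a real Vitess interface:

```
def backup_shard(shard, object_store, primary):
    """One backup cycle for one shard (sketch of steps 2-7)."""
    instance = provision_ephemeral_instance(cluster_of(shard))      # step 2
    try:
        prev = object_store.fetch_latest_backup(shard)              # step 3
        instance.restore(decrypt(prev))
        mysql = instance.start_mysql()                              # step 4
        checkpoint = primary.checkpoint_position()                  # step 5
        mysql.replicate(from_=prev.position, until=checkpoint)
        mysql.stop()                                                # step 6
        object_store.write_backup(shard, instance.snapshot())       # step 7
    finally:
        instance.destroy()  # ephemeral: nothing survives but the backup
```

The `try/finally` shape captures the durability argument: if the instance dies mid-cycle, the previous backup in object storage and the primary's binlog are untouched, so the cycle simply reruns.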

Why it composes well

  • Primary sees only the catchup delta. Every backup after the first avoids full-DB reads from any production node.
  • Ephemeral instance isolates the backup workload. No interference with production reads or replication topology.
  • Storage sits between backups. The "previous backup in S3" is the durable hand-off between consecutive backup cycles; losing an ephemeral instance mid-backup doesn't lose any committed data because the primary's binlog is still there.
  • Parallelises across shards trivially. Each shard runs its own independent pattern instance; no coordinator. Canonical production instance of concepts/shard-parallel-backup.
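The shard-parallel property is just independent tasks with no shared state. A minimal sketch, assuming a hypothetical `backup_one` per-shard routine and a 4-shard keyspace:

```python
from concurrent.futures import ThreadPoolExecutor

def backup_one(shard_id: int) -> str:
    # Placeholder for the full per-shard cycle: provision, restore,
    # catch up, snapshot, destroy. No cross-shard coordination needed.
    return f"backup-{shard_id}-done"

shards = range(4)  # illustrative; PlanetScale runs up to 256 shards
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    results = list(pool.map(backup_one, shards))
```

Because no coordinator exists, wall-clock backup time stays roughly flat as shard count grows, which is the scaling property the post measures.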

Primary-as-replication-source trade-off

The post names this explicitly: catchup reads from the primary VTGate, not from a replica.

"If taken from a primary, it will have the most up to date information. If taken from the replica, we avoid sending additional compute, I/O, and bandwidth demand to the primary server. However, in our case, the primary is already performing replication to two other nodes. Also, unless it is the first backup, the primary does not need to send the full database contents to the backup server. It only needs to send what has changed since the last backup, ideally only 12 or 24 hours prior."

Canonical framing: use the primary when the delta is small; use a replica when the primary is resource-constrained. PlanetScale's choice reflects their default shape (1 primary + 2 replicas per shard; primary already doing 2 replication streams; catchup delta is small).

The mitigation, if the primary-load impact becomes problematic: "backups can be scheduled to happen during lower traffic hours."
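The trade-off reduces to a simple decision rule. This is a sketch only: the 100 GB threshold is an illustrative assumption, not a number from the post.

```python
def catchup_source(delta_gb: float, primary_has_headroom: bool) -> str:
    """Pick the replication source for catchup (illustrative rule only)."""
    if primary_has_headroom and delta_gb < 100:  # threshold is an assumption
        return "primary"  # freshest position; delta is cheap to send
    return "replica"      # spare a loaded primary, spend replica capacity

# PlanetScale's default shape: small daily delta, primary has headroom.
source = catchup_source(delta_gb=50.0, primary_has_headroom=True)  # "primary"
```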


When to use

  • Multi-TB databases where reading full contents from the primary every cycle is unacceptable.
  • Sharded architectures where per-shard parallelism is available.
  • Deployments with low-latency, high-throughput object storage (S3, GCS) as the backup destination.
  • Systems where ephemeral compute can be spun up cheaply (Kubernetes, cloud auto-scaling, an internal primitive like Singularity).

When not to use

  • Single-instance databases small enough that a direct primary dump is tolerable (the overhead of spinning up + restoring dominates the actual backup work).
  • Environments without object storage (this pattern relies on the object store as the durable hand-off).
  • First-backup bootstrap — the pattern collapses to "dump from primary" for the first ever backup, since there's no previous backup to seed from.

Canonical wiki instance

PlanetScale's backup architecture for both MySQL (Vitess) and Postgres (Neki) clusters. VTBackup is the Vitess program implementing the pattern; PlanetScale Singularity is the ephemeral-compute provisioner. Used on all PlanetScale databases, including the largest sharded instances (256 shards) running at 35 GB/s aggregate.

Seen in

  • sources/2026-04-21-planetscale-faster-backups-with-sharding — canonical wiki disclosure. The seven-step choreography is documented explicitly. Measured production performance across 1, 32, and 256 shards validates the per-shard-parallel scaling property. Verbatim: "When working with a sharded database, each shard can complete steps 2-7 in parallel. This parallelization allows backups to be taken quickly, even for extremely large databases."