PATTERN Cited by 2 sources

Validated backup via restore-replay

Problem

Backups that are written to durable storage but never actually restored fail silently. The failure modes are legion: silent corruption, missing files, incompatible binlog format across versions, encryption-key loss, broken log-sequence chains, compression-format drift. Schrödinger's backup — the state of any backup is unknown until you try to restore it — is folklore wisdom precisely because this class of failure is routine.

The standard monitoring answer ("we run a restore drill every quarter") catches the problem only if the drill is actually performed and only for one point in time. What ops teams want is the stronger invariant: every backup has been restored.

Solution

Make the backup creation pipeline itself a restore. Instead of snapshotting the live primary, stand up a dedicated node, restore the previous backup into it, catch it up to current primary state via binlog replication, and snapshot that caught-up node as the new backup. Any failure during the restore or catch-up phase flags the previous backup as unhealthy and triggers remediation.

Under this design, every published backup has been restored at least once as part of being produced — there is no "we've never tried this" state, ever.

The PlanetScale instance

Brian Morrison II's canonical framing (2024-01-24):

"While both PlanetScale and Aurora support automated backups, we also validate the backups of our databases automatically every single time a new backup is created. This is only possible because we use the traditional approach for MySQL replication. Instead of creating a fresh snapshot of your database every time a backup is performed, we restore the most recent backup of your database to a special MySQL node in the cluster that's dedicated to this process. Once the backup is restored, we use the built-in MySQL replication to copy the latest changes into this node before creating a new backup. If a backup is unhealthy, this process will fail and a fresh backup will be triggered to take its place. By following this process, you can always be confident that backups on our platform are validated and healthy to restore from." (Source: sources/2026-04-21-planetscale-planetscale-vs-amazon-aurora-replication)

Mechanics

  1. Dedicated backup-taking node — a MySQL node in the cluster whose only job is backup pipeline (see patterns/dedicated-backup-instance-with-catchup-replication).
  2. Restore the most recent backup into this node. This is the validation step — a corrupt backup, version mismatch, or missing file surfaces here and the whole cycle fails.
  3. Catch up via binlog replication from the current primary: bring the restored node up to date with writes committed since the prior backup was taken.
  4. Snapshot the caught-up node as the new backup.
  5. Fail-closed on error: if restore fails, catch-up gets stuck, or snapshot fails, mark the prior backup as unhealthy and trigger a fresh full-backup cycle from the primary as a fallback.
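The control flow of the five steps above can be sketched as follows. This is a hypothetical illustration, not PlanetScale's implementation: the step functions (`restore`, `catch_up`, `snapshot`) are injected as callables so the fail-closed logic is visible without a real MySQL cluster behind it.

```python
def backup_cycle(restore, catch_up, snapshot, mark_unhealthy, full_backup):
    """Run one restore-replay backup cycle; fail closed on any error.

    Returns the new backup on success. On any failure, marks the prior
    backup unhealthy and falls back to a fresh full backup from primary.
    All five arguments are caller-supplied callables (assumed interfaces,
    not a real API).
    """
    try:
        node = restore()           # step 2: restore prior backup -- the validation step
        catch_up(node)             # step 3: binlog replication up to current primary state
        return snapshot(node)      # step 4: snapshot the caught-up node as the new backup
    except Exception:
        mark_unhealthy()           # step 5: flag the prior backup as unhealthy
        return full_backup()       # fallback: fresh full backup from the primary
```

The key property is that the `try` body cannot publish a new backup without the restore having succeeded first, so "every published backup has been restored at least once" falls out of the control flow rather than a separate check.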

Consequences

  • Every backup has been proven restorable — the strongest practical backup-integrity guarantee.
  • Corruption / incompatibility caught immediately — not at the next DR drill three months from now.
  • Reuses the replication substrate — no separate validation infrastructure; the backup-taking node is effectively just another replica.
  • Requires traditional replication substrate — doesn't apply to shared-storage substrates like Aurora's redo-log forwarding, where there's no natural place to restore into for validation.
  • Backup throughput is bounded by restore speed — for very large databases, one-restore-per-backup-cycle may be the binding constraint on backup frequency.
  • Dedicated backup node is extra capacity — a full-sized instance (or close to it) for an offline role; amortised across the cluster but not free.
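To make the restore-speed bound concrete, a back-of-envelope calculation (illustrative numbers only, not from the source): a cycle cannot repeat faster than restore time plus catch-up time.

```python
def min_cycle_hours(db_size_gb, restore_mb_per_s, catchup_hours=0.5):
    """Lower bound on backup-cycle duration under the restore-replay design.

    Assumes restore throughput is the bottleneck; catchup_hours is an
    assumed allowance for binlog replay and snapshotting.
    """
    restore_hours = (db_size_gb * 1024) / restore_mb_per_s / 3600
    return restore_hours + catchup_hours

# e.g. a 10 TB database restored at 500 MB/s needs ~5.8 h before catch-up,
# so a sub-6-hour backup cadence is already infeasible at that size.
```

The numbers are hypothetical, but the shape of the constraint is the point: backup frequency scales inversely with database size for a fixed restore throughput.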

Why "only possible on traditional replication"

Morrison makes the architectural claim explicitly: this validation approach "is only possible because we use the traditional approach for MySQL replication". Aurora's substrate forwards redo-log records to a distributed storage fabric, so there is no independent replica to restore a backup into. Aurora backups are storage-layer snapshots that the platform trusts without proving restorability via an actual restore. The validation gap is structurally inherent to the substrate choice, not a product decision.

Seen in

  • sources/2026-04-21-planetscale-planetscale-vs-amazon-aurora-replication — Brian Morrison II (PlanetScale, 2024-01-24). Canonical wiki disclosure of PlanetScale's restore-replay backup validation pipeline + the architectural claim that this approach is only feasible on traditional binlog-replicated clusters.
  • sources/2026-04-21-planetscale-scaling-hundreds-of-thousands-of-database-clusters-on-kubernetes — Brian Morrison II, 2023-09-27. Earlier (by 4 months) restatement of the same pattern with different framing: "we actually utilize Vitess to create a special type of tablet that's ONLY used for backing data up… restore the latest version of the backup to this tablet, replicate all of the changes that have occurred since the backup was taken, and then create a brand new backup based on that data." Casts the two wins as (a) no production MySQL performance impact and (b) automatic backup validation: "this… has the added benefit of automatically validating your existing backups on PlanetScale."