
CONCEPT

Automated backup validation

Definition

Automated backup validation is the property that every backup produced by a database platform is proven to be restorable as part of the backup creation pipeline — not merely written to durable storage and assumed correct. The strongest form proves restorability by actually restoring the backup to a dedicated node and snapshotting that node.

This closes the most common silent failure mode in disaster recovery: backups that exist, look fine, and are unrestorable when you need them (due to corruption, missing files, encryption-key loss, version-incompatibility, etc.). Schrödinger's backup — the state of any backup is unknown until you try to restore it — is folklore wisdom in the ops community because this failure mode is that common.

The PlanetScale instance — restore + replay

Brian Morrison II's canonical PlanetScale framing (2024-01-24):

"While both PlanetScale and Aurora support automated backups, we also validate the backups of our databases automatically every single time a new backup is created. This is only possible because we use the traditional approach for MySQL replication. Instead of creating a fresh snapshot of your database every time a backup is performed, we restore the most recent backup of your database to a special MySQL node in the cluster that's dedicated to this process. Once the backup is restored, we use the built-in MySQL replication to copy the latest changes into this node before creating a new backup. If a backup is unhealthy, this process will fail and a fresh backup will be triggered to take its place. By following this process, you can always be confident that backups on our platform are validated and healthy to restore from."

The pipeline is:

  1. Provision a dedicated backup-taking node in the cluster.
  2. Restore the most recent backup into it (this is the validation step — a corrupt backup fails here).
  3. Catch up the restored node via MySQL binlog replication to the current primary state.
  4. Snapshot the caught-up node as the new backup.
  5. Fail-closed on validation error: if any step fails, the existing backup is flagged as unhealthy and a fresh backup is triggered.
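
The steps above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not PlanetScale's implementation: the `Backup` record, the integer `position` standing in for a binlog position, and every function name are hypothetical, and the restore and replay steps are simulated rather than run against MySQL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backup:
    # Hypothetical record: `position` stands in for the binlog position captured.
    backup_id: int
    position: int
    corrupt: bool = False           # simulates a backup that cannot be restored
    healthy: Optional[bool] = None  # None means "not yet restore-tested"


class RestoreError(Exception):
    """Raised when a backup cannot be restored onto the dedicated node."""


def restore_to_dedicated_node(backup: Backup) -> int:
    # Step 2: restoring is the validation; a corrupt backup fails here.
    if backup.corrupt:
        raise RestoreError(f"backup {backup.backup_id} did not restore")
    return backup.position


def replay_binlogs(node_position: int, primary_position: int) -> int:
    # Step 3: catch the restored node up to the primary (simulated).
    return max(node_position, primary_position)


def run_backup_cycle(backups: List[Backup], primary_position: int) -> Backup:
    latest = backups[-1]
    try:
        node_position = restore_to_dedicated_node(latest)
        latest.healthy = True
        node_position = replay_binlogs(node_position, primary_position)
    except RestoreError:
        # Step 5: flag the bad backup and take a fresh one from current state.
        latest.healthy = False
        node_position = primary_position
    # Step 4: snapshot the caught-up node as the new backup.
    new_backup = Backup(backup_id=latest.backup_id + 1, position=node_position)
    backups.append(new_backup)
    return new_backup
```

In this sketch a backup's `healthy` flag is set only when the following cycle tries to restore it, so the newest backup is always the one still marked `None`.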

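Under this design, every backup is restored when the next backup cycle runs, so the only backup that has not yet been restore-tested is the newest one, and it was itself built from a backup that had just restored cleanly. No older backup sits in a "we've never tried this" state.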

See patterns/validated-backup-via-restore-replay for the full pattern page and patterns/dedicated-backup-instance-with-catchup-replication for the related architectural shape.

Why "only possible because we use traditional replication"

Morrison's framing is that the approach is only feasible on traditional MySQL binlog-replicated clusters — because you need:

  • A full-copy replica substrate to stand up a restoration target whose only job is backup-taking.
  • Binlog replication to catch the restored node up to the current primary state.
  • The backup format to be a byte-for-byte restorable snapshot of MySQL's data directory (not an incremental log of segment acks against a distributed storage appliance).

Aurora's substrate — redo-log-forwarding to distributed storage segments with read-only compute nodes that don't hold their own data copies — doesn't admit this validation shape naturally; Aurora's backups are storage-layer snapshots that the platform trusts without the restore-replay proof.

Seen in

  • Brian Morrison II (PlanetScale, 2024-01-24). Canonical wiki disclosure of PlanetScale's restore-replay backup validation pipeline and the architectural claim that this approach is only feasible on traditional binlog-replicated clusters.
  • Shortest-form customer-facing framing. Sam Lambert (PlanetScale CEO, 2023-06-28) names automated backup validation as the tier-6 layer in PlanetScale's seven-layer data-safety envelope. Verbatim: "All PlanetScale databases have a mandatory backup schedule included with every database plan at no additional cost. … To ensure our backups are valid, each new mandatory backup restores from a previous backup to validate that it was taken properly and ensure that there is always at least one healthy backup before your database's binary logs are rotated out." Canonical wiki framing: five orthogonal properties named in one sentence: mandatory (not optional) + every-plan (not premium-tier) + no-additional-cost (not surcharged) + restore-replay-validated (not just written-to-durable-storage) + binlog-rotation-gated (the load-bearing operational property). The binlog-rotation-gated property is subtle: binary logs are rotated out only once there is a validated backup that includes the data they describe, so the validation pipeline acts as a binlog-retention gate, preventing the operational footgun where binlogs rotate out before the next backup is validated. Lambert's "application bugs that delete data and can go undetected for a long time" motivation names the slow-bug class as distinct from fast-incident recovery.
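
The binlog-rotation gate described above can be illustrated with a short sketch. It assumes a simplified model in which each binlog file is identified by a name and the position of its last event, and a validated backup is identified by the position it covers. The function name and data shapes are hypothetical and do not reflect PlanetScale's tooling.

```python
from typing import List, Optional, Tuple


def purgeable_binlogs(
    binlogs: List[Tuple[str, int]],
    validated_backup_position: Optional[int],
) -> List[str]:
    """Return binlog files that are safe to rotate out.

    `binlogs` holds (file name, position of the file's last event) pairs.
    A file is purgeable only if a validated backup already covers every
    event in it; with no validated backup, nothing may be purged.
    """
    if validated_backup_position is None:
        return []
    return [
        name
        for name, end_position in binlogs
        if end_position <= validated_backup_position
    ]
```

For example, with files ending at positions 100, 200, and 300 and a validated backup at position 200, only the first two files are eligible for rotation; the third is retained until a later backup has been validated.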