
PATTERN

Checkpoint backup to object storage

Problem

A distributed state-distribution system (gossip, CRDT, replicated database) can enter a state where reasoning about the live data is harder than rebuilding it from scratch — contagious deadlocks, schema-migration amplification, retry storms, or simply accumulated corruption. In such moments you need a break-glass option: rebuild the cluster from a known-good state quickly.

Pattern

Maintain periodic full-database checkpoints in object storage (S3, R2, GCS, Tigris, etc.) as a side-path to the live state-distribution mechanism. Properties the pattern needs:

  1. Checkpoints are whole-database snapshots, not deltas — they must be restorable without replay of intermediate state.
  2. Object storage is available independently of the live cluster; its failure modes are uncorrelated with the cluster's.
  3. Restore is documented and rehearsed. The checkpoint is worth what you can actually restore from it in an incident.
  4. Staleness is acceptable. The restore point is whenever the last checkpoint finished; missing updates are re-hydrated from authoritative sources (in Corrosion's case, workers re-publish their Machine state on startup).
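Property 1 can be sketched for a SQLite-backed store (Corrosion's case) with SQLite's `VACUUM INTO`, which writes a consistent, self-contained copy in one statement. This is a minimal sketch, not Corrosion's actual code: `write_checkpoint` is a hypothetical name, the upload step is elided, and a local directory stands in for the bucket (an S3/Tigris client call would replace the local write).

```python
import hashlib
import sqlite3
import time
from pathlib import Path

def write_checkpoint(db_path: str, out_dir: str) -> Path:
    """Write a whole-database snapshot (property 1: restorable without replay)."""
    out = Path(out_dir) / f"checkpoint-{int(time.time())}.db"
    con = sqlite3.connect(db_path)
    try:
        # VACUUM INTO produces a consistent, self-contained copy of the
        # live database; no WAL or intermediate deltas needed to restore it.
        con.execute("VACUUM INTO ?", (str(out),))
    finally:
        con.close()
    # Store a checksum next to the snapshot so the restore path can
    # verify the object before trusting it.
    digest = hashlib.sha256(out.read_bytes()).hexdigest()
    (out.parent / (out.name + ".sha256")).write_text(digest)
    return out
```

Writing the checksum as a sibling object means restore never has to trust a possibly stale listing alone; it can verify the bytes it actually fetched.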

Why it works

  • Uncorrelated failure domain — a gossip cluster meltdown doesn't touch S3.
  • Cheap — periodic snapshots to cold storage are orders of magnitude cheaper than a warm-standby cluster.
  • Simple — no distributed-consensus reasoning required to restore. "Zip up a file, put it in a bucket."
  • Complements the live distribution layer — the live layer handles normal operation; object storage handles the catastrophic case.
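The "zip up a file, put it in a bucket" simplicity extends to the restore side. A hedged sketch of the break-glass path, continuing the local-directory stand-in and assuming the hypothetical `checkpoint-<epoch>.db` naming with a sibling `.sha256` object from the previous sketch:

```python
import hashlib
import shutil
from pathlib import Path

def restore_latest(checkpoint_dir: str, target_db: str) -> Path:
    """Break-glass restore: copy the newest verified checkpoint into place."""
    # Epoch-second filenames sort lexicographically, so the last glob hit
    # is the newest checkpoint.
    candidates = sorted(Path(checkpoint_dir).glob("checkpoint-*.db"))
    if not candidates:
        raise RuntimeError("no checkpoints available")
    latest = candidates[-1]
    # Verify the fetched bytes against the recorded checksum before
    # trusting them (guards against partial or stale reads).
    expected = (latest.parent / (latest.name + ".sha256")).read_text().strip()
    actual = hashlib.sha256(latest.read_bytes()).hexdigest()
    if actual != expected:
        raise RuntimeError(f"checksum mismatch for {latest.name}")
    shutil.copyfile(latest, target_db)
    # From here, missing updates re-hydrate from authoritative sources
    # (in Corrosion's case, workers re-publish their state on startup).
    return latest
```

No consensus round is involved: one operator, one verified object, one copy.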

Canonical wiki instance — Fly.io Corrosion

From sources/2025-10-22-flyio-corrosion:

"No amount of testing will make us trust a distributed system. So we've made it simpler to rebuild Corrosion's database from our workers. We keep checkpoint backups of the Corrosion database on object storage. That was smart of us. When shit truly went haywire last year, we had the option to reboot the cluster, which is ultimately what we did. That eats some time (the database is large and propagating is expensive), but diagnosing and repairing distributed systems mishaps takes even longer."

The post doesn't name the object store (Tigris or S3 are plausible). Checkpoint-and-restore was the escape hatch used during the nullable-column-amplification incident: the cluster was rebooted rather than repaired.

Caveats

  • Checkpoint cadence matters — too infrequent and you lose more state on restore; too frequent and you pay storage + I/O overhead. Tune against recovery-time vs staleness.
  • Restore time is real — rehydrating a large gossip cluster takes significant time and network bandwidth. Don't design assuming it's a seconds-scale operation.
  • Schema compatibility — the checkpoint format must round-trip across code versions (especially if you rewind past a schema migration).
  • Some object-storage operations (listings, overwrite visibility) are only eventually consistent on some providers; make sure the restore path is robust to stale reads, e.g. by verifying a checksum before restoring.
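The cadence trade-off in the first caveat is rough arithmetic: worst-case staleness on restore is one full interval plus however long a checkpoint takes to write, while storage cost grows with retained copies. A hypothetical calculator with illustrative numbers (the default per-GB price is a placeholder, not a quote):

```python
def checkpoint_tradeoff(db_gb: float, interval_h: float, retention_days: int,
                        price_per_gb_month: float = 0.023) -> dict:
    """Rough cadence math: copies retained, storage footprint, monthly cost.

    Worst-case data loss on restore is roughly one interval_h of updates
    (plus checkpoint write time), independent of retention.
    """
    copies = retention_days * 24 / interval_h     # snapshots kept at once
    storage_gb = db_gb * copies                   # total cold-storage footprint
    monthly_cost = storage_gb * price_per_gb_month
    return {"copies_retained": copies,
            "storage_gb": storage_gb,
            "monthly_cost_usd": round(monthly_cost, 2)}
```

For a 100 GB database snapshotted every 6 hours with 7 days of retention, that's 28 retained copies, i.e. 2.8 TB of cold storage, against a worst-case staleness of about 6 hours.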

Seen in
