
PATTERN

Checkpoint backup to object storage

Problem

A distributed state-distribution system (gossip, CRDT, replicated database) can enter a state where reasoning about the live data is harder than rebuilding it from scratch — contagious deadlocks, schema-migration amplification, retry storms, or simply accumulated corruption. In such moments you need a break-glass option: rebuild the cluster from a known-good state quickly.

Pattern

Maintain periodic full-database checkpoints in object storage (S3, R2, GCS, Tigris, etc.) as a side-path to the live state-distribution mechanism. Properties the pattern needs:

  1. Checkpoints are whole-database snapshots, not deltas — they must be restorable without replay of intermediate state.
  2. Object storage is available independently of the live cluster; its failure modes are uncorrelated with the cluster's.
  3. Restore is documented and rehearsed. The checkpoint is worth what you can actually restore from it in an incident.
  4. Staleness is acceptable. The restore point is whenever the last checkpoint finished; missing updates are re-hydrated from authoritative sources (in Corrosion's case, workers re-publish their Machine state on startup).
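Property 1 can be sketched for a SQLite-backed store (Corrosion's case) with SQLite's `VACUUM INTO`, which writes a consistent, self-contained copy in one statement. This is a minimal sketch, not Corrosion's actual code: `write_checkpoint` is a hypothetical name, the upload step is elided, and a local directory stands in for the bucket (an S3/Tigris client call would replace the local write).

```python
import hashlib
import sqlite3
import time
from pathlib import Path

def write_checkpoint(db_path: str, out_dir: str) -> Path:
    """Write a whole-database snapshot (property 1: restorable without replay)."""
    out = Path(out_dir) / f"checkpoint-{int(time.time())}.db"
    con = sqlite3.connect(db_path)
    try:
        # VACUUM INTO produces a consistent, self-contained copy of the
        # live database; no WAL or intermediate deltas needed to restore it.
        con.execute("VACUUM INTO ?", (str(out),))
    finally:
        con.close()
    # Store a checksum next to the snapshot so the restore path can
    # verify the object before trusting it.
    digest = hashlib.sha256(out.read_bytes()).hexdigest()
    (out.parent / (out.name + ".sha256")).write_text(digest)
    return out
```

Writing the checksum as a sibling object means restore never has to trust a possibly stale listing alone; it can verify the bytes it actually fetched.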

Why it works

  • Uncorrelated failure domain — a gossip cluster meltdown doesn't touch S3.
  • Cheap — periodic snapshots to cold storage are orders of magnitude cheaper than a warm-standby cluster.
  • Simple — no distributed-consensus reasoning required to restore. "Zip up a file, put it in a bucket."
  • Complements the live distribution layer — the live layer handles normal operation; object storage handles the catastrophic case.
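The "zip up a file, put it in a bucket" simplicity extends to the restore side. A hedged sketch of the break-glass path, continuing the local-directory stand-in and assuming the hypothetical `checkpoint-<epoch>.db` naming with a sibling `.sha256` object from the previous sketch:

```python
import hashlib
import shutil
from pathlib import Path

def restore_latest(checkpoint_dir: str, target_db: str) -> Path:
    """Break-glass restore: copy the newest verified checkpoint into place."""
    # Epoch-second filenames sort lexicographically, so the last glob hit
    # is the newest checkpoint.
    candidates = sorted(Path(checkpoint_dir).glob("checkpoint-*.db"))
    if not candidates:
        raise RuntimeError("no checkpoints available")
    latest = candidates[-1]
    # Verify the fetched bytes against the recorded checksum before
    # trusting them (guards against partial or stale reads).
    expected = (latest.parent / (latest.name + ".sha256")).read_text().strip()
    actual = hashlib.sha256(latest.read_bytes()).hexdigest()
    if actual != expected:
        raise RuntimeError(f"checksum mismatch for {latest.name}")
    shutil.copyfile(latest, target_db)
    # From here, missing updates re-hydrate from authoritative sources
    # (in Corrosion's case, workers re-publish their state on startup).
    return latest
```

No consensus round is involved: one operator, one verified object, one copy.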

Canonical wiki instance — Fly.io Corrosion

From sources/2025-10-22-flyio-corrosion:

"No amount of testing will make us trust a distributed system. So we've made it simpler to rebuild Corrosion's database from our workers. We keep checkpoint backups of the Corrosion database on object storage. That was smart of us. When shit truly went haywire last year, we had the option to reboot the cluster, which is ultimately what we did. That eats some time (the database is large and propagating is expensive), but diagnosing and repairing distributed systems mishaps takes even longer."

The post doesn't name the object store (Tigris or S3 are plausible). Checkpoint-and-restore was the escape hatch used during the nullable-column-amplification incident: the cluster was rebooted rather than repaired.

Caveats

  • Checkpoint cadence matters — too infrequent and you lose more state on restore; too frequent and you pay storage + I/O overhead. Tune against recovery-time vs staleness.
  • Restore time is real — rehydrating a large gossip cluster takes significant time and network bandwidth. Don't design assuming it's a seconds-scale operation.
  • Schema compatibility — the checkpoint format must round-trip across code versions (especially if you rewind past a schema migration).
  • Some object-storage operations (listings, overwrite visibility) are only eventually consistent on some providers; make sure the restore path is robust to stale reads, e.g. by verifying a checksum before restoring.
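The cadence trade-off in the first caveat is rough arithmetic: worst-case staleness on restore is one full interval plus however long a checkpoint takes to write, while storage cost grows with retained copies. A hypothetical calculator with illustrative numbers (the default per-GB price is a placeholder, not a quote):

```python
def checkpoint_tradeoff(db_gb: float, interval_h: float, retention_days: int,
                        price_per_gb_month: float = 0.023) -> dict:
    """Rough cadence math: copies retained, storage footprint, monthly cost.

    Worst-case data loss on restore is roughly one interval_h of updates
    (plus checkpoint write time), independent of retention.
    """
    copies = retention_days * 24 / interval_h     # snapshots kept at once
    storage_gb = db_gb * copies                   # total cold-storage footprint
    monthly_cost = storage_gb * price_per_gb_month
    return {"copies_retained": copies,
            "storage_gb": storage_gb,
            "monthly_cost_usd": round(monthly_cost, 2)}
```

For a 100 GB database snapshotted every 6 hours with 7 days of retention, that's 28 retained copies, i.e. 2.8 TB of cold storage, against a worst-case staleness of about 6 hours.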

Seen in
