PATTERN Cited by 2 sources
Checkpoint as metadata clone
Problem
A storage or VM product wants first-class checkpoint /
restore — fast enough that users treat checkpoints like
git commit (cheap, often, ordinary) rather than
snapshot-restore (expensive, rare, emergency-only). Copying
bytes at checkpoint time scales linearly with dataset size and
blows the latency budget for anything larger than a few
gigabytes.
Pattern
On a metadata + chunk storage stack with immutable chunks:
- Checkpoint = snapshot the metadata DB at an instant. Content chunks are unchanged and shared across all checkpoints.
- Restore = re-point the running workload to the checkpoint's metadata. Content chunks don't move; the running VM transparently starts reading from the older chunk set.
Cost: O(metadata-size), not O(data-size). For Sprites-shape workloads (dozens-of-MB metadata DBs, hundreds-of-GB content tiers) the difference is 3-4 orders of magnitude.
Precondition: chunks must be immutable. Writes produce new chunks; old chunks referenced by prior checkpoints remain live.
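The two operations can be sketched in a few lines of Python. This is a minimal in-memory model, not Sprites' implementation: the names `ChunkStore` and `Volume`, the path-to-chunk-key dictionary standing in for the metadata DB, and SHA-256 content addressing are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkStore:
    """Hypothetical immutable, content-addressed chunk store."""
    chunks: dict[str, bytes] = field(default_factory=dict)

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Never overwrite an existing chunk: chunks are immutable.
        self.chunks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.chunks[key]


@dataclass
class Volume:
    """Hypothetical volume: live metadata plus named metadata snapshots."""
    store: ChunkStore
    metadata: dict[str, str] = field(default_factory=dict)  # path -> chunk key
    checkpoints: dict[str, dict[str, str]] = field(default_factory=dict)

    def write(self, path: str, data: bytes) -> None:
        # A write produces a new chunk and re-points the path to it.
        self.metadata[path] = self.store.put(data)

    def read(self, path: str) -> bytes:
        return self.store.get(self.metadata[path])

    def checkpoint(self, name: str) -> None:
        # Copies only the metadata map; no chunk bytes are touched.
        self.checkpoints[name] = dict(self.metadata)

    def restore(self, name: str) -> None:
        # Re-points the live metadata at the older chunk set.
        self.metadata = dict(self.checkpoints[name])


vol = Volume(ChunkStore())
vol.write("/etc/app.conf", b"v1 config")
vol.checkpoint("v1")
vol.write("/etc/app.conf", b"broken")   # new chunk; the old one stays live
vol.restore("v1")
assert vol.read("/etc/app.conf") == b"v1 config"
```

In this model both `checkpoint` and `restore` cost one copy of the metadata map, independent of how many bytes the chunks hold. The precondition is visible in `put`: if a write could modify an existing chunk in place, every checkpoint that references it would silently change.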
Canonical wiki instance: Fly.io Sprites
"This also buys Sprites fast checkpoint and restore. Checkpoints are so fast we want you to use them as a basic feature of the system and not as an escape hatch when things go wrong; like a git restore, not a system restore. That works because both checkpoint and restore merely shuffle metadata around." (Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
User-facing UX from the [[sources/2026-01-09-flyio-code-and-let-live|2026-01-09 launch post]]: break the filesystem
deliberately (rm -rf, dd random, pip3 install) → run
sprite checkpoint restore v1 → the Sprite is back to the
pre-damage state in ~1 second.
Ptacek's guidance — "our pre-installed Claude Code will checkpoint aggressively for you without asking" — leans on checkpoint-as-metadata-clone being cheap enough for agent-driven auto-checkpointing.
UX framing: git restore, not system restore
The cost profile of the pattern reshapes the product:
| Model | UX framing | When used |
|---|---|---|
| Slow checkpoint (O(data)) | system restore | Emergency only |
| Fast checkpoint (O(metadata)) | git restore | Routine workflow |
Both are checkpoint-restore, but the economics flip which workflows the feature fits into.
Adjacencies
- Git. Git's snapshot model is arguably the archetypal metadata-clone pattern: commits reference trees, which reference blobs by hash. Because objects are content-addressed and immutable, creating a branch or tag, or moving a ref back to an older commit, is a constant-time metadata operation.
- ZFS / BTRFS snapshots. Filesystem-level pattern with the same shape, at the block level.
- Copy-on-write memory. Process fork() in Unix is a metadata clone (page tables); data is shared until divergent writes materialise new pages.
- Docker image layering. Each image is metadata over immutable content-addressed layers; branching from a layer is metadata-cheap.
- Git-LFS, dvc. Metadata-pointer files track immutable content-addressed payloads.
Sprites is one of the first cases where a full VM-level disk is wrapped in this model for per-VM interactive checkpoint/restore.
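The fork() adjacency is the easiest one to observe directly. The sketch below (Unix-only; the function name `fork_demo` is hypothetical) shows the visible semantics: a write in the child never reaches the parent. The page sharing itself happens inside the kernel and cannot be seen from Python, and CPython's reference counting means more pages get copied than in a C program, so treat this as an illustration of the contract rather than a measurement of sharing.

```python
import os


def fork_demo() -> tuple[str, str]:
    """Return (parent's view, child's view) of a value the child overwrites."""
    state = {"value": "original"}
    read_fd, write_fd = os.pipe()

    pid = os.fork()  # clones page tables; data pages are shared copy-on-write
    if pid == 0:
        # Child: this write lands in the child's private copy of the page.
        os.close(read_fd)
        state["value"] = "changed in child"
        os.write(write_fd, state["value"].encode())
        os._exit(0)

    # Parent: collect what the child saw, then report its own view.
    os.close(write_fd)
    child_view = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return state["value"], child_view
```

The parent plays the role of the checkpoint here: its view of memory stays fixed while the child diverges by materialising new pages.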
Trade-offs
- Immutable-chunks GC. Chunks referenced only by deleted checkpoints must be reclaimed. The post doesn't describe Sprites' GC mechanism.
- Checkpoint retention. Every retained checkpoint roots a metadata snapshot. Storage cost = chunks-uniquely-retained + sum(metadata-snapshot-size).
- Memory-state checkpoints not addressed. The Sprites post is filesystem-level. Memory + process-state checkpointing is a separate (usually more expensive) operation (CRIU-shape).
- Pre-existing mutable data. If some tier of the storage stack is mutable in place, the pattern breaks — you'd have to either convert to immutable on checkpoint or refuse to checkpoint mutating data.
- Metadata-DB scale ceiling. Very large metadata DBs make "clone" itself nontrivial. SQLite-scale (MB) is comfortable; TB-scale metadata breaks the pattern.
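The GC and retention trade-offs can be made concrete with a mark-and-sweep sketch. Since the post does not describe Sprites' GC mechanism, everything here is an assumption: the function names, the dictionary-based data model (chunk key to bytes, path to chunk key), and the fixed `entry_bytes` estimate for the size of one metadata row.

```python
def live_chunks(checkpoints: dict[str, dict[str, str]],
                current: dict[str, str]) -> set[str]:
    """Mark phase: every chunk key reachable from live or retained metadata."""
    live = set(current.values())
    for snapshot in checkpoints.values():
        live.update(snapshot.values())
    return live


def sweep(chunks: dict[str, bytes],
          checkpoints: dict[str, dict[str, str]],
          current: dict[str, str]) -> list[str]:
    """Sweep phase: delete unreferenced chunks and return their keys."""
    live = live_chunks(checkpoints, current)
    dead = [key for key in chunks if key not in live]
    for key in dead:
        del chunks[key]
    return dead


def storage_cost(chunks: dict[str, bytes],
                 checkpoints: dict[str, dict[str, str]],
                 entry_bytes: int = 64) -> int:
    """Retained chunk bytes plus an estimate of metadata-snapshot bytes."""
    chunk_bytes = sum(len(data) for data in chunks.values())
    metadata_bytes = sum(len(snap) * entry_bytes for snap in checkpoints.values())
    return chunk_bytes + metadata_bytes


chunks = {"a": b"aaaa", "b": b"bb", "c": b"c"}
checkpoints = {"v1": {"/x": "a"}, "v2": {"/x": "b"}}
current = {"/x": "c"}

del checkpoints["v1"]                  # drop a checkpoint...
reclaimed = sweep(chunks, checkpoints, current)
assert reclaimed == ["a"]              # ...and only its unique chunk is freed
```

A chunk is reclaimable only when no retained checkpoint and no live metadata references it, which is why deleting a checkpoint frees just the chunks unique to it.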
Related patterns
- patterns/metadata-plus-chunk-storage-stack: the architectural precondition.
- patterns/checkpoint-backup-to-object-storage: a DR-level pattern (rebuild whole cluster from last good checkpoint); overlaps in vocabulary but has a different use case and failure model.
- patterns/blobless-clone-lazy-hydrate: related shape in the Git-at-scale world.
Seen in
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]]: mechanism disclosure.
- sources/2026-01-09-flyio-code-and-let-live: UX disclosure.
Related
- systems/fly-sprites
- systems/juicefs
- systems/litestream
- systems/sqlite
- concepts/fast-checkpoint-via-metadata-shuffle
- concepts/metadata-data-split-storage
- concepts/immutable-object-storage
- concepts/first-class-checkpoint-restore
- patterns/metadata-plus-chunk-storage-stack
- patterns/checkpoint-backup-to-object-storage
- companies/flyio