Fast checkpoint via metadata shuffle¶
Definition¶
A checkpoint / restore implementation in which checkpoint = snapshot the metadata tier; restore = re-point the running workload to a previous metadata snapshot. The data tier is not copied, not moved, not even read. The cost of checkpoint / restore is dominated by the metadata tier's size (megabytes to hundreds of megabytes, typically), not by the data tier's size (hundreds of gigabytes).
Contingent on an architectural precondition: the data tier is content-addressed and immutable, which lets different metadata snapshots share the same underlying content blobs.
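Content addressing means a chunk's id is a cryptographic hash of its bytes: identical content dedupes to one blob, and an id can never silently change meaning. A minimal illustration (sha256 is an assumed hash choice, not necessarily what Sprites uses):

```python
import hashlib

# A chunk's id is derived from its content, so two metadata snapshots
# that reference the same bytes reference the same blob for free.
chunk = b"hello, sprite"
cid = hashlib.sha256(chunk).hexdigest()

# Same bytes -> same id, regardless of which snapshot wrote them.
assert cid == hashlib.sha256(b"hello, sprite").hexdigest()
```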
Canonical wiki statement¶
Fly.io Sprites, 2026-01-14:
"This also buys Sprites fast checkpoint and restore. Checkpoints are so fast we want you to use them as a basic feature of the system and not as an escape hatch when things go wrong; like a git restore, not a system restore. That works because both checkpoint and restore merely shuffle metadata around." "(our pre-installed Claude Code will checkpoint aggressively for you without asking)"
(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
Paired user-facing disclosure from the [[sources/2026-01-09-flyio-code-and-let-live|2026-01-09 launch post]]: rm -rf $HOME/bin, dd if=/dev/random of=/dev/vdb, an ill-advised pip3 install — "everything's broken. So: sprite checkpoint restore v1" — ~1 second later the Sprite is back to the pre-damage state.
Why it works¶
Three building blocks:
- Metadata/data split. A small metadata DB maps {file, offset} → {chunk-id, …}. A large content tier stores chunks keyed by id.
- Immutable content tier. Chunks, once written, are never mutated. A new version of a file produces new chunk writes; old chunks remain.
- Cheap metadata cloning. The metadata tier's native primitives (snapshot, clone, copy) are O(metadata-size), often sub-second on the scales that apply (SQLite DB of MB-scale for a Sprite).
With these in place:
- Checkpoint: copy / snapshot the metadata DB at an instant. The content tier is unchanged. New writes made after checkpoint produce new chunks; they don't mutate any chunks referenced by the checkpoint.
- Restore: swap the running workload's metadata DB to the checkpoint copy. The content tier still holds every chunk the checkpoint's metadata references.
The user's view: "system is back to the pre-damage state in ~1 second". The platform's view: "we re-pointed a SQLite pointer".
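The checkpoint/restore steps above can be sketched in miniature. This is a toy model, not Sprites' code: ChunkStore, Volume, and the schema are hypothetical names, and the real system layers JuiceFS over object storage. The load-bearing ideas are only (a) data lives in an immutable, content-addressed store and (b) checkpoint/restore touch only the MB-scale metadata DB:

```python
import hashlib
import shutil
import sqlite3
import tempfile
from pathlib import Path

class ChunkStore:
    """Immutable, content-addressed data tier: chunks keyed by hash."""
    def __init__(self):
        self.chunks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(cid, data)   # write-once: never overwritten
        return cid

class Volume:
    """Metadata tier: one small SQLite DB mapping (path, offset) -> chunk id."""
    def __init__(self, db_path: Path, store: ChunkStore):
        self.db_path, self.store = db_path, store
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS files "
            "(path TEXT, off INT, chunk TEXT, PRIMARY KEY (path, off))")

    def write(self, path: str, off: int, data: bytes) -> None:
        cid = self.store.put(data)           # new chunk; old chunks untouched
        self.db.execute("REPLACE INTO files VALUES (?,?,?)", (path, off, cid))
        self.db.commit()

    def read(self, path: str, off: int) -> bytes:
        row = self.db.execute(
            "SELECT chunk FROM files WHERE path=? AND off=?", (path, off)).fetchone()
        return self.store.chunks[row[0]] if row else b""

    def checkpoint(self, name: str) -> Path:
        # Checkpoint = point-in-time copy of the metadata DB only.
        snap = self.db_path.with_suffix(f".{name}")
        dest = sqlite3.connect(snap)
        self.db.backup(dest)                 # O(metadata-size), not O(data-size)
        dest.close()
        return snap

    def restore(self, snap: Path) -> None:
        # Restore = re-point the volume at the snapshot's metadata.
        self.db.close()
        shutil.copy(snap, self.db_path)
        self.db = sqlite3.connect(self.db_path)

# Break the "filesystem", then shuffle metadata back.
vol = Volume(Path(tempfile.mkdtemp()) / "meta.db", ChunkStore())
vol.write("/etc/motd", 0, b"hello")
snap = vol.checkpoint("v1")
vol.write("/etc/motd", 0, b"garbage")        # post-checkpoint damage
vol.restore(snap)
print(vol.read("/etc/motd", 0))              # pre-damage content is back
```

Note that restore copies zero data chunks: both the "hello" and "garbage" chunks still sit in the store afterward, which is exactly why unreferenced-chunk GC becomes the open ops question.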
Ptacek's framing: git restore, not system restore¶
Ptacek explicitly contrasts two mental models:
- System restore — emergency, slow, restore-from-backup, loss of in-flight state. Use sparingly.
- Git restore — ordinary, fast, use-it-freely, part of the normal workflow.
Fast-checkpoint-via-metadata-shuffle lets a VM-level checkpoint live in the git-restore quadrant. The agent workflow at Fly.io's Claude-Code integration leans on this: "our pre-installed Claude Code will checkpoint aggressively for you without asking."
Relationship to the product-level concept¶
concepts/first-class-checkpoint-restore is the product concept — "checkpoint/restore as an ordinary feature, not an escape hatch". Fast-checkpoint-via-metadata-shuffle is the implementation mechanism that makes the product concept viable. A checkpoint/restore backed by block-copy-on-demand would support the same semantics but not at the speed required to make the product claim.
Comparison axes¶
| Checkpoint mechanism | Cost | Fits "git restore" shape? |
|---|---|---|
| Full disk copy | O(disk-size) | No — minutes |
| Block-level snapshot (CoW on block dev) | O(delta-since-snapshot) | Sometimes |
| Metadata snapshot + immutable chunks | O(metadata-size) | Yes — sub-second |
| Full VM snapshot (mem + disk + state) | O(VM-working-set) | Often no |
Sprites land in the third row.
Operational numbers¶
- Checkpoint create: "completes instantly" from the 2026-01-09 post.
- Restore: ~1 second wall-clock.
- Not disclosed: metadata-DB size per Sprite, per-checkpoint storage cost, retention policy, checkpoint count ceiling, garbage-collection policy for unreferenced chunks in the content tier.
Caveats¶
- Only works because chunks are immutable. If the content tier were mutable in place, a restore would need to reverse chunk mutations — fundamentally different cost profile.
- Memory / process state is a separate question. The Sprites post doesn't separately describe live-memory checkpointing; whether restore bounces the inner container from a cold start or restores live memory is not stated. Ptacek's scenario (break filesystem, run sprite checkpoint restore v1) is filesystem-level; a memory-restore story would involve CRIU or equivalent and isn't explicitly addressed.
- GC of unreferenced chunks is an important ops concern. If nothing reclaims chunks referenced only by deleted checkpoints, the content tier grows monotonically. The post doesn't describe a GC mechanism.
Seen in¶
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — canonical wiki statement of the mechanism.
- [[sources/2026-01-09-flyio-code-and-let-live]] — canonical wiki statement of the product-level UX.
Related¶
- systems/fly-sprites
- systems/juicefs — the storage architecture the mechanism lives inside.
- systems/litestream — the metadata-DB durability substrate.
- concepts/metadata-data-split-storage — the architectural precondition.
- concepts/object-storage-as-disk-root — the durability anchor.
- concepts/immutable-object-storage — the invariant the mechanism leans on.
- concepts/first-class-checkpoint-restore — the product concept.
- patterns/checkpoint-as-metadata-clone — canonical pattern.
- patterns/checkpoint-backup-to-object-storage — complementary DR-oriented pattern; distinct use case.
- companies/flyio