
Fast checkpoint via metadata shuffle

Definition

A checkpoint / restore implementation in which checkpoint = snapshot the metadata tier; restore = re-point the running workload to a previous metadata snapshot. The data tier is not copied, not moved, not even read. The cost of checkpoint / restore is dominated by the metadata tier's size (megabytes to hundreds of megabytes, typically), not by the data tier's size (hundreds of gigabytes).

Contingent on an architectural precondition: the data tier is content-addressed and immutable, which lets different metadata snapshots share the same underlying content blobs.

Canonical wiki statement

Fly.io Sprites, 2026-01-14:

"This also buys Sprites fast checkpoint and restore. Checkpoints are so fast we want you to use them as a basic feature of the system and not as an escape hatch when things go wrong; like a git restore, not a system restore. That works because both checkpoint and restore merely shuffle metadata around."

"(our pre-installed Claude Code will checkpoint aggressively for you without asking)"

(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])

Paired user-facing disclosure from the [[sources/2026-01-09-flyio-code-and-let-live|2026-01-09 launch post]]: rm -rf $HOME/bin, dd if=/dev/random of=/dev/vdb, an ill-advised pip3 install, then "everything's broken. So: sprite checkpoint restore v1", and ~1 second later the Sprite is back to the pre-damage state.

Why it works

Three building blocks:

  1. Metadata/data split. A small metadata DB maps {file, offset} → {chunk-id, …}. A large content tier stores chunks keyed by id.
  2. Immutable content tier. Chunks, once written, are not mutated. A new version of a file produces new chunk writes; old chunks remain.
  3. Cheap metadata cloning. The metadata tier's native primitives (snapshot, clone, copy) are O(metadata-size), often sub-second on the scales that apply (SQLite DB of MB-scale for a Sprite).
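The three building blocks can be sketched in a few lines. This is an illustrative model, not Fly.io's actual code: a dict stands in for the content tier, SHA-256 of the chunk bytes stands in for the chunk id, and the metadata tier is a small map of {path: [chunk-id, ...]}.

```python
import hashlib

# Content tier: chunks keyed by the hash of their bytes, never mutated.
content_tier = {}

def put_chunk(data: bytes) -> str:
    chunk_id = hashlib.sha256(data).hexdigest()
    content_tier.setdefault(chunk_id, data)  # write-once: existing ids untouched
    return chunk_id

# Metadata tier: small map of {path: [chunk-id, ...]}.
metadata = {}

def write_file(path: str, data: bytes) -> None:
    # A new version of a file produces new chunk writes; old chunks remain.
    metadata[path] = [put_chunk(data)]

write_file("/etc/motd", b"hello v1")
old_ids = list(metadata["/etc/motd"])
write_file("/etc/motd", b"hello v2")

assert old_ids[0] in content_tier        # old chunk still in the content tier
assert metadata["/etc/motd"] != old_ids  # metadata now points at the new chunk
```

Because chunks are keyed by content and never overwritten, any number of metadata snapshots can safely reference the same underlying blobs.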

With these in place:

  • Checkpoint: copy / snapshot the metadata DB at an instant. The content tier is unchanged. New writes made after checkpoint produce new chunks; they don't mutate any chunks referenced by the checkpoint.
  • Restore: swap the running workload's metadata DB to the checkpoint copy. The content tier still holds every chunk the checkpoint's metadata references.
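Checkpoint and restore then reduce to copying and swapping the small metadata map. A self-contained sketch under the same assumptions as above (dict content tier, hash chunk ids), hypothetical names throughout:

```python
import copy
import hashlib

content_tier = {}  # immutable, content-addressed chunks
metadata = {}      # live metadata: {path: [chunk-id, ...]}

def write_file(path: str, data: bytes) -> None:
    chunk_id = hashlib.sha256(data).hexdigest()
    content_tier.setdefault(chunk_id, data)  # never mutates existing chunks
    metadata[path] = [chunk_id]

def checkpoint() -> dict:
    # Checkpoint: copy the small metadata map. O(metadata-size);
    # the content tier is not copied, not moved, not even read.
    return copy.deepcopy(metadata)

def restore(snapshot: dict) -> None:
    # Restore: re-point the live workload at the snapshot's metadata.
    metadata.clear()
    metadata.update(copy.deepcopy(snapshot))

def read_file(path: str) -> bytes:
    return b"".join(content_tier[cid] for cid in metadata[path])

write_file("/home/me/tool", b"working binary")
v1 = checkpoint()

write_file("/home/me/tool", b"garbage")  # post-checkpoint damage
restore(v1)

assert read_file("/home/me/tool") == b"working binary"
```

The damage chunk still exists in the content tier after restore; restore never deletes anything, it only changes which chunks the metadata references.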

The user's view: "system is back to the pre-damage state in ~1 second". The platform's view: "we re-pointed a SQLite pointer".

Ptacek's framing: git restore, not system restore

Ptacek explicitly contrasts two mental models:

  • System restore — emergency, slow, restore-from-backup, loss of in-flight state. Use sparingly.
  • Git restore — ordinary, fast, use-it-freely, part of the normal workflow.

Fast-checkpoint-via-metadata-shuffle lets a VM-level checkpoint live in the git-restore quadrant. Fly.io's Claude Code integration leans on this for its agent workflow: "our pre-installed Claude Code will checkpoint aggressively for you without asking."

Relationship to the product-level concept

concepts/first-class-checkpoint-restore is the product concept — "checkpoint/restore as an ordinary feature, not an escape hatch". Fast-checkpoint-via-metadata-shuffle is the implementation mechanism that makes the product concept viable. A checkpoint/restore backed by block-copy-on-demand would support the same semantics but not at the speed required to make the product claim.

Comparison axes

| Checkpoint mechanism | Cost | Fits "git restore" shape? |
| --- | --- | --- |
| Full disk copy | O(disk-size) | No — minutes |
| Block-level snapshot (CoW on block dev) | O(delta-since-snapshot) | Sometimes |
| Metadata snapshot + immutable chunks | O(metadata-size) | Yes — sub-second |
| Full VM snapshot (mem + disk + state) | O(VM-working-set) | Often no |

Sprites land in the third row.

Operational numbers

  • Checkpoint create: "completes instantly" from the 2026-01-09 post.
  • Restore: ~1 second wall-clock.
  • Not disclosed: metadata-DB size per Sprite, storage per checkpoint, retention policy, checkpoint count ceiling, garbage-collection policy for unreferenced chunks in the content tier.

Caveats

  • Only works because chunks are immutable. If the content tier were mutable in place, a restore would need to reverse chunk mutations — fundamentally different cost profile.
  • Memory / process state is a separate question. The Sprites post doesn't separately describe live-memory checkpointing; whether restore cold-starts the inner container or restores live memory is not stated. Ptacek's scenario (break the filesystem, run sprite checkpoint restore v1) is filesystem-level; a memory-restore story would involve CRIU or equivalent and isn't explicitly addressed.
  • GC of unreferenced chunks is an important ops concern. If nothing reclaims chunks referenced only by deleted checkpoints, the content tier grows monotonically. The post doesn't describe a GC mechanism.
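The post doesn't describe a GC mechanism, but the natural shape is mark-and-sweep over chunk ids. A minimal sketch, assuming the same hypothetical {path: [chunk-id, ...]} metadata maps as above:

```python
def gc_content_tier(content_tier: dict, live_metadata: dict, checkpoints: list) -> int:
    """Delete chunks referenced by neither the live metadata nor any
    retained checkpoint. Returns the number of chunks reclaimed."""
    # Mark: every chunk id reachable from live metadata or a retained checkpoint.
    reachable = set()
    for meta in (live_metadata, *checkpoints):
        for chunk_ids in meta.values():
            reachable.update(chunk_ids)
    # Sweep: unreachable chunks are safe to drop only because a deleted
    # checkpoint can never come back to reference them.
    garbage = [cid for cid in content_tier if cid not in reachable]
    for cid in garbage:
        del content_tier[cid]
    return len(garbage)

tier = {"a": b"1", "b": b"2", "c": b"3"}
live = {"/f": ["a"]}
retained = [{"/f": ["b"]}]
assert gc_content_tier(tier, live, retained) == 1  # "c" reclaimed
assert set(tier) == {"a", "b"}
```

Without something like this, the "old chunks remain" property that makes restore free also makes the content tier grow monotonically.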

Seen in

  • [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — canonical wiki statement of the mechanism.
  • [[sources/2026-01-09-flyio-code-and-let-live]] — canonical wiki statement of the product-level UX.