CONCEPT Cited by 1 source
First-class checkpoint / restore¶
Definition¶
A sandbox / VM primitive where creating and restoring system-state snapshots is treated as part of the ordinary user workflow, not an escape-hatch-for-emergencies. The defining properties (from Thomas Ptacek's 2026-01-09 Sprites launch, sources/2026-01-09-flyio-code-and-let-live):
- Create is cheap enough to do casually: "`sprite-env checkpoints create` completes instantly. Didn't even bother to measure."
- Restore is cheap enough to do interactively: "Restore took about one second. It's fast enough to use casually, interactively."
- UX framing is ordinary-course, not escape-hatch: "Not an escape hatch. Rather: an intended part of the ordinary course of using a Sprite."
- Mental model is version-control-for-system-state: "Like `git`, but for the whole system."
Why the first-class framing matters¶
Snapshot + restore has existed as a VM primitive for decades
(VMware's snapshot feature, qemu-img snapshot, AWS EC2 AMIs,
CRIU for Linux processes, Firecracker's snapshot API). What's
different is the position in the workflow:
- Escape-hatch snapshotting: "Before we do something risky, take a snapshot; if disaster strikes, revert." The snapshot is rare, heavy, friction-laden — you need to think about whether the situation warrants it. Default is not to snapshot.
- First-class snapshotting: snapshot + restore is ordinary. You checkpoint whenever you reach a state worth preserving, and you restore whenever convenient. Default is to snapshot any state you might want to roll back to.
The transition between the two depends on three things:
- Create-side cost must be negligible (Sprites claim "completes instantly"; you stop noticing it).
- Restore-side cost must be negligible (Sprites claim ~1 second; you stop noticing it).
- Framing — documentation, defaults, tutorials all treat it as ordinary. Without this, the primitive exists but the user still treats it as escape-hatch.
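When create and restore are both cheap, the "default is to snapshot" posture can be folded into the workflow itself. The following is a minimal sketch of that idea, not anything from the Sprites tooling: `run_steps`, `create_checkpoint`, and `restore` are hypothetical names, and the checkpoint store is a plain in-memory list so the example runs anywhere.

```python
from typing import Callable, List


def run_steps(
    steps: List[Callable[[], None]],
    create_checkpoint: Callable[[], str],
    restore: Callable[[str], None],
) -> int:
    """Run steps in order, checkpointing after each success.

    On the first failure, roll back to the latest checkpoint and stop.
    Returns the number of steps that completed.
    """
    last_good = create_checkpoint()  # baseline before any step runs
    for done, step in enumerate(steps):
        try:
            step()
        except Exception:
            # Casual recovery: revert is simply the next action.
            restore(last_good)
            return done
        last_good = create_checkpoint()
    return len(steps)


# Hypothetical in-memory stand-ins for a real checkpoint backend.
state = {"log": []}
saved = []


def create_checkpoint() -> str:
    saved.append(list(state["log"]))
    return str(len(saved) - 1)


def restore(checkpoint_id: str) -> None:
    state["log"] = list(saved[int(checkpoint_id)])


def install_ok() -> None:
    state["log"].append("installed")


def install_broken() -> None:
    state["log"].append("half-written")
    raise RuntimeError("everything's broken")


completed = run_steps([install_ok, install_broken, install_ok],
                      create_checkpoint, restore)
print(completed, state["log"])
```

The wrapper only makes sense if both callables are cheap. With an expensive create or restore, checkpointing after every step stops being a reasonable default, which is the escape-hatch regime.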
Demo from the canonical post¶
Ptacek's demonstration (verbatim, trimmed):
> Say I get an application up on its legs. Install more packages. Then: disaster. Maybe an ill-advised global `pip3 install`. Or `rm -rf $HMOE/bin`. Or `dd if=/dev/random of=/dev/vdb`. Whatever it was, everything's broken. So:
>
>     $ sprite checkpoint restore v1
>     Restoring from checkpoint v1...
>     Container components started successfully
>     Restore from v1 complete
>
> Sprites have first-class checkpoint and restore.
Three moves on the way to ordinary-course:
- Casual mistakes get casual recoveries. Nothing about the three example failures was rare; they're the kind of typos / accidents / agent-induced breakage that happen regularly.
- No narrative about "should I revert". The text treats revert as the next action, not as an extraordinary decision.
- Preserved semantics. The Sprite is back where it was — not a new VM, not a fresh image, the same computer with the intervening damage undone.
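The shape of the demo (build up state, checkpoint, break things, restore to the same machine) can be mimicked with a toy in-memory model. This is only an illustration of the semantics described above. `ToySandbox` is a hypothetical class, and nothing here reflects how Sprites capture or restore state internally.

```python
import copy


class ToySandbox:
    """In-memory stand-in for a durable VM (illustrative only)."""

    def __init__(self):
        self.state = {"packages": [], "files": {}}
        self._checkpoints = {}

    def checkpoint(self, name):
        # Deep-copy so later mutations cannot leak into the saved snapshot.
        self._checkpoints[name] = copy.deepcopy(self.state)

    def restore(self, name):
        # Copy on restore too, so a checkpoint can be restored repeatedly.
        self.state = copy.deepcopy(self._checkpoints[name])


box = ToySandbox()
box.state["packages"].append("ffmpeg")
box.state["files"]["/home/app/bin/run"] = "#!/bin/sh"
box.checkpoint("v1")

# Disaster: the equivalent of wiping a directory and a bad global install.
box.state["files"].clear()
box.state["packages"].append("bad-global-install")

box.restore("v1")
print(box.state)
```

The object identity of `box` never changes: the restore undoes the intervening damage in place instead of handing back a fresh sandbox, which mirrors the "same computer" point above.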
Replaces VM-replacement as the blast-radius mechanism¶
In the disposable-VM pattern, blast-radius is bounded by VM replacement: the next session gets a fresh VM, damage doesn't carry forward. In the durable-VM pattern, blast-radius is bounded by checkpoint / restore: the VM persists, but destructive mistakes are reverted via snapshot rollback.
Ptacek's argument (sources/2026-01-09-flyio-code-and-let-live): when restore is ~1s and casual, it provides the same blast-radius guarantee without paying the ephemeral-sandbox costs (node_modules rebuilds, external infrastructure for durable state, plan-file key-value stores). "Not an escape hatch. Rather: an intended part of the ordinary course."
Comparison: other "checkpoint" mechanisms on the wiki¶
| Mechanism | Granularity | Use-frequency |
|---|---|---|
| patterns/checkpoint-backup-to-object-storage (Corrosion 2025-10) | Per-cluster DB | "Ultimately" — cluster-reboot escape-hatch |
| CRIU process checkpoints | Per-process | Rare — debugging / migration |
| AWS EC2 AMIs | Per-VM image | Rare — release / provisioning |
| Firecracker snapshot API | Per-microVM | Per-invocation (Lambda) — platform-internal |
| Database PITR (systems/litestream, PostgreSQL WAL) | Per-DB-state | Continuous — platform-internal |
| Sprite checkpoints (systems/fly-sprites) | Per-VM (including disk) | Ordinary course, user-driven |
The Corrosion checkpoint-backup pattern is the escape-hatch comparison: Corrosion checkpoints are cheap to create (object-storage upload) but the restore path is cluster-reboot-heavy, "ultimately" used when diagnosis exceeded restore time. Sprite checkpoints are designed for the opposite end: restore is the ordinary path.
Prerequisites¶
- Fast-boot substrate: ~1s restore requires the equivalent of a fast-VM-boot primitive; see Sprites' "completed installation of ffmpeg preserved across restore" behaviour.
- Storage stack that captures the relevant state: a checkpoint that misses disk state, or in-flight network state, or kernel memory pages the app cares about, isn't ordinary-course. Sprite checkpoints are specifically sold as capturing enough state that "everything's where I left it".
- User-facing CLI that normalises the verbs: `checkpoints create`, `checkpoint restore` as first-class subcommands. Without this the primitive exists but the user doesn't reach for it casually.
- Cheap checkpoint storage: creating checkpoints often means the average user accumulates many; if each costs real money, users revert to escape-hatch use.
Caveats¶
- State-capture scope is a load-bearing claim. Ptacek's post doesn't specify what Sprite checkpoints capture (kernel memory? page cache? in-flight TCP connections? file locks? running-process state?). The "everything's where I left it" claim is narrative; the mechanism disclosure is deferred.
- Restore-during-active-connection semantics are unspecified — Sprites have Anycast HTTPS URLs; what happens to in-flight requests during a restore?
- Checkpoint coordination across services is not discussed. If the Sprite holds a DB connection to an external service, restoring the Sprite doesn't restore the external state, so the two sides can disagree after a rollback.
- First-class framing is editorial as much as technical. A vendor can ship snapshot primitives and still fail the ordinary-course bar by making the CLI awkward, the defaults stingy, or the documentation surface the feature as "advanced."
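The cross-service caveat above is easy to reproduce in miniature. The sketch below is a hypothetical model (`ExternalDB` and `SandboxWithRemote` are invented names, not part of any Sprites API): the sandbox checkpoints only its own local state, so a restore rewinds the sandbox's view while the external service keeps the write.

```python
import copy


class ExternalDB:
    """Stands in for a service outside the sandbox; never checkpointed."""

    def __init__(self):
        self.rows = []


class SandboxWithRemote:
    """Toy sandbox whose checkpoints cover local state only."""

    def __init__(self, db):
        self.db = db
        self.local = {"applied_migrations": []}
        self._checkpoints = {}

    def checkpoint(self, name):
        # Only sandbox-local state is captured; self.db is out of scope.
        self._checkpoints[name] = copy.deepcopy(self.local)

    def restore(self, name):
        self.local = copy.deepcopy(self._checkpoints[name])

    def migrate(self, name):
        # One logical change touches both sides of the boundary.
        self.db.rows.append(name)
        self.local["applied_migrations"].append(name)


db = ExternalDB()
box = SandboxWithRemote(db)
box.checkpoint("v1")
box.migrate("add_users_table")
box.restore("v1")

# The sandbox believes nothing was applied; the external DB disagrees.
diverged = db.rows != box.local["applied_migrations"]
print(diverged, db.rows, box.local)
```

Whether a real system needs idempotent writes, compensating actions, or some other reconciliation after a restore depends on details the source leaves open.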
Seen in¶
- sources/2026-01-09-flyio-code-and-let-live: canonical source. Ptacek's Sprites announcement. The "like `git`, but for the whole system" framing, the casual-restore demo, and the "not an escape hatch" editorial commit are all from this post.
Related¶
- concepts/durable-vs-ephemeral-sandbox — the axis this property enables the durable end of.
- concepts/fast-vm-boot-dx — fast-boot is the precondition for casual restore.
- concepts/agentic-development-loop — the workload where ordinary-course restore changes the flow.
- concepts/agent-with-root-shell — agents with root get into more trouble; first-class restore tempers the consequence.
- patterns/durable-micro-vm-for-agentic-loop — the pattern this property rescues from fragility.
- patterns/checkpoint-backup-to-object-storage — the escape-hatch comparison for cluster-level state.
- systems/fly-sprites — canonical product instance.
- systems/phoenix-new — the companion ephemeral product where checkpoint-restore is not the recovery posture (session-reset is instead).