PATTERN Cited by 1 source

Checkpoint before risky step¶

Pattern¶

Before an AI agent executes any potentially destructive command in its sandbox, take a copy-on-write checkpoint of the sandbox state. If the command destroys files, removes tooling, or corrupts the environment, restore to the checkpoint instead of rebuilding from scratch.

Mechanism¶

Agent identifies a command as potentially risky (file deletion, package removal, config mutation).
Sandbox runtime takes a CoW checkpoint — fast because only page-table metadata is duplicated.
Command executes in the sandbox.
On failure or undesired outcome: restore checkpoint (~seconds), retry with a different approach.
On success: optionally discard the checkpoint or advance to a new one.

Trade-offs¶

Pro	Con
Worst case is "restore and retry" not "rebuild from backup"	Requires a sandbox runtime that supports CoW snapshots
Cheap enough to be a reflex (per every risky step)	Checkpoint restore time (~9s observed) adds latency on failure path
Enables unattended agent operation	State outside the sandbox (network calls, DB writes) is not captured

Seen in¶

systems/fly-sprites — demonstrated in the Hermes Agent backend: rm -rf wiped app + git, 9-second checkpoint restore recovered everything (Source: sources/2026-06-08-flyio-building-agents-that-dont-break-themselves)
Conceptually related to database concepts/point-in-time-recovery applied at VM level