PATTERN Cited by 1 source
Checkpoint before risky step¶
Pattern¶
Before an AI agent executes any potentially destructive command in its sandbox, take a copy-on-write checkpoint of the sandbox state. If the command destroys files, removes tooling, or corrupts the environment, restore to the checkpoint instead of rebuilding from scratch.
Mechanism¶
- Agent identifies a command as potentially risky (file deletion, package removal, config mutation).
- Sandbox runtime takes a CoW checkpoint — fast because only page-table metadata is duplicated.
- Command executes in the sandbox.
- On failure or undesired outcome: restore checkpoint (~seconds), retry with a different approach.
- On success: optionally discard the checkpoint or advance to a new one.
Trade-offs¶
| Pro | Con |
|---|---|
| Worst case is "restore and retry" not "rebuild from backup" | Requires a sandbox runtime that supports CoW snapshots |
| Cheap enough to be a reflex (per every risky step) | Checkpoint restore time (~9s observed) adds latency on failure path |
| Enables unattended agent operation | State outside the sandbox (network calls, DB writes) is not captured |
Seen in¶
- systems/fly-sprites — demonstrated in the Hermes Agent backend:
rm -rfwiped app + git, 9-second checkpoint restore recovered everything (Source: sources/2026-06-08-flyio-building-agents-that-dont-break-themselves) - Conceptually related to database concepts/point-in-time-recovery applied at VM level