Checkpoint-resumable fiber¶
Pattern¶
Run long-running agent work (tens of minutes to days) as a
fiber — a regular async function — whose executions are
registered in the host actor's durable storage before execution
begins, checkpointed with developer-chosen save points via an
explicit stash() call, and automatically resumed from the last
checkpoint on any eviction, crash, deploy, or platform restart
via an onFiberRecovered hook.
Key property: the durable-execution substrate is co-located with the agent's own actor — not a separate workflow engine in another tier. The actor is the execution unit + the recovery unit.
Canonical instance (Cloudflare Project Think, 2026-04-15)¶
From runFiber() in the Project Think launch post
(Source:
sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents):
```ts
import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async startResearch(topic: string) {
    void this.runFiber("research", async (ctx) => {
      const findings = [];
      for (let i = 0; i < 10; i++) {
        const result = await this.callLLM(`Research step ${i}: ${topic}`);
        findings.push(result);
        ctx.stash({ findings, step: i, topic }); // checkpoint
        this.broadcast({ type: "progress", step: i });
      }
      return { findings };
    });
  }

  async onFiberRecovered(ctx) {
    if (ctx.name === "research" && ctx.snapshot) {
      const { topic } = ctx.snapshot;
      await this.startResearch(topic);
    }
  }
}
```
The pattern's three moving parts:
- runFiber("name", async fn): registers the invocation in the actor's SQLite before execution, so the platform knows the fiber needs to resume on restart.
- ctx.stash(snapshot): a developer-chosen save point. Arbitrary serialisable snapshot.
- onFiberRecovered(ctx): runtime-invoked handler that receives the last-stashed snapshot on restart; the agent reads it + resumes from there.
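How the three parts interact can be sketched with a small in-memory model. Everything here is an assumption made for illustration: `FiberHost`, the `Map` standing in for the actor's SQLite, the synchronous body, and the thrown error standing in for an eviction. It is not the Project Think SDK's implementation. Unlike the launch-post example, the recovery path here reads `step` from the snapshot and continues from the next step instead of restarting the loop.

```typescript
// Hypothetical model: a Map stands in for the actor's durable SQLite storage.
type Snapshot = Record<string, unknown>;
interface FiberRecord { snapshot: Snapshot | null; done: boolean }
interface FiberCtx { name: string; snapshot: Snapshot | null; stash(s: Snapshot): void }

class FiberHost {
  private store: Map<string, FiberRecord>;
  constructor(store: Map<string, FiberRecord>) { this.store = store; }

  runFiber(name: string, body: (ctx: FiberCtx) => void): void {
    // 1. Register durably BEFORE the body runs (reuse the record on a rerun).
    const record: FiberRecord = this.store.get(name) ?? { snapshot: null, done: false };
    this.store.set(name, record);
    const ctx: FiberCtx = {
      name,
      snapshot: record.snapshot,
      // 2. stash() overwrites the fiber's last save point with a serialised copy.
      stash: (s) => { record.snapshot = JSON.parse(JSON.stringify(s)); },
    };
    body(ctx);
    record.done = true; // reached only if the body finished without crashing
  }

  // 3. After a restart, report every unfinished fiber with its last snapshot.
  recover(onFiberRecovered: (name: string, snapshot: Snapshot | null) => void): void {
    for (const [name, record] of Array.from(this.store.entries())) {
      if (!record.done) onFiberRecovered(name, record.snapshot);
    }
  }
}

const disk = new Map<string, FiberRecord>(); // survives the simulated restart
const executed: number[] = [];

function research(ctx: FiberCtx, crashAt: number | null): void {
  // Resume after the last stashed step, or start from zero on a first run.
  const start = ctx.snapshot ? (ctx.snapshot.step as number) + 1 : 0;
  for (let i = start; i < 5; i++) {
    if (i === crashAt) throw new Error("simulated eviction");
    executed.push(i);
    ctx.stash({ step: i });
  }
}

try {
  new FiberHost(disk).runFiber("research", (ctx) => research(ctx, 3));
} catch (_crash) {
  // The "process" died; only `disk` survives.
}

const restarted = new FiberHost(disk);
restarted.recover((name) => restarted.runFiber(name, (ctx) => research(ctx, null)));
```

After the simulated restart, steps 0 to 2 are not repeated: the recovered run starts at step 3 because the last stashed snapshot recorded `step: 2`.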
The SDK transparently invokes keepAlive() during active fiber
execution so the actor isn't evicted mid-step; for wall-clock-long
work (hours-plus) the idiom is persist external job ID →
hibernate → wake on callback rather than occupying the fiber's
stack.
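The persist, hibernate, wake idiom can be sketched as a small state machine. The names `PipelineAgent`, `JobState`, `start`, and `onJobCallback` are hypothetical, and the in-memory `state` field stands in for durable actor storage; how the real SDK delivers the callback (alarm or webhook) is not modelled here.

```typescript
// Hypothetical sketch: state that would live in the actor's durable storage.
type JobState =
  | { phase: "idle" }
  | { phase: "waiting"; externalJobId: string }
  | { phase: "done"; result: string };

class PipelineAgent {
  state: JobState = { phase: "idle" };
  private submit: (topic: string) => string;
  constructor(submit: (topic: string) => string) { this.submit = submit; }

  start(topic: string): void {
    // Hand the hours-long work to the external system and keep only its ID.
    const externalJobId = this.submit(topic);
    this.state = { phase: "waiting", externalJobId };
    // Nothing is left running after this point, so the actor may hibernate.
  }

  // Invoked when the external system calls back; returns whether it was accepted.
  onJobCallback(externalJobId: string, result: string): boolean {
    const s = this.state;
    if (s.phase !== "waiting" || s.externalJobId !== externalJobId) {
      return false; // stale, unknown, or duplicate callback: ignore it
    }
    this.state = { phase: "done", result };
    return true;
  }
}

const agent = new PipelineAgent((topic) => `job-${topic}`);
agent.start("nightly-eval");
const accepted = agent.onJobCallback("job-nightly-eval", "ok");
const duplicate = agent.onJobCallback("job-nightly-eval", "ok");
```

The design point is that the only thing held across the wait is the persisted job ID, not a suspended call stack, so a restart during the wait loses nothing.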
Contrast with workflow engines¶
Cloudflare Workflows, Temporal, AWS Step Functions: each step is automatically checkpointed (its input, output, determinism-critical randomness recorded in a durable event log) and the whole workflow is replayed on recovery, skipping already-executed steps.
Fiber shape: checkpoints are developer-chosen, not per-step.
Replay is not automatic: the developer reads the snapshot in
onFiberRecovered and re-invokes the logic. Fibers give up
engine-enforced determinism (workflow engines enforce it;
fibers don't) in exchange for lower ceremony: a fiber is a
regular async function with ctx.stash() calls, not a set of
per-step activity / task definitions.
| Axis | Workflow engine (Temporal, Workflows) | Fiber (Project Think) |
|---|---|---|
| Checkpoint granularity | per-step, automatic | developer-chosen via stash() |
| Replay on recovery | automatic (events replayed) | manual (read snapshot, re-invoke) |
| Determinism contract | enforced by engine | developer responsibility |
| Tier | separate orchestration tier | co-located with agent actor |
| Ceremony | task definitions, activities | regular async function |
| Typical use | multi-service orchestration | agent-loop-scoped durable execution |
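For contrast, the engine-style "automatic replay" row can be illustrated with a toy journal. This is a deliberately simplified sketch, not how Temporal or Cloudflare Workflows are implemented; `ReplayLog`, `step`, and the `Map`-backed journal are hypothetical names. Each step's output is recorded automatically, and a rerun returns recorded outputs instead of executing the step again.

```typescript
// Hypothetical miniature of engine-style replay over a durable journal.
class ReplayLog {
  executions = 0; // steps actually executed by this run (not replayed)
  private journal: Map<string, unknown>;
  constructor(journal: Map<string, unknown>) { this.journal = journal; }

  step<T>(id: string, fn: () => T): T {
    if (this.journal.has(id)) {
      return this.journal.get(id) as T; // replayed from the journal, not re-run
    }
    const out = fn();
    this.executions++;
    this.journal.set(id, out); // automatic per-step checkpoint
    return out;
  }
}

function workflow(log: ReplayLog, crashAt: number | null): number {
  let total = 0;
  for (let i = 0; i < 4; i++) {
    if (i === crashAt) throw new Error("simulated crash");
    total += log.step(`square-${i}`, () => i * i);
  }
  return total;
}

const journal = new Map<string, unknown>(); // survives the simulated crash

try {
  workflow(new ReplayLog(journal), 2); // executes steps 0 and 1, then crashes
} catch (_crash) {
  // Only the journal survives.
}

const secondRun = new ReplayLog(journal);
const total = workflow(secondRun, null); // replays 0 and 1, executes 2 and 3
```

The whole function reruns from the top, which is why engines of this shape need step bodies and control flow to be deterministic; a fiber skips that contract because the developer decides what goes into the snapshot and how to continue from it.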
Prerequisites¶
- Actor substrate that can register the fiber before execution, persist checkpoints, and invoke the recovery handler. Durable Objects is the canonical wiki example; other actor runtimes (Orleans, Akka Cluster + Persistence) have similar primitives.
- Embedded durable storage co-located with the actor — SQLite in DO, state-store in Orleans, event-sourced journal in Akka — otherwise fiber registrations / checkpoints incur an external round-trip.
- Idempotence at save points. If a side-effecting operation runs immediately before a stash() call and recovery invokes onFiberRecovered before the stash was durable, replay re-executes the side effect. The developer must make side effects idempotent or move them inside the same transaction as the checkpoint.
- Client-side resumable streams for user-facing agent loops: the client SDK reconnects + resumes the stream rather than showing a crashed session.
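One way to satisfy the idempotence prerequisite is to attach a stable key to each side effect so the receiving system can drop repeats. The sketch below assumes a receiver that deduplicates on such a key; `EmailSink`, `runFrom`, and the `report:<step>` key format are hypothetical, and the `durableStep` variable stands in for what `ctx.stash({ step })` would persist.

```typescript
// Hypothetical receiver that deduplicates on an idempotency key.
class EmailSink {
  sent: string[] = [];
  private seen = new Set<string>();

  send(idempotencyKey: string, body: string): void {
    if (this.seen.has(idempotencyKey)) return; // repeat caused by a replay: drop it
    this.seen.add(idempotencyKey);
    this.sent.push(body);
  }
}

// Runs steps [start, 3) and returns the last step whose checkpoint became durable.
function runFrom(start: number, sink: EmailSink, crashBeforeStashAt: number | null): number {
  let durableStep = start - 1;
  for (let step = start; step < 3; step++) {
    sink.send(`report:${step}`, `step ${step} finished`); // side effect happens first
    if (step === crashBeforeStashAt) return durableStep; // crash: this stash never landed
    durableStep = step; // stands in for ctx.stash({ step })
  }
  return durableStep;
}

const sink = new EmailSink();
const lastDurable = runFrom(0, sink, 1); // step 1's effect ran, but only step 0 was stashed
const finalStep = runFrom(lastDurable + 1, sink, null); // step 1 is replayed, then step 2 runs
```

Step 1's side effect is attempted twice, once before the crash and once on replay, but the key derived from the fiber name and step number means the sink records it once.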
Design choices within the pattern¶
- Checkpoint frequency. Frequent stash() calls → cheap recovery + higher storage cost. Infrequent → long re-execution windows on crash.
- Snapshot size. Keep snapshots small: only the minimum needed to reconstruct progress. Large snapshots hit storage + deserialisation costs on every checkpoint.
- Keepalive posture. keepAlive() for unconditional, keepAliveWhile(fn) for conditional; the condition should become false when the fiber is genuinely waiting on an external callback so the actor can hibernate.
- External job IDs for hours-plus work. The fiber stashes an external_job_id, hibernates, and the actor wakes on alarm or external webhook + dispatches back into the fiber.
- Sibling fibers. Multiple named fibers in one actor, each independently checkpointed, resumed on restart, addressable for observability + cancellation.
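The checkpoint-frequency trade-off can be put in rough numbers. The model below is an assumption for illustration only: the fiber stashes once after every `interval` completed steps, and the worst case is a crash after a step completes but before its stash is durable, so every step since the last save point reruns. `checkpointCost` is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical cost model for stashing once every `interval` steps.
function checkpointCost(
  totalSteps: number,
  interval: number,
): { stashWrites: number; worstCaseRedoneSteps: number } {
  if (!Number.isInteger(interval) || interval < 1) {
    throw new RangeError("interval must be a positive integer");
  }
  return {
    // One storage write per completed group of `interval` steps.
    stashWrites: Math.floor(totalSteps / interval),
    // A crash just before a stash lands reruns the whole group.
    worstCaseRedoneSteps: Math.min(interval, totalSteps),
  };
}

const everyStep = checkpointCost(10, 1);  // { stashWrites: 10, worstCaseRedoneSteps: 1 }
const everyFifth = checkpointCost(10, 5); // { stashWrites: 2, worstCaseRedoneSteps: 5 }
```

With 30-second LLM calls as steps, the second configuration saves eight storage writes at the price of up to five repeated calls after a crash, which is usually the more expensive side of the trade.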
When the pattern fits¶
- Long-running agent work: multi-turn research loops, multi-minute code-generation sessions, multi-hour pipelines.
- Work that accumulates substantial intermediate state.
- Work that interacts with slow external systems (LLM provider APIs, CI systems, human review).
When it doesn't¶
- Work short enough to retry-on-failure end-to-end (< a few seconds).
- Multi-service orchestration with strict step-determinism requirements — use a workflow engine with automatic per-step checkpointing + replay.
- Work whose side effects are not easily made idempotent; fibers don't give you the deterministic-replay guarantee.
Seen in¶
- sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents: canonical wiki instance. runFiber() + stash() + onFiberRecovered introduced as a first-class primitive in the Project Think SDK. Motivating framing: "An LLM call takes 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish."
Related¶
- systems/project-think: the SDK where this pattern ships as runFiber().
- systems/cloudflare-durable-objects: the actor substrate the fiber registration lives inside.
- systems/cloudflare-workflows: the sibling workflow-engine tier in Cloudflare's platform.
- systems/temporal: workflow-engine exemplar with a different checkpoint discipline.
- concepts/durable-execution: the general property this pattern realises.
- concepts/actor-model: the substrate primitive the pattern sits on.
- concepts/wal-write-ahead-logging: the underlying durability discipline.
- patterns/colocated-child-actor-rpc: composes inside a parent fiber that spawns sub-agents via Facets.
- patterns/tree-structured-conversation-memory: the session persistence a long-running fiber typically reads from + writes to.