PATTERN Cited by 1 source

Checkpoint-resumable fiber

Pattern

Run long-running agent work (tens of minutes to days) as a fiber: a regular async function whose invocation is registered in the host actor's durable storage before execution begins, checkpointed at developer-chosen save points via an explicit stash() call, and resumed from the last checkpoint after any eviction, crash, deploy, or platform restart via an onFiberRecovered hook that the runtime invokes automatically.

Key property: the durable-execution substrate is co-located with the agent's own actor — not a separate workflow engine in another tier. The actor is the execution unit + the recovery unit.

Canonical instance (Cloudflare Project Think, 2026-04-15)

From runFiber() in the Project Think launch post (Source: sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents):

import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async startResearch(topic: string) {
    void this.runFiber("research", async (ctx) => {
      const findings = [];
      for (let i = 0; i < 10; i++) {
        const result = await this.callLLM(`Research step ${i}: ${topic}`);
        findings.push(result);
        ctx.stash({ findings, step: i, topic });   // checkpoint
        this.broadcast({ type: "progress", step: i });
      }
      return { findings };
    });
  }

  async onFiberRecovered(ctx) {
    if (ctx.name === "research" && ctx.snapshot) {
      const { topic } = ctx.snapshot;
      await this.startResearch(topic);
    }
  }
}

The pattern's three moving parts:

runFiber("name", async fn) — registers the invocation in the actor's SQLite before execution, so the platform knows the fiber needs to resume on restart.
ctx.stash(snapshot) — a developer-chosen save point. Arbitrary serialisable snapshot.
onFiberRecovered(ctx) — runtime-invoked handler that receives the last-stashed snapshot on restart; the agent reads it + resumes from there.
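
The interplay of the three parts can be sketched with an in-memory model. This is a minimal illustration, not the SDK: MiniHost, DurableStore, research, and demo are hypothetical names, the real runFiber / stash / onFiberRecovered signatures may differ, and an eviction is simulated here by throwing. Unlike the canonical instance above, the recovery handler in this sketch passes the whole snapshot back in, so completed steps are not repeated.

```typescript
// Hypothetical in-memory model; MiniHost and DurableStore are illustrative
// names, not SDK classes.
type Snapshot = { topic: string; findings: string[]; step: number };

interface FiberContext {
  stash(snapshot: Snapshot): void;
}

interface RecoveryContext {
  name: string;
  snapshot: Snapshot | undefined;
}

// Stand-in for the actor's durable storage; it outlives any one host instance.
class DurableStore {
  readonly fibers = new Map<string, string | null>(); // fiber name -> JSON snapshot
}

class MiniHost {
  private readonly store: DurableStore;

  constructor(store: DurableStore) {
    this.store = store;
  }

  async runFiber<T>(name: string, fn: (ctx: FiberContext) => Promise<T>): Promise<T> {
    // 1. Register before execution, keeping any snapshot left by an earlier run.
    if (!this.store.fibers.has(name)) this.store.fibers.set(name, null);
    const ctx: FiberContext = {
      // 2. Save point: only what survives serialisation is kept.
      stash: (snapshot: Snapshot): void => {
        this.store.fibers.set(name, JSON.stringify(snapshot));
      },
    };
    const result = await fn(ctx); // a throw here leaves the registration in place
    this.store.fibers.delete(name); // completed: nothing left to recover
    return result;
  }

  // 3. After a restart, hand each unfinished fiber to the recovery handler.
  async recover(onFiberRecovered: (ctx: RecoveryContext) => Promise<void>): Promise<void> {
    for (const [name, json] of Array.from(this.store.fibers.entries())) {
      const snapshot = json === null ? undefined : (JSON.parse(json) as Snapshot);
      await onFiberRecovered({ name, snapshot });
    }
  }
}

const stepsRun: number[] = []; // every step that actually executed, in order

async function research(
  host: MiniHost,
  topic: string,
  resume?: Snapshot,
  crashAt?: number,
): Promise<string[]> {
  return host.runFiber("research", async (ctx) => {
    const findings: string[] = resume ? resume.findings.slice() : [];
    const firstStep = resume ? resume.step + 1 : 0;
    for (let step = firstStep; step < 5; step++) {
      if (step === crashAt) throw new Error("simulated eviction");
      stepsRun.push(step);
      findings.push(`${topic} #${step}`);
      ctx.stash({ topic, findings, step }); // checkpoint
    }
    return findings;
  });
}

async function demo(): Promise<string[]> {
  const store = new DurableStore();
  // First incarnation dies before step 3 runs.
  await research(new MiniHost(store), "fibers", undefined, 3).catch(() => undefined);
  // Second incarnation sees the registration and resumes from the snapshot.
  const host = new MiniHost(store);
  let findings: string[] = [];
  await host.recover(async (ctx) => {
    if (ctx.name === "research" && ctx.snapshot) {
      findings = await research(host, ctx.snapshot.topic, ctx.snapshot);
    }
  });
  return findings;
}
```

In this model each of the five steps executes exactly once across the two incarnations, because the handler resumes at snapshot.step + 1 instead of restarting the loop.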

The SDK transparently invokes keepAlive() during active fiber execution so the actor isn't evicted mid-step; for wall-clock-long work (hours-plus) the idiom is persist external job ID → hibernate → wake on callback rather than occupying the fiber's stack.

Contrast with workflow engines

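Cloudflare Workflows, Temporal, AWS Step Functions: each step is checkpointed automatically (its input and output, plus determinism-critical values such as randomness, are recorded durably), and on recovery the engine resumes the workflow past the steps that already completed. Replay-based engines such as Temporal do this by re-running the workflow code against the recorded event history.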

Fiber shape: checkpoints are developer-chosen, not per-step. Replay is not automatic: the developer reads the snapshot in onFiberRecovered and re-invokes logic. Fibers trade the determinism enforcement of workflow engines (the engine enforces it; fibers leave it to the developer) for lower ceremony (a fiber is a regular async function with ctx.stash() calls, not a per-step activity / task definition).

Axis	Workflow engine (Temporal, Workflows)	Fiber (Project Think)
Checkpoint granularity	per-step, automatic	developer-chosen via `stash()`
Replay on recovery	automatic (events replayed)	manual (read snapshot, re-invoke)
Determinism contract	enforced by engine	developer responsibility
Tier	separate orchestration tier	co-located with agent actor
Ceremony	task definitions, activities	regular `async` function
Typical use	multi-service orchestration	agent-loop-scoped durable execution
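
The first two rows of the table can be made concrete with a toy comparison. This is a sketch under stated assumptions: ReplayEngine, FiberStore, workflow, fiber, and compare are hypothetical names, the code is synchronous for brevity, and the replay engine is a generic journal-and-skip model rather than the mechanism of any particular product. Both versions crash after step 3 of 4; the fiber stashes only after every second step.

```typescript
// Hypothetical replay-style engine: every step result is journalled, and
// recovery re-runs the whole function with completed steps served from the journal.
class ReplayEngine {
  executed = 0; // steps that really ran (not replayed) in this incarnation
  private cursor = 0;
  private readonly journal: number[];

  constructor(journal: number[]) {
    this.journal = journal;
  }

  step(fn: () => number): number {
    if (this.cursor < this.journal.length) {
      return this.journal[this.cursor++] as number; // already done: replay the result
    }
    const result = fn();
    this.executed++;
    this.journal.push(result); // automatic per-step checkpoint
    this.cursor++;
    return result;
  }
}

function workflow(engine: ReplayEngine, crashAfter?: number): number {
  let total = 0;
  for (let i = 1; i <= 4; i++) {
    total += engine.step(() => i * 10);
    if (i === crashAfter) throw new Error("simulated crash");
  }
  return total;
}

// Fiber style: one snapshot, written only where the developer chooses.
type FiberSnapshot = { total: number; next: number };

class FiberStore {
  snapshot: FiberSnapshot | undefined = undefined;
  executed = 0; // step executions across all incarnations
}

function fiber(store: FiberStore, crashAfter?: number): number {
  let total = store.snapshot ? store.snapshot.total : 0;
  const first = store.snapshot ? store.snapshot.next : 1;
  for (let i = first; i <= 4; i++) {
    total += i * 10;
    store.executed++;
    if (i % 2 === 0) store.snapshot = { total, next: i + 1 }; // stash every second step
    if (i === crashAfter) throw new Error("simulated crash");
  }
  return total;
}

function compare(): { workflowTotal: number; workflowRerun: number; fiberTotal: number; fiberRerun: number } {
  const journal: number[] = [];
  try { workflow(new ReplayEngine(journal), 3); } catch { /* crash after step 3 */ }
  const engine = new ReplayEngine(journal);
  const workflowTotal = workflow(engine); // steps 1-3 come from the journal

  const store = new FiberStore();
  try { fiber(store, 3); } catch { /* crash after step 3 */ }
  const before = store.executed;
  const fiberTotal = fiber(store); // resumes from the snapshot taken after step 2
  return { workflowTotal, workflowRerun: engine.executed, fiberTotal, fiberRerun: store.executed - before };
}
```

Both recoveries reach the same total, but the replay engine re-executes only step 4 while the fiber re-executes steps 3 and 4: the re-execution window is whatever ran since the last developer-chosen stash.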

Prerequisites

Actor substrate that can register the fiber before execution, persist checkpoints, and invoke the recovery handler. Durable Objects is the canonical wiki example; other actor runtimes (Orleans, Akka Cluster + Persistence) have similar primitives.
Embedded durable storage co-located with the actor — SQLite in DO, state-store in Orleans, event-sourced journal in Akka — otherwise fiber registrations / checkpoints incur an external round-trip.
Idempotence at save points. If a side-effecting operation runs just before a stash() call and the actor dies before that stash is durable, recovery hands onFiberRecovered the previous snapshot and the resumed logic re-executes the side effect. The developer must make such side effects idempotent or move them inside the same transaction as the checkpoint.
Client-side resumable streams for user-facing agent loops: the client SDK reconnects + resumes the stream rather than showing a crashed session.
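
The idempotence prerequisite above can be illustrated with a deterministic idempotency key. This is a minimal sketch, assuming a downstream service that deduplicates on such a key; NotificationService, CheckpointStore, notifyFiber, and deliveredAfterCrash are hypothetical names, and the crash is simulated by throwing between the side effect and the checkpoint.

```typescript
// Hypothetical downstream service that deduplicates on an idempotency key.
class NotificationService {
  readonly delivered: string[] = [];
  private readonly seenKeys = new Set<string>();

  send(message: string, idempotencyKey?: string): void {
    if (idempotencyKey !== undefined) {
      if (this.seenKeys.has(idempotencyKey)) return; // duplicate from a re-executed step
      this.seenKeys.add(idempotencyKey);
    }
    this.delivered.push(message);
  }
}

type Progress = { nextStep: number };

// Stand-in for the snapshot held in the actor's durable storage.
class CheckpointStore {
  snapshot: Progress | undefined = undefined;
}

function notifyFiber(
  service: NotificationService,
  store: CheckpointStore,
  useKeys: boolean,
  crashBeforeStashAt?: number,
): void {
  const first = store.snapshot ? store.snapshot.nextStep : 0;
  for (let step = first; step < 3; step++) {
    // The side effect runs first; the key is derived from fiber name + step,
    // so a re-executed step produces the same key.
    service.send(`step ${step} done`, useKeys ? `notify:${step}` : undefined);
    // A crash here loses the checkpoint but not the effect.
    if (step === crashBeforeStashAt) throw new Error("simulated crash before stash");
    store.snapshot = { nextStep: step + 1 }; // checkpoint
  }
}

function deliveredAfterCrash(useKeys: boolean): string[] {
  const service = new NotificationService();
  const store = new CheckpointStore();
  try { notifyFiber(service, store, useKeys, 1); } catch { /* actor died */ }
  notifyFiber(service, store, useKeys); // recovery resumes from the last snapshot
  return service.delivered;
}
```

Without keys, step 1 is delivered twice (four messages for three steps); with keys, the re-executed send is dropped downstream and exactly three messages arrive.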

Design choices within the pattern

Checkpoint frequency. Frequent stash() calls → cheap recovery + higher storage cost. Infrequent → long re-execution windows on crash.
Snapshot size. Keep snapshots small — only the minimum needed to reconstruct progress. Large snapshots hit storage + deserialisation costs on every checkpoint.
Keepalive posture. Use keepAlive() unconditionally or keepAliveWhile(fn) conditionally; the condition should become false when the fiber is genuinely waiting on an external callback so the actor can hibernate.
External job IDs for hours-plus work. The fiber stashes an external_job_id, hibernates, and the actor wakes on alarm or external webhook + dispatches back into the fiber.
Sibling fibers. Multiple named fibers in one actor — each independently checkpointed, resumed on restart, addressable for observability + cancellation.
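
The external-job-ID choice above (persist the ID, hibernate, wake on callback) might look roughly like the following. This is a sketch of the shape only, not of the SDK: ActorStorage, PipelineActor, onJobFinished, and demoHibernation are hypothetical names, hibernation is modelled by constructing a fresh actor over the same storage, and the webhook or alarm wiring is left out.

```typescript
type JobSnapshot = {
  externalJobId: string;
  topic: string;
  phase: "waiting" | "done";
  report?: string;
};

// Stand-in for the actor's durable storage; it survives hibernation.
class ActorStorage {
  private data: string | null = null;

  save(snapshot: JobSnapshot): void {
    this.data = JSON.stringify(snapshot);
  }

  load(): JobSnapshot | undefined {
    return this.data === null ? undefined : (JSON.parse(this.data) as JobSnapshot);
  }
}

class PipelineActor {
  private readonly storage: ActorStorage;

  constructor(storage: ActorStorage) {
    this.storage = storage;
  }

  // Kick off the slow work, stash its ID, and return. Nothing stays on the
  // stack, so the actor is free to hibernate.
  start(topic: string, submit: (topic: string) => string): string {
    const externalJobId = submit(topic);
    this.storage.save({ externalJobId, topic, phase: "waiting" });
    return externalJobId;
  }

  // Hypothetical entry point, called when a webhook or alarm wakes the actor.
  onJobFinished(externalJobId: string, output: string): boolean {
    const snapshot = this.storage.load();
    if (!snapshot || snapshot.phase !== "waiting" || snapshot.externalJobId !== externalJobId) {
      return false; // unknown, stale, or duplicate callback
    }
    this.storage.save({ ...snapshot, phase: "done", report: `${snapshot.topic}: ${output}` });
    return true;
  }

  report(): string | undefined {
    const snapshot = this.storage.load();
    return snapshot ? snapshot.report : undefined;
  }
}

function demoHibernation(): { first: boolean; duplicate: boolean; report: string | undefined } {
  const storage = new ActorStorage();
  const jobId = new PipelineActor(storage).start("nightly-eval", (t) => `job-for-${t}`);
  // Hours later: a fresh actor instance wakes with only durable storage.
  const woken = new PipelineActor(storage);
  const first = woken.onJobFinished(jobId, "42 checks passed");
  const duplicate = woken.onJobFinished(jobId, "42 checks passed");
  return { first, duplicate, report: woken.report() };
}
```

Matching the callback against the stashed ID and phase makes the wake-up handler safe against repeated or stale callbacks, which is the same idempotence concern as at any other save point.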

When the pattern fits

Long-running agent work: multi-turn research loops, multi-minute code-generation sessions, multi-hour pipelines.
Work that accumulates substantial intermediate state.
Work that interacts with slow external systems (LLM provider APIs, CI systems, human review).

When it doesn't

Work short enough to retry end-to-end on failure (under a few seconds).
Multi-service orchestration with strict step-determinism requirements — use a workflow engine with automatic per-step checkpointing + replay.
Work whose side effects are not easily made idempotent; fibers don't give you the deterministic-replay guarantee.

Seen in

sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents — canonical wiki instance. runFiber() + stash() + onFiberRecovered introduced as a first-class primitive in the Project Think SDK. Motivating framing: "An LLM call takes 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish."

systems/project-think — the SDK where this pattern ships as runFiber().
systems/cloudflare-durable-objects — the actor substrate the fiber registration lives inside.
systems/cloudflare-workflows — the sibling workflow-engine tier in Cloudflare's platform.
systems/temporal — workflow-engine exemplar with a different checkpoint discipline.
concepts/durable-execution — the general property this pattern realises.
concepts/actor-model — the substrate primitive the pattern sits on.
concepts/wal-write-ahead-logging — the underlying durability discipline.
patterns/colocated-child-actor-rpc — composes inside a parent fiber that spawns sub-agents via Facets.
patterns/tree-structured-conversation-memory — the session persistence a long-running fiber typically reads from + writes to.