
PATTERN Cited by 1 source

Checkpoint-resumable fiber

Pattern

Run long-running agent work (tens of minutes to days) as a fiber: a regular async function whose invocation is registered in the host actor's durable storage before execution begins. The developer checkpoints it at chosen save points with an explicit stash() call, and after any eviction, crash, deploy, or platform restart the runtime invokes an onFiberRecovered hook with the last checkpoint so the work can resume from there.

Key property: the durable-execution substrate is co-located with the agent's own actor — not a separate workflow engine in another tier. The actor is the execution unit + the recovery unit.

Canonical instance (Cloudflare Project Think, 2026-04-15)

From runFiber() in the Project Think launch post (Source: sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents):

import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async startResearch(topic: string) {
    void this.runFiber("research", async (ctx) => {
      const findings = [];
      for (let i = 0; i < 10; i++) {
        const result = await this.callLLM(`Research step ${i}: ${topic}`);
        findings.push(result);
        ctx.stash({ findings, step: i, topic });   // checkpoint
        this.broadcast({ type: "progress", step: i });
      }
      return { findings };
    });
  }

  async onFiberRecovered(ctx) {
    if (ctx.name === "research" && ctx.snapshot) {
      const { topic } = ctx.snapshot;
      await this.startResearch(topic);
    }
  }
}

The pattern's three moving parts:

  1. runFiber("name", async fn) — registers the invocation in the actor's SQLite before execution, so the platform knows the fiber needs to resume on restart.
  2. ctx.stash(snapshot) — a developer-chosen save point. Arbitrary serialisable snapshot.
  3. onFiberRecovered(ctx) — runtime-invoked handler that receives the last-stashed snapshot on restart; the agent reads it + resumes from there.
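
The three parts can be modelled without the SDK. The sketch below is a minimal in-memory simulation, not the agents package: FakeStorage, runResearch, recover, and the nextStep snapshot field are hypothetical names, and a plain object stands in for the actor's SQLite. Unlike the canonical example above, which re-invokes the loop with only the topic, this variant resumes from the stashed step.

```typescript
// Hypothetical stand-in for the actor's durable storage (not the real SDK).
type Snapshot = { findings: string[]; nextStep: number };
type FiberRow = { snapshot: Snapshot | null; done: boolean };

class FakeStorage {
  private rows: { [name: string]: FiberRow } = {};

  // 1. Register before any work runs; keep an existing row on re-entry.
  register(name: string): void {
    if (!this.rows[name]) this.rows[name] = { snapshot: null, done: false };
  }

  // 2. Save point: store a serialised copy of the snapshot.
  stash(name: string, snapshot: Snapshot): void {
    const row = this.rows[name];
    if (!row) throw new Error("fiber was never registered: " + name);
    row.snapshot = JSON.parse(JSON.stringify(snapshot)) as Snapshot;
  }

  complete(name: string): void {
    const row = this.rows[name];
    if (row) row.done = true;
  }

  // Fibers that were registered but never completed.
  unfinished(): { name: string; snapshot: Snapshot | null }[] {
    const out: { name: string; snapshot: Snapshot | null }[] = [];
    for (const name of Object.keys(this.rows)) {
      const row = this.rows[name];
      if (row && !row.done) out.push({ name, snapshot: row.snapshot });
    }
    return out;
  }
}

const TOTAL_STEPS = 5;

// Runs the loop from the snapshot (or from zero). `crashAt` simulates an
// eviction by throwing before that step executes.
function runResearch(store: FakeStorage, start: Snapshot | null, crashAt: number | null): string[] {
  store.register("research");
  const findings = start ? start.findings.slice() : [];
  const first = start ? start.nextStep : 0;
  for (let i = first; i < TOTAL_STEPS; i++) {
    if (i === crashAt) throw new Error("simulated eviction");
    findings.push("finding-" + i);
    store.stash("research", { findings, nextStep: i + 1 });
  }
  store.complete("research");
  return findings;
}

// 3. Recovery: hand the unfinished fiber its last snapshot and resume.
function recover(store: FakeStorage): string[] {
  const research = store.unfinished().filter((f) => f.name === "research")[0];
  return research ? runResearch(store, research.snapshot, null) : [];
}

const fiberStore = new FakeStorage();
let crashed = false;
try {
  runResearch(fiberStore, null, 3); // steps 0..2 complete, then "eviction"
} catch (_err) {
  crashed = true;
}
const resumed = recover(fiberStore); // continues at step 3
console.log(crashed, resumed);
```

Storing nextStep rather than the last completed index keeps the resume logic free of off-by-one adjustments; either convention works as long as the recovery path and the loop agree.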

The SDK transparently invokes keepAlive() during active fiber execution so the actor isn't evicted mid-step; for wall-clock-long work (hours-plus) the idiom is persist external job ID → hibernate → wake on callback rather than occupying the fiber's stack.

Contrast with workflow engines

Cloudflare Workflows, Temporal, AWS Step Functions: each step is automatically checkpointed (its input, output, determinism-critical randomness recorded in a durable event log) and the whole workflow is replayed on recovery, skipping already-executed steps.

Fiber shape: checkpoints are developer-chosen, not per-step. Replay is not automatic: the runtime hands the last snapshot to onFiberRecovered, and the developer decides what to re-invoke. Fibers give up the determinism discipline that workflow engines enforce and get lower ceremony in return: a fiber is a regular async function with ctx.stash() calls, not a set of per-step activity / task definitions.

Axis                   | Workflow engine (Temporal, Workflows) | Fiber (Project Think)
Checkpoint granularity | per-step, automatic                   | developer-chosen via stash()
Replay on recovery     | automatic (events replayed)           | manual (read snapshot, re-invoke)
Determinism contract   | enforced by engine                    | developer responsibility
Tier                   | separate orchestration tier           | co-located with agent actor
Ceremony               | task definitions, activities          | regular async function
Typical use            | multi-service orchestration           | agent-loop-scoped durable execution
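
For comparison, engine-style replay can be sketched as a memoised step log. This is a toy model under simplifying assumptions, not the API of Temporal, Cloudflare Workflows, or Step Functions; StepLog, step, and workflow are hypothetical names. The whole function re-runs on recovery, and any step whose output is already recorded is skipped.

```typescript
// Toy event log: step name -> recorded output. Hypothetical, for illustration.
type StepLog = { [stepName: string]: string };

let executions = 0; // counts how many step bodies actually ran

// Every step is checkpointed automatically; on replay the recorded output
// is returned without running the body again.
function step(log: StepLog, stepName: string, body: () => string): string {
  const recorded = log[stepName];
  if (recorded !== undefined) return recorded;
  executions++;
  const output = body();
  log[stepName] = output;
  return output;
}

function workflow(log: StepLog, crashAfter: string | null): string {
  const raw = step(log, "fetch", () => "raw");
  if (crashAfter === "fetch") throw new Error("simulated crash");
  const summary = step(log, "summarise", () => raw + "->summary");
  return step(log, "publish", () => summary + "->published");
}

const stepLog: StepLog = {};
try {
  workflow(stepLog, "fetch"); // "fetch" runs and is logged, then the crash
} catch (_err) {
  // An engine would schedule a replay here.
}
const replayResult = workflow(stepLog, null); // re-run from the top
console.log(replayResult, executions);
```

The replay is only safe because each step body returns the same output for the same recorded inputs, which is the determinism contract the engine enforces and a fiber leaves to the developer.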

Prerequisites

  • Actor substrate that can register the fiber before execution, persist checkpoints, and invoke the recovery handler. Durable Objects is the canonical wiki example; other actor runtimes (Orleans, Akka Cluster + Persistence) have similar primitives.
  • Embedded durable storage co-located with the actor — SQLite in DO, state-store in Orleans, event-sourced journal in Akka — otherwise fiber registrations / checkpoints incur an external round-trip.
  • Idempotence at save points. If a side-effecting operation runs just before a stash() call and the actor crashes before that stash is durable, recovery resumes from the previous snapshot and re-executes the side effect. The developer must either make such side effects idempotent or commit them in the same transaction as the checkpoint.
  • Client-side resumable streams for user-facing agent loops: the client SDK reconnects + resumes the stream rather than showing a crashed session.
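
One way to meet the idempotence prerequisite is a stable idempotency key per step. The sketch below assumes the external system deduplicates on that key; FakeMailer, runReport, and the key format are hypothetical, and the durable object stands in for whatever the last stash() persisted.

```typescript
// Hypothetical external system that deduplicates on an idempotency key.
class FakeMailer {
  private seen: { [key: string]: boolean } = {};
  sent: string[] = [];

  send(idempotencyKey: string, body: string): void {
    if (this.seen[idempotencyKey]) return; // duplicate delivery is a no-op
    this.seen[idempotencyKey] = true;
    this.sent.push(body);
  }
}

// Stand-in for the last durable snapshot.
type Durable = { nextStep: number };

// `crashBeforeStashAt` simulates dying after a step's side effect has
// happened but before its checkpoint was written.
function runReport(mailer: FakeMailer, durable: Durable, crashBeforeStashAt: number | null): void {
  for (let i = durable.nextStep; i < 3; i++) {
    // The key depends only on fiber name and step, so a retry reuses it.
    mailer.send("report-fiber:step-" + i, "report " + i);
    if (i === crashBeforeStashAt) throw new Error("simulated crash before stash");
    durable.nextStep = i + 1; // stands in for ctx.stash(...)
  }
}

const reportMailer = new FakeMailer();
const durableState: Durable = { nextStep: 0 };
try {
  runReport(reportMailer, durableState, 1); // step 1 sends, then crashes
} catch (_err) {
  // The actor restarts; only durableState survives.
}
runReport(reportMailer, durableState, null); // step 1 is retried, then step 2
console.log(reportMailer.sent);
```

Step 1 is attempted twice but delivered once, because the retry presents the same key.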

Design choices within the pattern

  1. Checkpoint frequency. Frequent stash() calls → cheap recovery + higher storage cost. Infrequent → long re-execution windows on crash.
  2. Snapshot size. Keep snapshots small — only the minimum needed to reconstruct progress. Large snapshots hit storage + deserialisation costs on every checkpoint.
  3. Keepalive posture. keepAlive() for unconditional, keepAliveWhile(fn) for conditional; the condition should become false when the fiber is genuinely waiting on an external callback so the actor can hibernate.
  4. External job IDs for hours-plus work. The fiber stashes an external_job_id, hibernates, and the actor wakes on alarm or external webhook + dispatches back into the fiber.
  5. Sibling fibers. Multiple named fibers in one actor — each independently checkpointed, resumed on restart, addressable for observability + cancellation.
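
Choices 3 and 4 combine into a small persisted state machine. The following is a sketch under stated assumptions rather than SDK code: FakeActor, JobState, shouldStayAwake, and onCallback are hypothetical names, and the state field stands in for a stashed snapshot that survives hibernation.

```typescript
// Hypothetical persisted state for one fiber.
type JobState =
  | { phase: "working" }
  | { phase: "waiting"; externalJobId: string }
  | { phase: "done"; output: string };

class FakeActor {
  // Stands in for durable storage; it is all that survives hibernation.
  state: JobState = { phase: "working" };

  // Submit the external job, record its ID, and return so that nothing
  // stays on the stack while the job runs elsewhere.
  start(submit: () => string): void {
    const externalJobId = submit();
    this.state = { phase: "waiting", externalJobId };
  }

  // The kind of predicate a conditional keep-alive could take: false while
  // waiting on the external system, so the actor is free to hibernate.
  shouldStayAwake(): boolean {
    return this.state.phase === "working";
  }

  // Webhook or alarm entry point. Callbacks for unknown jobs are ignored.
  onCallback(jobId: string, output: string): boolean {
    const current = this.state;
    if (current.phase !== "waiting" || current.externalJobId !== jobId) return false;
    this.state = { phase: "done", output };
    return true;
  }
}

const actor = new FakeActor();
const awakeBefore = actor.shouldStayAwake();
actor.start(() => "job-123"); // hypothetical external job ID
const awakeWhileWaiting = actor.shouldStayAwake();
const staleAccepted = actor.onCallback("job-999", "ignored");
const accepted = actor.onCallback("job-123", "report ready");
console.log(awakeBefore, awakeWhileWaiting, staleAccepted, accepted, actor.state.phase);
```

Matching the callback against the stored job ID guards against late or duplicate webhooks from an earlier attempt.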

When the pattern fits

  • Long-running agent work: multi-turn research loops, multi-minute code-generation sessions, multi-hour pipelines.
  • Work that accumulates substantial intermediate state.
  • Work that interacts with slow external systems (LLM provider APIs, CI systems, human review).

When it doesn't

  • Work short enough to retry-on-failure end-to-end (< a few seconds).
  • Multi-service orchestration with strict step-determinism requirements — use a workflow engine with automatic per-step checkpointing + replay.
  • Work whose side effects are not easily made idempotent; fibers don't give you the deterministic-replay guarantee.

Seen in
