CONCEPT Cited by 2 sources
Durable execution¶
Durable execution is the property of a long-running computation (multi-minute LLM loops, multi-hour CI pipelines, multi-day workflows) that it survives any interruption of its host environment — process crash, platform restart, deploy, resource-limit eviction — and resumes from a known safe point without the caller observing failure, and ideally without losing committed intermediate results.
The general shape: every state-changing step is recorded in durable storage before execution begins (registered + sequenced), checkpointed at explicit save points, and replayed from the last checkpoint on recovery. Effectively write-ahead logging (concepts/wal-write-ahead-logging) applied to a unit of user-space execution.
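The register → execute → commit → replay cycle can be sketched in a few lines. This is a minimal illustration, assuming an in-memory Map stands in for durable storage; the names (DurableLog, runStep) are illustrative, not any real engine's API.

```typescript
type StepRecord = { committed: boolean; result?: unknown };

class DurableLog {
  private steps = new Map<string, StepRecord>();

  // Write-ahead: register the step before executing, commit its result after.
  async runStep<T>(id: string, fn: () => Promise<T>): Promise<T> {
    const prior = this.steps.get(id);
    if (prior?.committed) {
      return prior.result as T; // recovery: replay the logged result
    }
    this.steps.set(id, { committed: false });        // registered + sequenced
    const result = await fn();                       // run the actual effect
    this.steps.set(id, { committed: true, result }); // checkpoint
    return result;
  }
}

// After a crash, re-running the same step sequence replays committed steps
// from the log instead of re-invoking them.
async function demo(): Promise<[number, number]> {
  const log = new DurableLog();
  const first = await log.runStep("charge-card", async () => 42);
  // A "recovered" re-run of the same step returns the logged 42, not 99.
  const second = await log.runStep("charge-card", async () => 99);
  return [first, second];
}
```

The key property is the ordering: the step is registered before its effect runs, so a crash between registration and commit leaves evidence that the step was in flight.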
Why agents force this problem¶
"An LLM call takes 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish: a deploy, a platform restart, hitting resource limits. The upstream connection to the model provider is severed permanently, in-memory state is lost, and connected clients see the stream stop with no explanation." (Source: Cloudflare Project Think.)
Agents compound two properties that stateless request handlers don't share:
- Long wall-clock runtime per unit of work — not from heavy computation, but because each LLM call takes seconds and each tool call potentially longer.
- Accumulated multi-step state — mid-loop findings, partial artifacts, tool-call results — that cannot be easily reconstructed from inputs alone.
Together they make "restart from scratch on crash" prohibitively expensive. Durable execution makes the crash invisible to the user (and to the downstream LLM, if the platform hides the reconnection).
Two implementation shapes¶
Step-sequencer / workflow engine¶
A separate orchestration tier — Temporal, AWS Step Functions, Cadence, Cloudflare Workflows — records each step's input, output, and determinism-critical randomness in a durable event log. On recovery, the workflow re-runs from the top: already-executed steps are replayed from the log rather than re-invoked, and live execution resumes at the first non-committed step.
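A sketch of the replay rule, assuming an append-only in-memory event log; the names (WorkflowRun, step, recover) are illustrative, not Temporal's or any other engine's API. Note that randomness is itself recorded as a logged step, which is what keeps replay deterministic:

```typescript
type LogEvent = { step: number; output: unknown };

class WorkflowRun {
  private log: LogEvent[] = [];
  private cursor = 0;

  // Simulated crash + restart: re-run from the top, keeping the durable log.
  recover() { this.cursor = 0; }

  async step<T>(fn: () => Promise<T>): Promise<T> {
    const i = this.cursor++;
    if (i < this.log.length) {
      return this.log[i].output as T; // replayed from log, not re-invoked
    }
    const output = await fn();         // first non-committed step: re-invoke
    this.log.push({ step: i, output }); // commit to the durable event log
    return output;
  }

  // Determinism-critical randomness is recorded like any other step.
  random(): Promise<number> { return this.step(async () => Math.random()); }
}
```

On replay, the workflow function must take the same path through the code, which is why engines of this shape enforce a determinism contract on workflow bodies.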
Fiber / co-routine-in-actor¶
The durable-execution primitive lives inside the agent's own actor (a Durable Object for Cloudflare), not in a separate orchestration tier. Project Think's runFiber("name", async (ctx) => { … }) registers the fiber in the DO's SQLite before execution begins; ctx.stash({ … }) checkpoints arbitrary user-defined state; onFiberRecovered is called with the last stashed snapshot on restart. No separate event log, no separate worker process — the actor is the execution unit. See patterns/checkpoint-resumable-fiber.
The fiber shape trades step-level determinism enforcement (Temporal's hard rule) for lower ceremony — a regular async function with occasional ctx.stash() calls becomes durable. The developer chooses what to checkpoint and what to re-compute; the engine doesn't enforce a determinism contract.
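A minimal re-implementation of the fiber shape in plain TypeScript, to make the stash/recover interplay concrete. An in-memory Map stands in for the Durable Object's SQLite, and FiberHost is illustrative only — this is not the Project Think SDK:

```typescript
type Snapshot = Record<string, unknown>;
type FiberCtx = { stash(s: Snapshot): void; recovered?: Snapshot };

class FiberHost {
  private stashed = new Map<string, Snapshot>();

  async runFiber(name: string, body: (ctx: FiberCtx) => Promise<void>) {
    await body({
      recovered: this.stashed.get(name),       // last checkpoint, if any
      stash: (s) => this.stashed.set(name, s), // checkpoint user-defined state
    });
  }
}

// A regular async loop becomes durable with occasional stash() calls.
const host = new FiberHost();
const processed: number[] = [];

async function agentLoop(crashAt?: number) {
  await host.runFiber("agent", async (ctx) => {
    let i = ctx.recovered ? (ctx.recovered.i as number) : 0; // resume point
    for (; i < 5; i++) {
      if (i === crashAt) throw new Error("simulated eviction");
      processed.push(i);       // the step's committed work
      ctx.stash({ i: i + 1 }); // checkpoint after each completed step
    }
  });
}
```

The tradeoff shows in the loop body: nothing forces the developer to stash after each iteration, so a forgotten checkpoint silently widens the replay window.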
Design axes¶
- Save-point granularity. Per-step (workflow engines) vs developer-chosen (ctx.stash in Project Think's fibers). Per-step reduces replay cost; developer-chosen reduces ceremony.
- Keepalive discipline. If the platform evicts hibernating actors during active work, the fiber must explicitly keepAlive()/keepAliveWhile(cond) to prevent eviction during a synchronous long-running segment. Project Think bakes this into the fiber primitive.
- Long-callback shape. For work that spans hours or days (video generation, CI pipelines, human review), the idiom is: start the work → persist the external job ID → hibernate → wake on callback. The durable-execution engine doesn't hold the compute; it holds the bookmark.
- Observability during replay. Clients watching a fiber stream (WebSockets, SSE) see a gap on crash + reconnect vs a seamless resume. Resumable streams (Project Think: "resumable streams") are the client-side half of the durable-execution story.
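The long-callback shape from the axes above can be sketched as follows. All names here (Bookmark, LongCallbackFiber, startExternalJob) are hypothetical, chosen for illustration:

```typescript
type Bookmark = { jobId: string; status: "running" | "done"; result?: string };

class LongCallbackFiber {
  // Stands in for a durably persisted row; survives "hibernation".
  private bookmark?: Bookmark;

  // Start the external work, persist its ID, then hibernate (return).
  start(startExternalJob: () => string) {
    this.bookmark = { jobId: startExternalJob(), status: "running" };
  }

  // Hours or days later: the callback wakes the actor, which matches the
  // job ID against the persisted bookmark.
  onCallback(jobId: string, result: string): boolean {
    if (this.bookmark?.jobId !== jobId) return false; // stale or unknown job
    this.bookmark = { jobId, status: "done", result };
    return true;
  }

  state() { return this.bookmark; }
}
```

The engine's only durable obligation between start and callback is the bookmark itself — which is why this shape scales to multi-day work without holding any compute.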
Cost framing¶
- Storage cost — fiber registrations + checkpoints + event logs persist in SQLite (DO) / a dedicated log store. Per-checkpoint cost is small but sums with fiber volume.
- Replay cost — if save points are coarse, crash replay re-executes many steps; side-effectful replays must be idempotent.
- Developer cost — writing checkpoint-friendly code is a discipline; forgetting to stash a crucial piece of state makes recovery incomplete. Workflow engines enforce this via determinism; Project Think doesn't.
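The replay-cost bullet hinges on idempotency: with coarse save points, a crash replay re-executes steps whose side effects may already have fired. One common guard is an idempotency key checked before the effect runs — a sketch with a hypothetical sendEmail effect, not a real API:

```typescript
class IdempotentEffects {
  private done = new Set<string>();
  sent: string[] = [];

  sendEmail(key: string, to: string) {
    if (this.done.has(key)) return; // replayed step: effect already committed
    this.sent.push(to);             // the real side effect would go here
    this.done.add(key);
  }
}
```

For this guard to work across crashes, the key set itself must live in the same durable store (and ideally the same transaction) as the checkpoint.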
Contrast with retry-on-failure¶
Retry is "restart the whole operation" — acceptable when the operation is short + stateless. Durable execution is "resume from mid-operation" — necessary when the operation is long or has committed intermediate state (sent emails, created tickets, paid for tokens).
Seen in¶
- sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents — canonical wiki instance. The runFiber()/stash()/onFiberRecovered primitive introduced in the Project Think SDK. The "30-second LLM call → 30-minute loop → platform restart" framing is the motivating example.
- sources/2025-02-12-flyio-the-exit-interview-jp-phillips — durable execution as orchestrator state-machine substrate. JP Phillips on why flyd needs it: "Once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design. … One of the biggest gains, with how it works in flyd, is knowing we would need to deploy flyd all day, every day. If flyd was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy." flyd's specific embodiment: per-FSM-step records in a BoltDB database, lineage-linked to Cadence at HashiCorp and Compose.io/MongoHQ "recipes." Durable execution is not just an LLM-agent concern — it's the load-bearing property of every deploy-tolerant orchestrator.
Related¶
- systems/project-think — fibers + sessions + durable execution baked into the agent SDK.
- systems/cloudflare-durable-objects — the actor + SQLite substrate fiber registrations live in.
- systems/cloudflare-workflows — sibling durable-execution tier at the orchestration layer (not agent-loop scope).
- systems/temporal — step-sequencer exemplar (separate orchestration process, strict determinism).
- systems/cadence — Temporal's predecessor and direct ancestor of the flyd FSM design.
- systems/flyd — canonical orchestrator-level wiki instance of durable execution (per-FSM-step BoltDB records for Fly Machine lifecycle operations).
- patterns/checkpoint-resumable-fiber — the fiber-side implementation pattern.
- concepts/wal-write-ahead-logging — the underlying durability discipline applied to user-space execution.
- concepts/logless-reconfiguration — sibling durability property at a different granularity (consensus reconfiguration).