CONCEPT Cited by 7 sources
Durable execution¶
Durable execution is the property of a long-running computation (multi-minute LLM loops, multi-hour CI pipelines, multi-day workflows) that it survives any interruption of its host environment — process crash, platform restart, deploy, resource-limit eviction — and resumes from a known safe point without the caller observing failure, and ideally without losing committed intermediate results.
The general shape: every state-changing step is recorded in durable storage before execution begins (registered + sequenced), checkpointed at explicit save points, and replayed from the last checkpoint on recovery. This is effectively write-ahead logging (concepts/wal-write-ahead-logging) applied to a unit of user-space execution.
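The register, checkpoint, and replay cycle can be sketched in a few lines. This is a minimal illustration, not any engine's implementation: `StepJournal`, `runSteps`, and the `crashAt` parameter are hypothetical names, and the journal is an in-memory array standing in for durable storage.

```typescript
// Hypothetical in-memory stand-in for durable storage; a real system would
// write these entries to a database or log before running each step.
type JournalEntry = { seq: number; step: string; done: boolean };

class StepJournal {
  readonly entries: JournalEntry[] = [];

  // Register and sequence the step BEFORE it executes (the write-ahead part).
  begin(step: string): JournalEntry {
    const entry: JournalEntry = { seq: this.entries.length, step, done: false };
    this.entries.push(entry);
    return entry;
  }

  // Explicit save point: the step's effects are now considered committed.
  checkpoint(entry: JournalEntry): void {
    entry.done = true;
  }

  // Recovery point: the number of steps already checkpointed.
  resumeIndex(): number {
    return this.entries.filter((e) => e.done).length;
  }
}

// Runs the remaining steps; `crashAt` simulates the host vanishing mid-step
// (pass -1 for no crash).
function runSteps(steps: string[], journal: StepJournal, crashAt: number): string[] {
  const executed: string[] = [];
  const start = journal.resumeIndex();
  steps.slice(start).forEach((step, offset) => {
    const entry = journal.begin(step);
    if (start + offset === crashAt) throw new Error("simulated crash");
    executed.push(step);
    journal.checkpoint(entry);
  });
  return executed;
}
```

Calling `runSteps` a second time with the same journal after a simulated crash skips every checkpointed step and re-runs only the one that was registered but never committed.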
Why agents force this problem¶
"An LLM call takes 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish: a deploy, a platform restart, hitting resource limits. The upstream connection to the model provider is severed permanently, in-memory state is lost, and connected clients see the stream stop with no explanation." (Source: Cloudflare Project Think.)
Agents compound two properties that stateless request handlers don't share:
- Long wall-clock runtime per unit of work — not because of computation, but because each LLM call is seconds and each tool call is potentially more.
- Accumulated multi-step state — mid-loop findings, partial artifacts, tool-call results — that cannot be easily reconstructed from inputs alone.
Together they make "restart from scratch on crash" prohibitively expensive. Durable execution makes the crash invisible to the user (and to the upstream model provider, if the platform hides the reconnection).
Four implementation shapes¶
Step-sequencer / workflow engine¶
A separate orchestration tier — Temporal, AWS Step Functions, Cadence, Cloudflare Workflows — records each step's input + output + determinism-critical randomness in a durable event log. Replay on recovery: re-run from the first non-committed step, with "already-executed" steps replayed from log rather than re-invoked.
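A rough sketch of log-based replay, assuming a deterministic workflow body whose steps are keyed by name. `ToyEventLog` and `orderWorkflow` are hypothetical and do not reflect the API of Temporal, Step Functions, Cadence, or Cloudflare Workflows.

```typescript
// Hypothetical event log; real engines persist step outputs durably.
class ToyEventLog {
  private outputs: { [step: string]: string | undefined } = {};
  readonly invoked: string[] = [];

  // Run a step; on replay, return the logged output instead of re-invoking it.
  step(name: string, fn: () => string): string {
    const logged = this.outputs[name];
    if (logged !== undefined) return logged; // replayed from the log
    this.invoked.push(name);
    const output = fn();
    this.outputs[name] = output; // committed only after the step returns
    return output;
  }
}

// A two-step workflow body; `crashInShip` simulates a failure before commit.
function orderWorkflow(log: ToyEventLog, crashInShip: boolean): string {
  const chargeId = log.step("charge", () => "charge-1");
  return log.step("ship", () => {
    if (crashInShip) throw new Error("simulated crash");
    return "shipped:" + chargeId;
  });
}
```

Re-running `orderWorkflow` against the same log after a crash in `ship` re-invokes only `ship`; `charge` is answered from the log. This is also why such engines need the body to be deterministic: the replay must reach the same step names in the same order.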
Fiber / co-routine-in-actor¶
The durable-execution primitive lives inside the agent's own
actor (a Durable Object
for Cloudflare), not in a separate orchestration tier. Project
Think's runFiber("name", async (ctx) => { … }) registers the
fiber in the DO's SQLite before execution begins; ctx.stash({
… }) checkpoints arbitrary user-defined state; onFiberRecovered
is called with the last stashed snapshot on restart. No separate
event log, no separate worker process — the actor is the
execution unit. See patterns/checkpoint-resumable-fiber.
The fiber shape trades step-level determinism enforcement
(Temporal's hard rule) for lower ceremony — a regular
async function with occasional ctx.stash() calls becomes
durable. The developer chooses what to checkpoint and what to
re-compute; the engine doesn't enforce a determinism contract.
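The fiber shape can be modelled with a toy actor. This is a synchronous, in-memory sketch and not the Project Think SDK: only the names `runFiber`, `stash`, and `onFiberRecovered` echo the text above, while `ToyActor`, `CrawlSnapshot`, `crawl`, and `crashAtPage` are hypothetical.

```typescript
type CrawlSnapshot = { nextPage: number };
type ToyFiberCtx = { stash: (snapshot: CrawlSnapshot) => void };

// Hypothetical in-memory model of an actor's durable table.
// This is NOT the Project Think SDK; only the method names echo the text above.
class ToyActor {
  private unfinished: string[] = [];
  private stashed: { [fiber: string]: CrawlSnapshot | undefined } = {};

  runFiber(name: string, body: (ctx: ToyFiberCtx) => void): void {
    this.unfinished.push(name); // registered before the body starts
    body({
      stash: (snapshot) => {
        this.stashed[name] = { nextPage: snapshot.nextPage }; // checkpoint a copy
      },
    });
    // Reached only if the body finished without throwing.
    this.unfinished = this.unfinished.filter((n) => n !== name);
  }

  // After a restart, hand each unfinished fiber its last stashed snapshot.
  recover(onFiberRecovered: (name: string, snapshot: CrawlSnapshot | undefined) => void): void {
    this.unfinished.slice().forEach((name) => onFiberRecovered(name, this.stashed[name]));
  }
}

// Visits pages startPage..4; `crashAtPage` simulates an eviction (-1 = none).
function crawl(actor: ToyActor, startPage: number, crashAtPage: number, visited: number[]): void {
  actor.runFiber("crawl", (ctx) => {
    for (let page = startPage; page < 5; page++) {
      if (page === crashAtPage) throw new Error("simulated eviction");
      visited.push(page);
      ctx.stash({ nextPage: page + 1 });
    }
  });
}
```

The developer, not the engine, decides that `nextPage` is the state worth checkpointing. Anything not stashed has to be recomputed after recovery.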
Embedded library-in-service¶
The workflow engine is a library dependency of the host
service, not a separate cluster and not an actor inside an
actor substrate. Airbnb's Skipper
canonicalises this shape: workflows and actions are plain
Java/Kotlin classes with annotations
(@WorkflowMethod / @StateField / @SignalMethod /
@Execute(checkpoint = true) / @Compensate), state lives in
the host service's existing database (MySQL / UDS / DynamoDB),
and the engine runs in-process on a dedicated thread pool. See
concepts/embedded-workflow-engine +
patterns/workflow-primitives-as-annotated-classes.
Durability uses state-field replay instead of event-history replay — leaner but with weaker auditability. The happy-path overhead is near zero via the delayed timeout task pattern: 2 DB writes at workflow start, batched action checkpoints, and a scheduled safety-net task that fires harmlessly on completion or triggers replay on crash, with no coordinator round-trips per activity. Trade-offs: no cross-language support, no cross-service orchestration.
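The delayed timeout task idea can be illustrated with a logical clock. This sketch is not Skipper's code: `ToyWorkflowStore` and its methods are hypothetical, and treating the two start-time writes as "workflow row plus safety-net task" is an assumption, since the source only gives the count.

```typescript
type WorkflowRow = { id: string; completed: boolean };
type TimeoutTask = { workflowId: string; fireAt: number };

// Hypothetical in-memory stand-in for tables in the host service's database.
class ToyWorkflowStore {
  private rows: { [id: string]: WorkflowRow | undefined } = {};
  private tasks: TimeoutTask[] = [];
  writes = 0;

  startWorkflow(id: string, now: number, leaseMs: number): void {
    this.rows[id] = { id, completed: false }; // write 1 (assumed): workflow state
    this.writes++;
    this.tasks.push({ workflowId: id, fireAt: now + leaseMs }); // write 2 (assumed): safety-net task
    this.writes++;
  }

  complete(id: string): void {
    const row = this.rows[id];
    if (row !== undefined) row.completed = true;
  }

  // Scheduler tick: fire due tasks and return the workflows that need replay.
  // A task for a completed workflow fires harmlessly and is dropped.
  fireDue(now: number): string[] {
    const due = this.tasks.filter((t) => t.fireAt <= now);
    this.tasks = this.tasks.filter((t) => t.fireAt > now);
    return due
      .filter((t) => {
        const row = this.rows[t.workflowId];
        return row !== undefined && !row.completed;
      })
      .map((t) => t.workflowId);
  }
}
```

On the happy path the safety-net task finds a completed workflow and does nothing, so no coordinator is consulted between actions. Only a workflow whose host died before completion is returned for replay.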
Platform-hosted engine + dispatched per-tenant code¶
Cloudflare's Dynamic
Workflows (2026-05-01) makes explicit a variant that the three
shapes above left implicit: the platform hosts the workflow
engine (Workflows V2), but the run(event, step) body itself
is dispatched per tenant at invocation time rather than
statically bound at deploy time. A single Worker Loader routes
every create() call and every subsequent run() invocation
into the right tenant's code, loaded on demand as a
Dynamic Worker and cached by
tenant ID.
This variant sits between the step-sequencer and fiber shapes:
the durability machinery is the engine's (IDs, step.sleep(),
step.waitForEvent(), retries, hibernation, replay); but the
body being run is per-tenant, versioned in the tenant's repo,
loaded as code on each step boundary. Canonicalised in:
- concepts/per-tenant-dynamic-code-dispatch — the three-layer engine / dispatcher / tenant shape.
- concepts/envelope-wrap-and-unwrap-metadata-routing — the wire-format technique that threads routing metadata through the engine's persisted payload, surviving sleep / crash / redeploy.
- concepts/byo-workflow-per-tenant — the customer-ships-the-workflow-body mental model.
- patterns/dynamic-binding-over-static-binding — the general platform-design shape; Dynamic Workflows is the durable-execution instance.
- patterns/ci-pipeline-as-customer-authored-durable-workflow — the canonical showcase (each repo ships .cloudflare/ci.ts with its own CIPipeline class).
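The envelope-and-dispatch idea can be sketched as follows. This is an illustrative model only: `ToyEnvelope`, `wrapForTenant`, and `ToyDispatcher` are hypothetical and are not the Dynamic Workflows library's API (the source names `wrapWorkflowBinding` and `createDynamicWorkflowEntrypoint` but does not show their signatures).

```typescript
// Routing metadata travels inside the payload the engine persists,
// so it is still there after a sleep, crash, or redeploy.
type ToyEnvelope = { tenantId: string; payload: string };
type TenantRun = (payload: string) => string;

function wrapForTenant(tenantId: string, payload: string): ToyEnvelope {
  return { tenantId: tenantId, payload: payload };
}

// Hypothetical dispatcher: unwraps the envelope, loads the tenant's code on
// demand, and caches it by tenant ID.
class ToyDispatcher {
  private cache: { [tenantId: string]: TenantRun | undefined } = {};
  private loadTenantCode: (tenantId: string) => TenantRun;
  loads = 0;

  constructor(loadTenantCode: (tenantId: string) => TenantRun) {
    this.loadTenantCode = loadTenantCode;
  }

  run(envelope: ToyEnvelope): string {
    let code = this.cache[envelope.tenantId];
    if (code === undefined) {
      code = this.loadTenantCode(envelope.tenantId); // loaded on first use
      this.cache[envelope.tenantId] = code;
      this.loads++;
    }
    return code(envelope.payload); // the tenant body sees only its own payload
  }
}
```

The engine never needs to know which tenant a run belongs to. It persists and replays an opaque payload, and the dispatcher recovers the routing decision from the envelope each time.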
Workflows V2 per-account capacity disclosed with the launch: up to 50,000 concurrent workflow instances and 300 new instances per second. (Source: Cloudflare Dynamic Workflows.)
Design axes¶
- Save-point granularity. Per-step (workflow engines) vs developer-chosen (ctx.stash in Project Think's fibers). Per-step reduces replay cost; developer-chosen reduces ceremony.
- Keepalive discipline. If the platform evicts hibernating actors during active work, the fiber must explicitly call keepAlive() / keepAliveWhile(cond) to prevent eviction during a synchronous long-running segment. Project Think bakes this into the fiber primitive.
- Long-callback shape. For work that spans hours or days (video generation, CI pipelines, human review), the idiom is start the work → persist the external job ID → hibernate → wake on callback. The durable execution engine doesn't hold the compute; it holds the bookmark.
- Observability during replay. Clients watching a fiber stream (WebSockets, SSE) see a gap on crash + reconnect vs a seamless resume. Resumable streams (Project Think: "resumable streams") are the client-side half of the durable-execution story.
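The long-callback idiom from the list above (start the work, persist the external job ID, hibernate, wake on callback) might look like this in miniature. `ToyBookmarkStore` and its methods are hypothetical, and the array stands in for durable storage.

```typescript
type Bookmark = { workflowId: string; jobId: string; result: string | undefined };

// Hypothetical store: the engine holds the bookmark, not the compute.
class ToyBookmarkStore {
  private bookmarks: Bookmark[] = [];

  startAndHibernate(workflowId: string, startJob: () => string): string {
    const jobId = startJob(); // 1. start the external work
    this.bookmarks.push({ workflowId: workflowId, jobId: jobId, result: undefined }); // 2. persist the job ID
    return jobId; // 3. return; nothing stays running while the job is in flight
  }

  // 4. Wake on callback: find the waiting workflow for this job ID.
  // Returns the workflow to resume, or undefined for unknown or duplicate callbacks.
  wakeOnCallback(jobId: string, result: string): string | undefined {
    const waiting = this.bookmarks.filter((b) => b.jobId === jobId && b.result === undefined);
    const bookmark = waiting[0];
    if (bookmark === undefined) return undefined;
    bookmark.result = result;
    return bookmark.workflowId;
  }
}
```

Ignoring a second callback for the same job ID is a design choice in this sketch; it keeps a retried webhook from resuming the workflow twice.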
Cost framing¶
- Storage cost — fiber registrations + checkpoints + event logs persist in SQLite (DO) / a dedicated log store. Per-checkpoint cost is small but sums with fiber volume.
- Replay cost — if save points are coarse, crash replay re-executes many steps; side-effectful replays must be idempotent.
- Developer cost — writing checkpoint-friendly code is a discipline; forgetting to stash a crucial piece of state makes recovery incomplete. Workflow engines enforce this via determinism; Project Think doesn't.
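The idempotency requirement on replayed side effects is commonly met with an idempotency key. The sketch below assumes a key derived from the workflow ID and step sequence number; `ToyEmailGateway` and `notifyStep` are hypothetical names, not part of any system described here.

```typescript
// Hypothetical downstream service that deduplicates by idempotency key.
class ToyEmailGateway {
  readonly sent: string[] = [];
  private seenKeys: { [key: string]: boolean | undefined } = {};

  // Delivers at most once per key; a replay with the same key is a no-op.
  send(idempotencyKey: string, to: string): boolean {
    if (this.seenKeys[idempotencyKey] === true) return false;
    this.seenKeys[idempotencyKey] = true;
    this.sent.push(to);
    return true;
  }
}

// The key is stable across replays because it depends only on the workflow
// and the step's position, never on wall-clock time or randomness.
function notifyStep(gateway: ToyEmailGateway, workflowId: string, stepSeq: number, to: string): boolean {
  return gateway.send(workflowId + ":" + stepSeq, to);
}
```

With this in place, a coarse save point that forces the notify step to run again after a crash does not produce a second email.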
Contrast with retry-on-failure¶
Retry is "restart the whole operation" — acceptable when the operation is short + stateless. Durable execution is "resume from mid-operation" — necessary when the operation is long or has committed intermediate state (sent emails, created tickets, paid for tokens).
Seen in¶
- sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine — canonical wiki instance of durable execution packaged as an embedded library (see concepts/embedded-workflow-engine), not a separate orchestration cluster. Airbnb's Skipper exposes durable execution via plain Java/Kotlin classes with a 5-annotation contract (@WorkflowMethod, @StateField, @SignalMethod, @Execute(checkpoint = true), @Compensate), shares the host service's database for workflow state, and runs entirely in-process on a dedicated thread pool. The durability mechanism is replay from checkpointed actions — state fields persisted directly, no event log; previously completed actions short-circuit on replay via stored results. The delayed timeout task is the happy-path zero-overhead mechanism: ~2 DB writes at start, batched checkpoints, no coordinator round-trips per activity, scheduler picks up crashed workflows after a lease period. Load-bearing operational numbers: 15+ production use cases across Tier 0 services (insurance, payments, media, infrastructure, incentives, wallet), peak 10,000 workflows per second on DynamoDB. The framing distinguishes durable execution at three deployment shapes on the wiki: Temporal-class external cluster + event history, Project Think-class fiber-in-actor via Durable Object, and now Skipper-class library-in-service via annotated classes. Airbnb's explicit rejection of external clusters for Tier 0 services ("adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows") makes this the first wiki-canonical production-scale instance of the embedded shape.
- sources/2025-04-03-redpanda-autonomy-is-the-future-of-infrastructure — Alex Gallego's (Redpanda founder) founder-voice positioning of Redpanda as durable-log substrate for enterprise AI agents: the systems/redpanda-agents-sdk explicitly takes durable execution as its load-bearing design focus, with Redpanda as the distributed log underwriting agent-to-agent communication, trace capture, evaluation replay, logs, metrics, collaborative threads, message sampling, analytics, explainability of actions, and time-travel debugging. Load-bearing framing: "Distributed log - Redpanda storage for durable execution, human-in-the-loop workflows, agent-to-agent communication, trace capture, evaluation replay, logs, metrics, collaborative threads, message sampling, analytics, explainability of actions, time travel debugging, etc." Canonical wiki statement pairing durable execution with log-is-truth as its substrate. Mechanism depth is deferred (commit cadence, recovery RPO, replay-correctness model not disclosed), positioning the post as a design-intent disclosure rather than an internals walkthrough.
-
— canonical wiki disclosure of Temporal's own rehydration mechanism in its own words: "Temporal captures the progress of a workflow execution (or workflow steps) in a log called the history. In case of a crash, Temporal rehydrates the workflow; that is, Temporal restarts the workflow execution, deduplicates the invocation of all activities that have already been executed, and catches up to where it previously left off." Event history = append-only WAL; replay + activity dedup = recovery path. This is the concrete Temporal-specific embodiment of durable execution at the step-sequencer / workflow-engine altitude, complementary to the fiber / co-routine shape canonicalised by Cloudflare Project Think. Savannah Longoria (2022-07-22) Part 1 of a PlanetScale product-pairing tutorial; the post also canonicalises Temporal's four-subsystem cluster decomposition and the SQL-vs-NoSQL persistence-layer trade-off (see concepts/temporal-persistence-layer).
- sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents — canonical wiki instance. The runFiber() / stash() / onFiberRecovered primitive is introduced in the Project Think SDK. The "30-second LLM call → 30-minute loop → platform restart" framing is the motivating example.
- sources/2025-02-12-flyio-the-exit-interview-jp-phillips — durable execution as orchestrator state-machine substrate. JP Phillips on why flyd needs it: "Once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design. … One of the biggest gains, with how it works in flyd, is knowing we would need to deploy flyd all day, every day. If flyd was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy." flyd's specific embodiment: per-FSM-step records in a BoltDB database, lineage-linked to Cadence at HashiCorp and Compose.io/MongoHQ "recipes." Durable execution is not just an LLM-agent concern — it's the load-bearing property of every deploy-tolerant orchestrator.
- sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — durable execution as cost-protection for LLM batch jobs. Instacart's Maple builds on Temporal: "Running batch jobs at this scale means errors are inevitable — network issues, provider failures, or bugs can happen mid-run. We use the Temporal durable execution engine to ensure that jobs can resume exactly where they left off without losing any work. This not only protects against data loss but also avoids wasting money on partially completed jobs." Sharpens the motivation beyond agent-loop-crash-recovery: LLM batch APIs bill on submit not on complete, so losing mid-pipeline state doesn't just mean redoing the work, it means re-paying for already-submitted batches. Canonical instance of durable execution as financial risk-reduction, not just reliability primitive.
- sources/2026-05-01-cloudflare-introducing-dynamic-workflows-durable-execution-that-follows-the-tenant — canonical wiki instance of the fourth shape: platform-hosted engine with dispatched per-tenant code. Cloudflare's Dynamic Workflows library lets a single Worker Loader route every create() and every subsequent run(event, step) invocation into the right tenant's code — loaded as a Dynamic Worker at runtime, cached by tenant ID, evicted when idle. The Workflows engine stays unchanged; the library is ~300 lines of envelope-and-unwrap glue (wrapWorkflowBinding({ tenantId }) on outbound + createDynamicWorkflowEntrypoint on inbound). Structural distinction from the step-sequencer shape: the run() body is per-tenant and dynamic, not statically bound at deploy. Structural distinction from the fiber shape: the durability machinery is the engine's (not co-located in the agent's own actor). First wiki source disclosing the Workflows V2 capacity envelope (50,000 concurrent instances per account, 300 new instances per second). CI/CD showcase (per-repo CIPipeline extends WorkflowEntrypoint in .cloudflare/ci.ts) is the canonical end-to-end realisation; composes with Artifacts + ArtifactFS for the workspace, Dynamic Workers for each lightweight step, and Sandboxes for heavy corners (docker build, integration suites, Rust compiles). Pre-announces the same dynamic-binding pattern for queues, caches, databases, object stores, AI bindings, and MCP servers. Canonicalises concepts/per-tenant-dynamic-code-dispatch and concepts/envelope-wrap-and-unwrap-metadata-routing alongside patterns/dynamic-binding-over-static-binding.
- sources/2026-06-17-cloudflare-bringing-more-agent-harnesses-and-frameworks-to-cloudflare — extends the fibers primitive disclosure: positions runFiber() / stash() / onFiberRecovered() as the runtime-level durable-execution mechanism that any harness (Project Think, Pi) or framework (Flue) can build on. Flue adds a framework-level complement: Durable Streams (append-only event log) for the same crash-recovery goal at a higher abstraction. Operational numbers: Code Mode isolate cold start <10 ms, $0.002/load.
Related¶
- systems/project-think — fibers + sessions + durable execution baked into the agent SDK.
- systems/cloudflare-durable-objects — the actor + SQLite substrate fiber registrations live in.
- systems/cloudflare-workflows — sibling durable-execution tier at the orchestration layer (not agent-loop scope).
- systems/temporal — step-sequencer exemplar (separate orchestration process, strict determinism).
- systems/cadence — Temporal's predecessor and direct ancestor of the flyd FSM design.
- systems/flyd — canonical orchestrator-level wiki instance of durable execution (per-FSM-step BoltDB records for Fly Machine lifecycle operations).
- systems/maple-instacart — canonical cost-sensitive LLM batch pipeline wiki instance (Temporal-based).
- patterns/checkpoint-resumable-fiber — the fiber-side implementation pattern.
- patterns/llm-batch-processing-service — LLM-batch specialisation.
- concepts/llm-batch-api — the API surface whose bill-on-submit semantics amplifies durable-execution's value.
- concepts/wal-write-ahead-logging — the underlying durability discipline applied to user-space execution.
- concepts/logless-reconfiguration — sibling durability property at a different granularity (consensus reconfiguration).