
CONCEPT Cited by 7 sources

Durable execution

Durable execution is the property of a long-running computation (multi-minute LLM loops, multi-hour CI pipelines, multi-day workflows) that it survives any interruption of its host environment — process crash, platform restart, deploy, resource-limit eviction — and resumes from a known safe point without the caller observing failure, and ideally without losing committed intermediate results.

The general shape: every state-changing step is recorded in durable storage before execution begins (registered + sequenced), checkpointed at explicit save points, and replayed from the last checkpoint on recovery. Effectively write-ahead logging (concepts/wal-write-ahead-logging) applied to a unit of user-space execution.
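The register, commit, and replay cycle can be sketched in a few lines. This is a minimal illustration, assuming an in-memory array stands in for durable storage; `Journal`, `runAll`, and the record layout are hypothetical names, not any platform's API.

```typescript
// Hypothetical stand-in for durable storage. A real system would persist
// each record to disk or a database before running the step.
type StepRecord = { seq: number; name: string; status: "registered" | "done"; result?: string };
type Step = { name: string; fn: () => string };

class Journal {
  records: StepRecord[] = [];

  // Register the step BEFORE executing it (the write-ahead part).
  register(name: string): StepRecord {
    const rec: StepRecord = { seq: this.records.length, name, status: "registered" };
    this.records.push(rec);
    return rec;
  }

  // Return the committed record at this position, if there is one.
  completed(seq: number, name: string): StepRecord | undefined {
    const rec = this.records[seq];
    return rec !== undefined && rec.name === name && rec.status === "done" ? rec : undefined;
  }
}

// Runs the steps in order. Committed steps are answered from the journal;
// anything else is executed (or re-executed) and then committed.
function runAll(journal: Journal, steps: Step[]): string[] {
  const out: string[] = [];
  steps.forEach((step, seq) => {
    const prior = journal.completed(seq, step.name);
    if (prior !== undefined && prior.result !== undefined) {
      out.push(prior.result); // replay: no re-execution
      return;
    }
    const rec = journal.records[seq] ?? journal.register(step.name);
    const result = step.fn(); // may throw: the record stays "registered"
    rec.result = result;
    rec.status = "done"; // commit
    out.push(result);
  });
  return out;
}

// Simulated crash: the second step fails once, then succeeds on recovery.
let calls = 0;
let crashOnce = true;
const steps: Step[] = [
  { name: "fetch", fn: () => { calls++; return "data"; } },
  { name: "summarise", fn: () => {
      if (crashOnce) { crashOnce = false; throw new Error("host evicted"); }
      return "summary";
    } },
];
const journal = new Journal();
let firstAttemptFailed = false;
try { runAll(journal, steps); } catch { firstAttemptFailed = true; }
const recovered = runAll(journal, steps);
```

On the second pass the `fetch` step is served from the journal, so its body runs only once across both attempts.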

Why agents force this problem

"An LLM call takes 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish: a deploy, a platform restart, hitting resource limits. The upstream connection to the model provider is severed permanently, in-memory state is lost, and connected clients see the stream stop with no explanation." (Source: Cloudflare Project Think.)

Agents compound two properties that stateless request handlers don't share:

  1. Long wall-clock runtime per unit of work — not because of computation, but because each LLM call is seconds and each tool call is potentially more.
  2. Accumulated multi-step state — mid-loop findings, partial artifacts, tool-call results — that cannot be easily reconstructed from inputs alone.

Together they make "restart from scratch on crash" prohibitively expensive. Durable execution makes the crash invisible to the user (and to the upstream model provider, if the platform hides the reconnection).

Implementation shapes

Step-sequencer / workflow engine

A separate orchestration tier — Temporal, AWS Step Functions, Cadence, Cloudflare Workflows — records each step's input + output + determinism-critical randomness in a durable event log. Replay on recovery: re-run from the first non-committed step, with "already-executed" steps replayed from log rather than re-invoked.
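A rough sketch of event-log replay, including recorded randomness, might look like the following. It assumes an in-memory array as the event log and a workflow function that is re-run from the top on recovery; `ReplayContext`, `activity`, and `random` are hypothetical names for illustration and do not reproduce the API of Temporal or any other engine.

```typescript
type HistoryEvent = { kind: "activity" | "random"; key: string; value: string };

class ReplayContext {
  invoked: string[] = []; // activities actually executed during this run
  private cursor = 0;
  private log: HistoryEvent[];

  constructor(log: HistoryEvent[]) {
    this.log = log;
  }

  // Serve the next event from the log if it exists; otherwise produce and record it.
  private next(kind: "activity" | "random", key: string, produce: () => string): string {
    const recorded = this.log[this.cursor];
    if (recorded !== undefined) {
      if (recorded.kind !== kind || recorded.key !== key) {
        throw new Error(`non-deterministic replay at event ${this.cursor}`);
      }
      this.cursor++;
      return recorded.value;
    }
    const value = produce();
    this.log.push({ kind, key, value });
    this.cursor++;
    return value;
  }

  activity(name: string, input: string, fn: (input: string) => string): string {
    return this.next("activity", `${name}(${input})`, () => {
      this.invoked.push(name);
      return fn(input);
    });
  }

  // Randomness is recorded so that replay sees the same value.
  random(label: string): string {
    return this.next("random", label, () => Math.random().toString(36).slice(2));
  }
}

function orderWorkflow(ctx: ReplayContext, crashAfterCharge: boolean): string {
  const token = ctx.random("idempotency-token");
  const charge = ctx.activity("charge", token, (t) => `charged:${t}`);
  if (crashAfterCharge) throw new Error("worker crashed");
  return ctx.activity("ship", charge, (c) => `shipped after ${c}`);
}

const eventLog: HistoryEvent[] = [];
try { orderWorkflow(new ReplayContext(eventLog), true); } catch { /* simulated crash */ }
const resumed = new ReplayContext(eventLog);
const finalResult = orderWorkflow(resumed, false);
```

After recovery only `ship` is invoked; `charge` and the random token are read back from the log. The mismatch check in `next` is a simplified stand-in for the determinism contract such engines enforce.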

Fiber / co-routine-in-actor

The durable-execution primitive lives inside the agent's own actor (a Durable Object for Cloudflare), not in a separate orchestration tier. Project Think's runFiber("name", async (ctx) => { … }) registers the fiber in the DO's SQLite before execution begins; ctx.stash({ … }) checkpoints arbitrary user-defined state; onFiberRecovered is called with the last stashed snapshot on restart. No separate event log, no separate worker process — the actor is the execution unit. See patterns/checkpoint-resumable-fiber.

The fiber shape trades step-level determinism enforcement (Temporal's hard rule) for lower ceremony — a regular async function with occasional ctx.stash() calls becomes durable. The developer chooses what to checkpoint and what to re-compute; the engine doesn't enforce a determinism contract.
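The trade described above can be illustrated with a small in-memory mimic. This is a sketch under stated assumptions: `FiberHost`, its synchronous `runFiber`, and the `recover` callback are hypothetical and only echo the shape of the primitives named in the text; the real SDK's signatures are not reproduced here.

```typescript
type Snapshot = { [key: string]: unknown };
type FiberRow = { name: string; status: "running" | "done"; stash: Snapshot | null };
type FiberContext = { stash: (snapshot: Snapshot) => void };

class FiberHost {
  // Stand-in for the actor's durable table; it outlives the simulated crash.
  rows: { [name: string]: FiberRow } = {};

  runFiber(name: string, body: (ctx: FiberContext) => void): void {
    const row: FiberRow = { name, status: "running", stash: null };
    this.rows[name] = row; // registered before the body runs
    body({ stash: (snapshot) => { row.stash = { ...snapshot }; } });
    row.status = "done";
  }

  // On restart, hand every unfinished fiber its last stashed snapshot.
  recover(onRecovered: (name: string, snapshot: Snapshot | null) => void): void {
    for (const fiberName of Object.keys(this.rows)) {
      const row = this.rows[fiberName];
      if (row.status === "running") onRecovered(fiberName, row.stash);
    }
  }
}

const pages = ["a", "b", "c", "d"];
const visited: string[] = [];

// The developer decides what to checkpoint: here only the next index.
function crawl(host: FiberHost, start: number, crashAt: number): void {
  host.runFiber("crawl", (ctx) => {
    for (let i = start; i < pages.length; i++) {
      if (i === crashAt) throw new Error("evicted");
      visited.push(pages[i]);
      ctx.stash({ nextIndex: i + 1 });
    }
  });
}

const host = new FiberHost();
try { crawl(host, 0, 2); } catch { /* simulated eviction */ }
host.recover((_name, snapshot) => {
  const raw = snapshot !== null ? snapshot["nextIndex"] : undefined;
  const next = typeof raw === "number" ? raw : 0;
  crawl(host, next, -1);
});
```

Nothing forces the body to be deterministic or to stash anything. If the `ctx.stash` call were omitted, recovery would start again from index 0 and repeat the first two pages.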

Embedded library-in-service

The workflow engine is a library dependency on the host service, not a separate cluster and not an actor-inside-an-actor-substrate. Airbnb's Skipper canonicalises this shape: workflows and actions are plain Java/Kotlin classes with annotations (@WorkflowMethod / @StateField / @SignalMethod / @Execute(checkpoint = true) / @Compensate), state lives in the host service's existing database (MySQL / UDS / DynamoDB), and the engine runs in-process on a dedicated thread pool. See concepts/embedded-workflow-engine + patterns/workflow-primitives-as-annotated-classes.

Durability uses state-field replay instead of event-history replay, which is leaner but gives weaker auditability. The happy-path overhead is near zero via the delayed timeout task pattern: 2 DB writes at workflow start, batched action checkpoints, and a scheduled safety-net task that fires harmlessly on completion or triggers replay on crash, with no coordinator round-trips per activity. Trade-offs: no cross-language support, no cross-service orchestration.
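The delayed timeout task pattern can be sketched as follows. This is not Skipper's code (Skipper is Java/Kotlin); it is a TypeScript illustration with hypothetical names, and it assumes the two start-time writes are the workflow row and its safety-net task, which the source does not spell out.

```typescript
type WorkflowRow = {
  id: string;
  status: "running" | "done";
  state: { [field: string]: string }; // persisted state fields, no event history
};
type TimeoutTask = { workflowId: string; fireAt: number };

// Hypothetical stand-in for the host service's existing database.
class HostDb {
  writes = 0;
  workflows: { [id: string]: WorkflowRow } = {};
  tasks: TimeoutTask[] = [];

  save(row: WorkflowRow): void {
    this.workflows[row.id] = row;
    this.writes++;
  }

  schedule(task: TimeoutTask): void {
    this.tasks.push(task);
    this.writes++;
  }
}

function startWorkflow(db: HostDb, id: string, now: number, leaseMs: number): WorkflowRow {
  const row: WorkflowRow = { id, status: "running", state: {} };
  db.save(row); // write 1 (assumed): the workflow row
  db.schedule({ workflowId: id, fireAt: now + leaseMs }); // write 2 (assumed): the safety net
  return row;
}

// One write persists several completed actions at once.
function checkpoint(db: HostDb, row: WorkflowRow, fields: { [field: string]: string }): void {
  row.state = { ...row.state, ...fields };
  db.save(row);
}

function complete(db: HostDb, row: WorkflowRow): void {
  row.status = "done";
  db.save(row);
}

// Due tasks are a no-op for finished workflows and a replay trigger for
// workflows whose host crashed before completing.
function fireDueTasks(db: HostDb, now: number): WorkflowRow[] {
  const toReplay: WorkflowRow[] = [];
  const pending: TimeoutTask[] = [];
  for (const task of db.tasks) {
    if (task.fireAt > now) { pending.push(task); continue; }
    const row = db.workflows[task.workflowId];
    if (row !== undefined && row.status === "running") toReplay.push(row);
  }
  db.tasks = pending;
  return toReplay;
}

const db = new HostDb();
const ok = startWorkflow(db, "refund-1", 0, 1000);
const writesAtStart = db.writes;
checkpoint(db, ok, { validated: "yes", charged: "yes" });
complete(db, ok);

const crashed = startWorkflow(db, "refund-2", 0, 1000);
checkpoint(db, crashed, { validated: "yes" }); // the host dies after this write

const replayIds = fireDueTasks(db, 2000).map((row) => row.id);
```

Only `refund-2` comes back for replay, with its `validated` field intact; the task for `refund-1` fires and finds nothing to do.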

Platform-hosted engine + dispatched per-tenant code

Cloudflare's Dynamic Workflows (2026-05-01) makes explicit a variant that the three shapes above left implicit: the platform hosts the workflow engine (Workflows V2), but the run(event, step) body itself is dispatched per tenant at invocation time rather than statically bound at deploy time. A single Worker Loader routes every create() call and every subsequent run() invocation into the right tenant's code, loaded on demand as a Dynamic Worker and cached by tenant ID.

This variant sits between the step-sequencer and fiber shapes: the durability machinery is the engine's (IDs, step.sleep(), step.waitForEvent(), retries, hibernation, replay), but the body being run is per-tenant, versioned in the tenant's repo, and loaded as code on each step boundary. Canonicalised in concepts/per-tenant-dynamic-code-dispatch, concepts/envelope-wrap-and-unwrap-metadata-routing, and patterns/dynamic-binding-over-static-binding.

Workflows V2 per-account capacity disclosed with the launch: up to 50,000 concurrent workflow instances and 300 new instances per second. (Source: Cloudflare Dynamic Workflows.)
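The per-tenant routing in this shape comes down to tagging each call with a tenant ID on the way out and resolving that ID to code on the way in. The sketch below assumes a plain object registry in place of a real code loader; `wrap`, `dispatch`, and `TenantLoader` are hypothetical names and are not the Dynamic Workflows library's API.

```typescript
type Envelope = { tenantId: string; payload: string };
type TenantRun = (payload: string) => string;
type TenantRegistry = { [tenantId: string]: TenantRun | undefined };

// Outbound side: tag the payload with the tenant that owns it.
function wrap(tenantId: string, payload: string): Envelope {
  return { tenantId, payload };
}

class TenantLoader {
  loads = 0; // how many times tenant code was actually loaded
  private cache: TenantRegistry = {};
  private registry: TenantRegistry;

  constructor(registry: TenantRegistry) {
    this.registry = registry;
  }

  // Load on demand, then serve from a cache keyed by tenant ID.
  get(tenantId: string): TenantRun {
    const cached = this.cache[tenantId];
    if (cached !== undefined) return cached;
    const code = this.registry[tenantId];
    if (code === undefined) throw new Error(`no code for tenant ${tenantId}`);
    this.loads++;
    this.cache[tenantId] = code;
    return code;
  }
}

// Inbound side: unwrap the envelope and run the owning tenant's body.
function dispatch(loader: TenantLoader, envelope: Envelope): string {
  return loader.get(envelope.tenantId)(envelope.payload);
}

const loader = new TenantLoader({
  acme: (p) => `acme built ${p}`,
  globex: (p) => `globex tested ${p}`,
});
const results = [
  dispatch(loader, wrap("acme", "commit-1")),
  dispatch(loader, wrap("globex", "commit-9")),
  dispatch(loader, wrap("acme", "commit-2")),
];
```

The second `acme` call hits the cache, so three dispatches cost two loads. Eviction of idle tenants is left out of the sketch.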

Design axes

  • Save-point granularity. Per-step (workflow engines) vs developer-chosen (ctx.stash in Project Think's fibers). Per-step reduces replay cost; developer-chosen reduces ceremony.
  • Keepalive discipline. If the platform evicts hibernating actors during active work, the fiber must explicitly keepAlive() / keepAliveWhile(cond) to prevent eviction during a synchronous long-running segment. Project Think bakes this into the fiber primitive.
  • Long-callback shape. For work that spans hours or days (video generation, CI pipelines, human review), the idiom is start the work → persist the external job ID → hibernate → wake on callback. The durable execution engine doesn't hold the compute; it holds the bookmark.
  • Observability during replay. On crash and reconnect, clients watching a fiber stream (WebSockets, SSE) see either a gap or a seamless resume. Resumable streams (Project Think's term) are the client-side half of the durable-execution story.
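The long-callback idiom from the list above (start the work, persist the external job ID, hibernate, wake on callback) can be sketched like this. The store is in-memory and every name (`BookmarkStore`, `startRender`, `onCallback`) is hypothetical; a real engine would keep the bookmark in durable storage.

```typescript
type Bookmark = { workflowId: string; resumeStep: string };

// Stand-in for durable storage keyed by the external system's job ID.
class BookmarkStore {
  private byJobId: { [jobId: string]: Bookmark | undefined } = {};

  put(jobId: string, bookmark: Bookmark): void {
    this.byJobId[jobId] = bookmark;
  }

  // Remove and return the bookmark so a duplicate callback finds nothing.
  take(jobId: string): Bookmark | undefined {
    const bookmark = this.byJobId[jobId];
    delete this.byJobId[jobId];
    return bookmark;
  }
}

// Start the external work and persist only the bookmark. After this returns,
// nothing about the workflow needs to stay in memory.
function startRender(store: BookmarkStore, workflowId: string, submit: () => string): string {
  const jobId = submit(); // the external system returns its own job ID
  store.put(jobId, { workflowId, resumeStep: "publish" });
  return jobId;
}

// Wake on callback: look up the bookmark and resume at the recorded step.
function onCallback(store: BookmarkStore, jobId: string, result: string): string | null {
  const bookmark = store.take(jobId);
  if (bookmark === undefined) return null; // unknown or duplicate callback
  return `${bookmark.workflowId}:${bookmark.resumeStep}:${result}`;
}

const store = new BookmarkStore();
const jobId = startRender(store, "wf-1", () => "job-42");
const firstWake = onCallback(store, jobId, "video.mp4");
const secondWake = onCallback(store, jobId, "video.mp4");
```

Taking the bookmark on first delivery is one simple way to make repeated callbacks harmless; it is a design choice of this sketch rather than something the source prescribes.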

Cost framing

  • Storage cost — fiber registrations + checkpoints + event logs persist in SQLite (DO) / a dedicated log store. Per-checkpoint cost is small but sums with fiber volume.
  • Replay cost — if save points are coarse, crash replay re-executes many steps; side-effectful replays must be idempotent.
  • Developer cost — writing checkpoint-friendly code is a discipline; forgetting to stash a crucial piece of state makes recovery incomplete. Workflow engines enforce this via determinism; Project Think doesn't.
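The replay-cost point above, that side-effectful replays must be idempotent, is commonly handled with idempotency keys. The following is a minimal sketch, assuming an in-memory ledger in place of a durable one; `EffectLedger` and `sendReminders` are hypothetical names.

```typescript
// Records the result of each side effect under a stable key. In a real
// system this ledger must be durable, and the write should be atomic with
// the effect (or the effect's receiver must deduplicate on the same key).
class EffectLedger {
  executed = 0; // how many effects actually ran
  private results: { [key: string]: string | undefined } = {};

  once(key: string, effect: () => string): string {
    const prior = this.results[key];
    if (prior !== undefined) return prior; // replayed step: skip the effect
    const result = effect();
    this.executed++;
    this.results[key] = result;
    return result;
  }
}

// The key is derived from the run ID and the step position, so a replay of
// the same run produces the same keys.
function sendReminders(ledger: EffectLedger, runId: string, users: string[], crashAt: number): void {
  users.forEach((user, i) => {
    if (i === crashAt) throw new Error("crash mid-batch");
    ledger.once(`${runId}/email/${i}`, () => `sent to ${user}`);
  });
}

const ledger = new EffectLedger();
const users = ["ana", "bo", "cy"];
try { sendReminders(ledger, "run-7", users, 2); } catch { /* simulated crash */ }
sendReminders(ledger, "run-7", users, -1); // coarse save point: replay from the top
```

Even though the replay walks all three users again, each email effect runs once.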

Contrast with retry-on-failure

Retry is "restart the whole operation" — acceptable when the operation is short + stateless. Durable execution is "resume from mid-operation" — necessary when the operation is long or has committed intermediate state (sent emails, created tickets, paid for tokens).

Seen in

  • sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine — canonical wiki instance of durable execution packaged as an embedded library (see concepts/embedded-workflow-engine), not a separate orchestration cluster. Airbnb's Skipper exposes durable execution via plain Java/Kotlin classes with a 5-annotation contract (@WorkflowMethod, @StateField, @SignalMethod, @Execute(checkpoint = true), @Compensate), shares the host service's database for workflow state, and runs entirely in-process on a dedicated thread pool. The durability mechanism is replay from checkpointed actions — state fields persisted directly, no event log; previously completed actions short-circuit on replay via stored results. The delayed timeout task is the happy-path zero-overhead mechanism: ~2 DB writes at start, batched checkpoints, no coordinator round-trips per activity, scheduler picks up crashed workflows after a lease period. Load-bearing operational numbers: 15+ production use cases across Tier 0 services (insurance, payments, media, infrastructure, incentives, wallet), peak 10 000 workflows / second on DynamoDB. The framing distinguishes durable execution at three deployment shapes on wiki: Temporal-class external cluster + event history, Project Think-class actor-in-fiber via Durable Object, and now Skipper-class library-in-service via annotated classes. Airbnb's explicit rejection of external clusters for Tier 0 services ("adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows") makes this the first wiki-canonical production-scale instance of the embedded shape.

  • sources/2025-04-03-redpanda-autonomy-is-the-future-of-infrastructure — Alex Gallego's (Redpanda founder) founder-voice positioning of Redpanda as durable-log substrate for enterprise AI agents: the systems/redpanda-agents-sdk explicitly takes durable execution as its load-bearing design focus, with Redpanda as the distributed log underwriting agent-to-agent communication, trace capture, evaluation replay, logs, metrics, collaborative threads, message sampling, analytics, explainability of actions, and time-travel debugging. Load-bearing framing: "Distributed log - Redpanda storage for durable execution, human-in-the-loop workflows, agent-to-agent communication, trace capture, evaluation replay, logs, metrics, collaborative threads, message sampling, analytics, explainability of actions, time travel debugging, etc." Canonical wiki statement pairing durable execution with log-is-truth as its substrate. Mechanism depth is deferred (commit cadence, recovery RPO, replay-correctness model not disclosed), positioning the post as a design-intent disclosure rather than an internals walkthrough.

  • canonical wiki disclosure of Temporal's own rehydration mechanism in its own words: "Temporal captures the progress of a workflow execution (or workflow steps) in a log called the history. In case of a crash, Temporal rehydrates the workflow; that is, Temporal restarts the workflow execution, deduplicates the invocation of all activities that have already been executed, and catches up to where it previously left off." Event history = append-only WAL; replay + activity dedup = recovery path. This is the concrete Temporal-specific embodiment of durable execution at the step-sequencer / workflow-engine altitude, complementary to the fiber / co-routine shape canonicalised by Cloudflare Project Think. Savannah Longoria (2022-07-22) Part 1 of a PlanetScale product-pairing tutorial; the post also canonicalises Temporal's four-subsystem cluster decomposition and the SQL-vs-NoSQL persistence-layer trade-off (see concepts/temporal-persistence-layer).

  • sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents — canonical wiki instance. runFiber() / stash() / onFiberRecovered primitive introduced in the Project Think SDK. The "30-second LLM call → 30-minute loop → platform restart" framing is the motivating example.
  • sources/2025-02-12-flyio-the-exit-interview-jp-phillips — durable execution as orchestrator state-machine substrate. JP Phillips on why flyd needs it: "Once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design. … One of the biggest gains, with how it works in flyd, is knowing we would need to deploy flyd all day, every day. If flyd was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy." flyd's specific embodiment: per-FSM-step records in a BoltDB database, lineage-linked to Cadence at HashiCorp and Compose.io/MongoHQ "recipes." Durable execution is not just an LLM-agent concern — it's the load-bearing property of every deploy-tolerant orchestrator.
  • sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — durable execution as cost-protection for LLM batch jobs. Instacart's Maple builds on Temporal: "Running batch jobs at this scale means errors are inevitable — network issues, provider failures, or bugs can happen mid-run. We use the Temporal durable execution engine to ensure that jobs can resume exactly where they left off without losing any work. This not only protects against data loss but also avoids wasting money on partially completed jobs." Sharpens the motivation beyond agent-loop-crash-recovery: LLM batch APIs bill on submit not on complete, so losing mid-pipeline state doesn't just mean redoing the work, it means re-paying for already-submitted batches. Canonical instance of durable execution as financial risk-reduction, not just reliability primitive.
  • sources/2026-05-01-cloudflare-introducing-dynamic-workflows-durable-execution-that-follows-the-tenant — canonical wiki instance of the fourth shape: platform-hosted engine with dispatched per-tenant code. Cloudflare's Dynamic Workflows library lets a single Worker Loader route every create() and every subsequent run(event, step) invocation into the right tenant's code — loaded as a Dynamic Worker at runtime, cached by tenant ID, evicted when idle. The Workflows engine stays unchanged; the library is ~300 lines of envelope-and-unwrap glue (wrapWorkflowBinding({ tenantId }) on outbound + createDynamicWorkflowEntrypoint on inbound). Structural distinction from the step-sequencer shape: the run() body is per-tenant and dynamic, not statically bound at deploy. Structural distinction from the fiber shape: the durability machinery is the engine's (not co-located in the agent's own actor). First wiki source disclosing the Workflows V2 capacity envelope (50,000 concurrent instances per account, 300 new instances per second). CI/CD showcase (per-repo CIPipeline extends WorkflowEntrypoint in .cloudflare/ci.ts) is the canonical end-to-end realisation; composes with Artifacts + ArtifactFS for the workspace, Dynamic Workers for each lightweight step, and Sandboxes for heavy corners (docker build, integration suites, Rust compiles). Pre-announces the same dynamic-binding pattern for queues, caches, databases, object stores, AI bindings, and MCP servers. Canonicalises concepts/per-tenant-dynamic-code-dispatch and concepts/envelope-wrap-and-unwrap-metadata-routing alongside patterns/dynamic-binding-over-static-binding.

  • sources/2026-06-17-cloudflare-bringing-more-agent-harnesses-and-frameworks-to-cloudflare — extends the fibers primitive disclosure: positions runFiber() / stash() / onFiberRecovered() as the runtime-level durable-execution mechanism that any harness (Project Think, Pi) or framework (Flue) can build on. Flue adds a framework-level complement: Durable Streams (append-only event log) for the same crash-recovery goal at a higher abstraction. Operational numbers: Code Mode isolate cold start <10 ms, $0.002/load.
