CONCEPT Cited by 1 source

Workflow replay from checkpointed actions

Definition

Workflow replay from checkpointed actions is a durability mechanism where a workflow engine recovers from a crash by re-executing the workflow method from the beginning, but substitutes previously completed actions' checkpointed results for the actual side effects. The workflow method runs top-to-bottom on every recovery; actions that were already executed return instantly from their stored output rows in the database; actions that had not yet run execute normally. When execution reaches a waitUntil or similar hibernation primitive whose condition isn't yet met, the engine persists current state and stops consuming resources until a signal, timer, or restart re-enters the replay loop.
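
The replay loop described above can be sketched in a few lines. This is a minimal illustration, not Skipper's implementation: the `Engine` class, its `action` / `wait_until` / `signal` methods, the in-memory dictionaries standing in for database tables, and the choice to key checkpoints by action-call order are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class Hibernate(Exception):
    """Raised by wait_until to end the current replay pass."""


@dataclass
class Engine:
    # Hypothetical in-memory stand-ins for the database tables.
    checkpoints: dict = field(default_factory=dict)  # action-call index -> result
    state: dict = field(default_factory=dict)        # persisted state fields
    executed: list = field(default_factory=list)     # actions that really ran
    _cursor: int = 0

    def action(self, name, fn, *args):
        """Run fn once; on later replays return the stored result instead."""
        key = self._cursor  # relies on a deterministic action-call order
        self._cursor += 1
        if key in self.checkpoints:
            return self.checkpoints[key]
        result = fn(*args)
        self.executed.append(name)
        self.checkpoints[key] = result  # checkpoint only after success
        return result

    def wait_until(self, predicate):
        """Hibernate unless the condition on the state fields already holds."""
        if not predicate(self.state):
            raise Hibernate()

    def run(self, workflow):
        """One replay pass: always starts at the top of the workflow method."""
        self._cursor = 0
        try:
            return workflow(self)
        except Hibernate:
            return None

    def signal(self, workflow, **fields):
        """Update state fields, then re-enter the replay loop."""
        self.state.update(fields)
        return self.run(workflow)


def booking_workflow(engine):
    hold = engine.action("reserve", lambda: "hold-1")
    engine.wait_until(lambda s: s.get("approved", False))
    return engine.action("charge", lambda h: f"charged:{h}", hold)


engine = Engine()
first = engine.run(booking_workflow)                     # reserve runs, then hibernates
second = engine.signal(booking_workflow, approved=True)  # reserve short-circuits, charge runs
```

After the second pass `engine.executed` lists each action once, even though the workflow method ran twice from the top.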

Canonicalised by Airbnb's Skipper (Gamba + Sergiyenko, 2026-04-28). (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

The Skipper statement verbatim

"When a workflow starts, Skipper executes the workflow method and checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources. When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly. The workflow picks up from where it left off." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Contrast with event-history replay

Temporal replays a workflow by consuming its full event history — an append-only log of every workflow-relevant event — and reconstructing workflow state event-by-event. This preserves full auditability: an operator can see every branch the workflow took, every signal it received, every activity return value, in the order they happened.

Skipper's model is leaner: instead of an event log, it stores state fields directly plus checkpointed action results. On recovery, the engine loads current state, re-runs the workflow method, and uses the stored action outputs to short-circuit the already-completed work. The post:

"Unlike event-sourced orchestration systems that reconstruct state by replaying an entire event history, Skipper persists state fields directly. There's no event log to replay, just current state and checkpointed action results. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Trade:

| Property | Event-history replay (Temporal) | State-field replay (Skipper) |
| --- | --- | --- |
| Storage growth | Linear in event count | Bounded by current state |
| Replay cost | Linear in event count | Linear in workflow-method calls |
| Auditability | Full (every event visible) | Current state only |
| Signal-heavy workloads | Event log grows per signal | Last signal's state only |
| Long-lived workflows | History re-sized via snapshots | Naturally bounded |
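
To make the storage and replay rows concrete, here is a small sketch contrasting the two recovery styles under a signal-heavy workload. It is an illustration only: the function names, the `(kind, payload)` event shape, and the `last_price` field are hypothetical, and neither function reflects Temporal's or Skipper's actual storage format.

```python
def recover_from_history(events):
    """Event-history style: rebuild state by folding every recorded event."""
    state = {}
    for kind, payload in events:
        if kind == "signal":
            state.update(payload)
    return state


def recover_from_state_fields(row):
    """State-field style: the persisted row already is the current state."""
    return dict(row)


events = []  # append-only history, grows with every signal
row = {}     # state fields, overwritten in place

for i in range(1000):
    payload = {"last_price": i}
    events.append(("signal", payload))
    row.update(payload)

from_history = recover_from_history(events)
from_fields = recover_from_state_fields(row)
```

Both paths recover the same state, but the history holds 1000 entries (each of which remains inspectable), while the state row holds a single field and retains nothing about the earlier 999 signals.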

Required invariants

For replay to be correct:

  • Workflow determinism. The same inputs, checkpointed action results, and state fields must produce the same decisions and the same action-call sequence on every replay.
  • Action-level atomicity with respect to the checkpoint. An action is either not-yet-executed (runs on next replay) or checkpointed-with-result (short-circuits on next replay). The in-between state (the action ran but the checkpoint write did not commit) is possible on crash, and in that case the next replay executes the action again. This is why actions must be idempotent (see at-least-once action execution).
  • Side effects isolated to actions. API calls, time reads, random numbers, DB writes must all live inside annotated action methods, never in the workflow method body directly.
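
The crash window in the second invariant can be sketched as follows. This is a hedged illustration, assuming a downstream service that deduplicates by an idempotency key; `PaymentService`, `run_action`, the key format `"wf-42/charge"`, and the `crash_before_checkpoint` flag are all hypothetical names introduced for the example.

```python
class CrashBeforeCheckpoint(Exception):
    """Simulates the process dying after the side effect but before the checkpoint write."""


class PaymentService:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, idempotency_key, amount):
        # A repeated key returns the original charge instead of charging again.
        new_charge = {"id": f"ch-{len(self.charges) + 1}", "amount": amount}
        return self.charges.setdefault(idempotency_key, new_charge)


def run_action(checkpoints, key, fn, crash_before_checkpoint=False):
    if key in checkpoints:
        return checkpoints[key]        # checkpointed: short-circuit
    result = fn()                      # the side effect happens here
    if crash_before_checkpoint:
        raise CrashBeforeCheckpoint()  # result is never stored
    checkpoints[key] = result
    return result


service = PaymentService()
checkpoints = {}


def charge():
    return service.charge("wf-42/charge", 100)


try:
    run_action(checkpoints, "charge", charge, crash_before_checkpoint=True)
except CrashBeforeCheckpoint:
    pass  # the engine restarts and replays

result = run_action(checkpoints, "charge", charge)  # the action executes a second time
```

The action body ran twice, yet the service recorded one charge, because the idempotency key is derived from stable workflow identity rather than from anything that changes between replays.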

Relationship to write-ahead logging

Both this mechanism and write-ahead logging solve the same family of problem (surviving process crashes without losing committed state) at different altitudes. A WAL persists the intent to mutate before the mutation; replay on crash re-applies committed log records. Workflow replay from checkpointed actions persists action-level results before the workflow advances; replay on crash re-runs the workflow method but substitutes stored results for already-completed actions. The action-result table functions as the WAL at the user-space workflow altitude, which is exactly the framing in the concepts/durable-execution page.
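
For comparison, the WAL side of the analogy can be sketched in the same style. This is a toy illustration only: a Python list stands in for a durable log, and `apply_with_wal` and `recover_store` are hypothetical helper names, not any database's API.

```python
def apply_with_wal(log, store, key, value):
    """Record the intended mutation first, then apply it."""
    log.append((key, value))  # stand-in for a durable, fsynced log write
    store[key] = value


def recover_store(log):
    """After a crash, rebuild the store by re-applying every logged record."""
    store = {}
    for key, value in log:
        store[key] = value
    return store


log, store = [], {}
apply_with_wal(log, store, "balance", 10)
apply_with_wal(log, store, "balance", 25)

store = {}                   # simulate losing the in-memory store in a crash
recovered = recover_store(log)
```

The structural parallel is that both designs treat a durable record as the source of truth on recovery; the difference is that a WAL record describes a mutation to re-apply, whereas an action checkpoint holds a result to return in place of re-running the action.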

The happy-path property

Because action results are checkpointed only on success and the replay substitutes them on recovery, a workflow that completes normally pays only for its checkpoint writes: no coordinator round-trips and no network hops per activity. The replay work itself is incurred only when the workflow is actually re-entered, for example after a crash. This is the load-bearing argument for the embedded shape's performance-neutrality claim. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Replay-debugging caveat

The post names a real operator-experience friction:

"Debugging replayed workflows also requires mental model adjustment: engineers must understand that log timestamps and call sequences reflect replays, not original execution. Better observability tooling, particularly replay visualization, would help." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Without replay-aware observability, a production log line's timestamp refers to the current replay's execution, not the original execution. This is a source of confusion when triaging workflows that have crashed and recovered multiple times.
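
One way to reduce that confusion is to store the original execution time alongside each checkpoint and tag every log record with the replay pass that produced it. The sketch below is an assumption about what such tooling could look like, not something the post describes; `run_action`, the log record fields, and the injected `clock` are all hypothetical.

```python
import time


def run_action(checkpoints, log, replay_pass, key, fn, clock=time.time):
    """Run or short-circuit an action, emitting a replay-aware log record."""
    now = clock()
    if key in checkpoints:
        entry = checkpoints[key]
        log.append({
            "action": key,
            "replayed": True,  # result came from the checkpoint
            "pass": replay_pass,
            "logged_at": now,
            "originally_executed_at": entry["executed_at"],
        })
        return entry["result"]
    result = fn()
    checkpoints[key] = {"result": result, "executed_at": now}
    log.append({
        "action": key,
        "replayed": False,
        "pass": replay_pass,
        "logged_at": now,
        "originally_executed_at": now,
    })
    return result


ticks = iter([100.0, 250.0])  # fake clock: original run, then a later replay


def fake_clock():
    return next(ticks)


checkpoints, log = {}, []
run_action(checkpoints, log, 1, "reserve", lambda: "hold-1", clock=fake_clock)
run_action(checkpoints, log, 2, "reserve", lambda: "hold-1", clock=fake_clock)
```

The second record is logged at 250.0 but still points back to 100.0 as the original execution time, which is the distinction an operator needs when reading logs from a workflow that has been replayed.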

Seen in

  • sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine: canonical wiki disclosure. Skipper's replay mechanism canonicalised verbatim: workflow method re-runs from the start on recovery, checkpointed action results replace actual side-effect invocations, waitUntil hibernates until a signal arrives, no event log. Explicit contrast with Temporal-class event-history replay and the efficiency-vs-auditability trade framed by the authors.