CONCEPT Cited by 1 source
Workflow replay from checkpointed actions
Definition
Workflow replay from checkpointed actions is a durability
mechanism where a workflow engine recovers from a crash by
re-executing the workflow method from the beginning, but
substitutes previously completed actions' checkpointed results
for the actual side effects. The workflow method runs
top-to-bottom on every recovery; actions that were already
executed return instantly from their stored output rows in the
database; actions that had not yet run execute normally. When
execution reaches a waitUntil or similar hibernation primitive
whose condition isn't yet met, the engine persists current state
and stops consuming resources until a signal, timer, or restart
re-enters the replay loop.
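The replay loop described above can be sketched in a few lines. This is a minimal illustration, not Skipper's implementation: `ReplayContext`, `Hibernate`, `run`, and `booking_workflow` are hypothetical names, an in-memory dict stands in for the checkpoint rows in the database, and actions are keyed by their call position, which is one assumed way to match a replayed call to its stored result.

```python
from typing import Any, Callable, Dict, List, Tuple


class Hibernate(Exception):
    """Unwinds the workflow method when a wait condition is not yet met."""


class ReplayContext:
    """Hypothetical per-run context; `checkpoints` stands in for database rows."""

    def __init__(self, checkpoints: Dict[int, Any], state: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints      # action position -> stored result
        self.state = state                  # persisted state fields
        self.position = 0                   # index of the next action call in this run
        self.executed_now: List[str] = []   # actions that really ran in this run

    def action(self, name: str, fn: Callable[[], Any]) -> Any:
        position = self.position
        self.position += 1
        if position in self.checkpoints:
            # Replay: return the stored result without running the side effect.
            return self.checkpoints[position]
        result = fn()                        # first execution: real side effect
        self.checkpoints[position] = result  # checkpoint before the workflow advances
        self.executed_now.append(name)
        return result

    def wait_until(self, condition: Callable[[Dict[str, Any]], bool]) -> None:
        if not condition(self.state):
            raise Hibernate()


def run(workflow: Callable[[ReplayContext], None],
        checkpoints: Dict[int, Any],
        state: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Runs the workflow method from the top; returns status and actions that ran."""
    ctx = ReplayContext(checkpoints, state)
    try:
        workflow(ctx)
    except Hibernate:
        return "hibernating", ctx.executed_now
    return "completed", ctx.executed_now


def booking_workflow(ctx: ReplayContext) -> None:
    quote = ctx.action("price_quote", lambda: 120)
    ctx.wait_until(lambda s: s.get("approved", False))
    ctx.action("charge", lambda: f"charged {quote}")


checkpoints: Dict[int, Any] = {}
state: Dict[str, Any] = {}
status1, ran1 = run(booking_workflow, checkpoints, state)  # stops at wait_until
state["approved"] = True                                   # a signal arrives
status2, ran2 = run(booking_workflow, checkpoints, state)  # replay from the top
```

On the second run `price_quote` is served from its checkpoint and only `charge` actually executes, which is the "picks up from where it left off" behaviour.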
Canonicalised by Airbnb's Skipper (Gamba + Sergiyenko, 2026-04-28). (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
The Skipper statement verbatim
"When a workflow starts, Skipper executes the workflow method and checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources. When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly. The workflow picks up from where it left off." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Contrast with event-history replay
Temporal replays a workflow by consuming its full event history — an append-only log of every workflow-relevant event — and reconstructing workflow state event-by-event. This preserves full auditability: an operator can see every branch the workflow took, every signal it received, every activity return value, in the order they happened.
Skipper's model is leaner: instead of an event log, it stores state fields directly plus checkpointed action results. On recovery, the engine loads current state, re-runs the workflow method, and uses the stored action outputs to short-circuit the already-completed work. The post:
"Unlike event-sourced orchestration systems that reconstruct state by replaying an entire event history, Skipper persists state fields directly. There's no event log to replay, just current state and checkpointed action results. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Trade-offs:
| Property | Event-history replay (Temporal) | State-field replay (Skipper) |
|---|---|---|
| Storage growth | Linear in event count | Bounded by current state |
| Replay cost | Linear in event count | Linear in workflow-method calls |
| Auditability | Full (every event visible) | Current state only |
| Signal-heavy workloads | Event log grows per signal | Last signal's state only |
| Long-lived workflows | History bounded via Continue-As-New | Naturally bounded |
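The storage-growth rows can be illustrated with two toy stores. This is a sketch under stated assumptions, not either system's real storage format: `EventHistoryStore` and `StateFieldStore` are hypothetical classes, and signals are reduced to key/value updates.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class EventHistoryStore:
    """Append-only log; state is reconstructed by folding over every event."""
    events: List[Tuple[str, Any]] = field(default_factory=list)

    def record_signal(self, key: str, value: Any) -> None:
        self.events.append((key, value))   # storage grows with every signal

    def rebuild_state(self) -> Dict[str, Any]:
        state: Dict[str, Any] = {}
        for key, value in self.events:     # replay cost is linear in event count
            state[key] = value
        return state


@dataclass
class StateFieldStore:
    """Persists state fields directly; there is no log to replay."""
    state: Dict[str, Any] = field(default_factory=dict)

    def record_signal(self, key: str, value: Any) -> None:
        self.state[key] = value            # overwrite: earlier values are lost

    def load_state(self) -> Dict[str, Any]:
        return dict(self.state)


history = EventHistoryStore()
fields = StateFieldStore()
for guest_count in range(1000):            # a signal-heavy workflow
    history.record_signal("guest_count", guest_count)
    fields.record_signal("guest_count", guest_count)

history_size = len(history.events)         # one stored record per signal
fields_size = len(fields.state)            # one stored field in total
```

Both stores yield the same current state, but only the history store can answer what the value was after an earlier signal, which is the auditability side of the trade.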
Required invariants
For replay to be correct:
- Workflow determinism. Same inputs + checkpointed action results + state fields must produce the same decisions + same action-call sequence on every replay.
- Action-level atomicity with respect to the checkpoint. An action is either not yet executed (runs on next replay) or checkpointed with a result (short-circuits on next replay). The in-between state, where the action ran but the checkpoint write did not commit, is possible on crash, which is why actions must be idempotent (see at-least-once action execution).
- Side effects isolated to actions. API calls, time reads, random numbers, DB writes must all live inside annotated action methods, never in the workflow method body directly.
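The in-between state in the second invariant can be sketched as follows. Everything here is hypothetical: `PaymentService` stands in for any downstream system that deduplicates by an idempotency key, `run_action` is an assumed helper rather than a Skipper API, and the crash is simulated with a flag.

```python
from typing import Any, Callable, Dict


class PaymentService:
    """Hypothetical downstream system that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, str] = {}
        self.calls = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        self.calls += 1
        # A repeated key returns the original receipt instead of charging again.
        return self.charges.setdefault(idempotency_key, f"receipt-{amount}")


class CrashBeforeCheckpoint(Exception):
    """Simulates the process dying after the side effect but before the write."""


def run_action(checkpoints: Dict[int, Any], position: int,
               fn: Callable[[], Any], crash_before_checkpoint: bool = False) -> Any:
    if position in checkpoints:
        return checkpoints[position]        # already checkpointed: short-circuit
    result = fn()                           # side effect happens here
    if crash_before_checkpoint:
        raise CrashBeforeCheckpoint()       # result is never persisted
    checkpoints[position] = result
    return result


payments = PaymentService()
checkpoints: Dict[int, Any] = {}


def charge_guest() -> str:
    return payments.charge("wf-42/action-0", 120)


try:
    run_action(checkpoints, 0, charge_guest, crash_before_checkpoint=True)
except CrashBeforeCheckpoint:
    pass                                            # service restarts, replay begins

receipt = run_action(checkpoints, 0, charge_guest)  # no checkpoint: runs again
replayed = run_action(checkpoints, 0, charge_guest) # checkpointed: short-circuits
```

The action body ran twice, yet the guest was charged once, because the downstream call is idempotent. Without that property the crash window would produce a duplicate charge.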
Relationship to write-ahead logging
Both this mechanism and write-ahead logging solve the same family of problem (surviving process crashes without losing committed state) at different altitudes. A WAL persists intent to mutate before mutation; replay on crash re-applies committed log records. Workflow replay from checkpointed actions persists action-level results before the workflow advances; replay on crash re-runs the workflow method but substitutes stored results for already-completed actions. The action-result table functions as the WAL at the user-space workflow altitude, which is exactly the framing in the concepts/durable-execution page.
The happy-path property
Because action results are checkpointed only on success and the replay substitutes them on recovery, a workflow that completes normally pays:
- 2 initial DB writes (workflow-instance row + delayed timeout task — see patterns/delayed-timeout-task-as-crash-safety-net);
- N checkpoint writes, batched (one per action);
- 1 final state update on completion.
No coordinator round-trips. No network hops per activity. Beyond these writes, the engine does replay work only when a workflow actually hibernates or recovers from a crash. This is the load-bearing argument for the embedded shape's performance-neutrality claim. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
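The write budget above is simple arithmetic. The helper below is a hypothetical illustration that counts logical writes only; the source says checkpoint writes are batched but does not quantify how many database round-trips that saves, so no round-trip count is modelled.

```python
def happy_path_writes(action_count: int) -> int:
    """Logical DB writes for a workflow that completes without a crash."""
    if action_count < 0:
        raise ValueError("action_count must be non-negative")
    initial_writes = 2                  # workflow-instance row + delayed timeout task
    checkpoint_writes = action_count    # one checkpoint per action
    final_writes = 1                    # final state update on completion
    return initial_writes + checkpoint_writes + final_writes


five_action_total = happy_path_writes(5)   # 2 + 5 + 1 = 8
```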
Replay-debugging caveat
The post names a real operator-experience friction:
"Debugging replayed workflows also requires mental model adjustment: engineers must understand that log timestamps and call sequences reflect replays, not original execution. Better observability tooling, particularly replay visualization, would help." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Without replay-aware observability, a production log line's timestamp refers to this replay's execution, not the original execution — a source of confusion when triaging workflows that have crashed + recovered multiple times.
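One possible mitigation is to stamp every log line with the replay it was emitted from. The sketch below uses Python's standard `logging` module and is not something the post describes: `ReplayTagFilter`, `build_logger`, and the `replay` record field are hypothetical, and how an engine would learn the current replay number is left as an assumption.

```python
import io
import logging
from typing import Tuple


class ReplayTagFilter(logging.Filter):
    """Attaches the current replay number to every record it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.replay_number = 0           # 0 = original execution

    def filter(self, record: logging.LogRecord) -> bool:
        record.replay = self.replay_number
        return True                      # never drop the record, only annotate it


def build_logger(stream: io.StringIO) -> Tuple[logging.Logger, ReplayTagFilter]:
    logger = logging.getLogger("workflow-replay-sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("replay=%(replay)d %(message)s"))
    tag = ReplayTagFilter()
    handler.addFilter(tag)               # runs before the handler formats the record
    logger.addHandler(handler)
    return logger, tag


stream = io.StringIO()
logger, tag = build_logger(stream)
logger.info("calling price_quote")       # emitted during the original execution
tag.replay_number = 1                    # engine re-enters the workflow method
logger.info("calling price_quote")       # same code path, emitted during a replay
lines = stream.getvalue().splitlines()
```

With the tag in place, two otherwise identical lines can be told apart as original execution versus replay, even though their timestamps only reflect when each run happened.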
Seen in
- sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine: canonical wiki disclosure. Skipper's replay mechanism canonicalised verbatim: workflow method re-runs from the start on recovery, checkpointed action results replace actual side-effect invocations, waitUntil hibernates until a signal arrives, no event log. Explicit contrast with Temporal-class event-history replay and the efficiency-vs-auditability trade framed by the authors.
Related
- systems/airbnb-skipper — canonical instance.
- systems/temporal — event-history replay contrast.
- concepts/durable-execution — the parent property.
- concepts/embedded-workflow-engine — the shape this mechanism is paired with in Skipper.
- concepts/workflow-determinism-requirement — the correctness invariant replay requires.
- concepts/temporal-persistence-layer — the event-history persistence-layer shape this mechanism deliberately departs from.
- concepts/wal-write-ahead-logging — conceptual ancestor: same shape (log-before-apply, replay-on-recovery) at a different altitude.
- concepts/at-least-once-delivery / concepts/idempotent-operations — invariants actions must satisfy because of replay.
- patterns/checkpoint-resumable-fiber: sibling shape in an actor-model ecosystem (Cloudflare Project Think), with developer-chosen stash checkpoints + onFiberRecovered hook, same replay-from-last-save-point property at a different altitude.