PATTERN Cited by 1 source

Compensation stub recovery via replay

Intent

Rebuild in-memory callable references to compensation (rollback) handlers after an engine crash or restart, by replaying the workflow code and re-encountering the step.do() calls that originally registered them — without re-executing the forward step bodies.

Context

In a durable workflow engine, rollback handlers are registered as in-memory callable references (stubs) when forward steps are declared. If the engine is evicted, crashes, or restarts, these in-memory references are lost. However, the engine still knows which steps registered rollback from its persisted step history. The problem: how to recover the callable handlers without re-invoking external side effects.

Solution

Use the engine's existing replay mechanism:

  1. On recovery, re-run the Workflow code from the beginning.
  2. When replay encounters a completed step.do(), read the persisted result instead of re-executing the step body (standard replay behavior).
  3. When replay encounters a step.do() with a rollback option whose step is eligible for compensation, re-register the rollback stub from the options object.
  4. Once replay completes, the engine has rebuilt all callable handlers it needs to execute the reverse walk.
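The four steps above can be sketched with a toy, synchronous engine. This is a minimal illustration under stated assumptions, not the Cloudflare Workflows implementation: ToyEngine, StepRecord, StepOptions, and the effects array are hypothetical names, the durable log is a plain Map, and the real step.do() is asynchronous.

```typescript
// Toy model only: ToyEngine, StepRecord, and StepOptions are hypothetical
// names for illustration, not the Cloudflare Workflows API.
type Rollback = (output: unknown) => void;

interface StepOptions {
  rollback?: Rollback;
}

// What survives a restart: the step output and the fact that rollback was registered.
interface StepRecord {
  output: unknown;
  hasRollback: boolean;
}

class ToyEngine {
  private log: Map<string, StepRecord>;
  // In-memory callable references; lost on restart, rebuilt by replay.
  private stubs: { name: string; rollback: Rollback }[] = [];

  constructor(log: Map<string, StepRecord>) {
    this.log = log;
  }

  do<T>(name: string, options: StepOptions, body: () => T): T {
    let record = this.log.get(name);
    if (record === undefined) {
      // First execution: run the forward body and persist its result.
      record = { output: body(), hasRollback: options.rollback !== undefined };
      this.log.set(name, record);
    }
    // On replay the record already exists, so the body above was skipped.
    if (options.rollback !== undefined && record.hasRollback) {
      // Register (or re-register) the handler from the options object.
      this.stubs.push({ name, rollback: options.rollback });
    }
    return record.output as T;
  }

  // Reverse walk: invoke handlers in the opposite order of registration.
  compensate(): void {
    for (const stub of this.stubs.slice().reverse()) {
      const record = this.log.get(stub.name);
      if (record !== undefined) {
        stub.rollback(record.output);
      }
    }
  }
}

// Records every side effect so that re-execution would be visible.
const effects: string[] = [];

function workflow(step: ToyEngine): void {
  const hold = step.do(
    "reserve",
    { rollback: (out) => { effects.push("release " + String(out)); } },
    () => { effects.push("reserve"); return "hold-1"; },
  );
  step.do(
    "charge",
    { rollback: (out) => { effects.push("refund " + String(out)); } },
    () => { effects.push("charge"); return "charge-for-" + hold; },
  );
}

const durableLog = new Map<string, StepRecord>();
workflow(new ToyEngine(durableLog)); // original run: both forward bodies execute

// Simulated restart: a fresh engine has no stubs, only the persisted log.
const recovered = new ToyEngine(durableLog);
workflow(recovered);   // replay: bodies skipped, stubs re-registered
recovered.compensate(); // reverse walk using the rebuilt stubs
```

In this sketch the forward bodies run once, during the original execution. The replay pass adds nothing to effects until compensate() invokes the rebuilt handlers in reverse order.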

The key insight: the rollback function is source code co-located with the step definition (not data stored in the durable log). Replay re-encounters the source code, which re-registers the callable reference, which the engine can then invoke.

Properties

  • No serialization of handler code. The engine doesn't persist the text or bytecode of the rollback function. It persists only two things: the fact that rollback was registered, and the step output. The handler itself is recovered by re-running the Workflow code.
  • Relies on deterministic replay. The Workflow must take the same code path on replay to encounter the same step.do() registrations. Non-deterministic branching before a rollback-bearing step could prevent recovery.
  • Workers RPC stubs provide the callable reference model. In Cloudflare's implementation, stubs are handles to code running elsewhere, with lifetime management via dup(). Rollback stubs outlive the immediate step.do() scope.
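The determinism requirement can be illustrated with a small toy model. Everything here is a hypothetical sketch (makeStep, StepFn, StepLog, and the now parameter standing in for a clock reading are assumptions, not part of any real API). It contrasts branching directly on a value that differs between runs with a common durable-execution practice: recording the decision inside a step so that replay reads it back.

```typescript
// Hypothetical toy: a step runs its body once, then returns the persisted output.
type StepLog = Map<string, unknown>;
type StepFn = <T>(name: string, body: () => T, rollback?: () => void) => T;

function makeStep(log: StepLog, registered: string[]): StepFn {
  return function <T>(name: string, body: () => T, rollback?: () => void): T {
    if (!log.has(name)) {
      log.set(name, body()); // first execution only
    }
    if (rollback !== undefined) {
      registered.push(name); // (re-)register the in-memory handler
    }
    return log.get(name) as T;
  };
}

// Fragile: the branch depends on a value that changes between run and replay.
function fragile(step: StepFn, now: number): void {
  if (now < 150) {
    step("reserve", () => "hold-1", () => {});
  }
}

// More robust: the decision is itself a step, so replay reads the persisted answer.
function robust(step: StepFn, now: number): void {
  const early = step("decide", () => now < 150);
  if (early) {
    step("reserve", () => "hold-1", () => {});
  }
}

// Runs the workflow once, then replays it against the same log with a later clock.
function runThenReplay(wf: (step: StepFn, now: number) => void): string[] {
  const log: StepLog = new Map();
  wf(makeStep(log, []), 100);        // original run
  const reRegistered: string[] = [];
  wf(makeStep(log, reRegistered), 200); // replay after restart
  return reRegistered;
}

const fragileStubs = runThenReplay(fragile); // "reserve" handler is never re-encountered
const robustStubs = runThenReplay(robust);   // "reserve" handler is re-registered
```

In the fragile variant the replay takes the other branch, so the handler for the already-completed "reserve" step is not rebuilt. The robust variant takes the same path on replay because the branch condition comes from the persisted step history.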

Trade-offs

  • Elegant in the common case — no extra persistence or serialization needed for handler code.
  • Fragile under code changes — if a deploy changes the Workflow between the original execution and the crash recovery, replay may encounter different rollback logic than was originally registered.
  • Determinism dependency — non-deterministic Workflow code could prevent replay from reaching the correct step registrations.
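One conceivable safeguard against the last two trade-offs is to compare, after replay, the steps persisted as rollback-bearing against the stubs that replay actually re-registered. The source does not describe such a check. missingStubs and the StepRecord shape below are hypothetical, shown only as a sketch of the idea.

```typescript
// Hypothetical persisted record: the output plus the fact that rollback was registered.
interface StepRecord {
  output: unknown;
  hasRollback: boolean;
}

// Returns the names of steps whose rollback was persisted as registered
// but whose handler was not re-encountered during replay.
function missingStubs(
  log: Map<string, StepRecord>,
  reRegistered: Set<string>,
): string[] {
  const missing: string[] = [];
  log.forEach((record, name) => {
    if (record.hasRollback && !reRegistered.has(name)) {
      missing.push(name);
    }
  });
  return missing;
}

// Example: replay only re-encountered "reserve", although "charge" also had rollback.
const persisted = new Map<string, StepRecord>([
  ["reserve", { output: "hold-1", hasRollback: true }],
  ["notify", { output: null, hasRollback: false }],
  ["charge", { output: "charge-1", hasRollback: true }],
]);
const unrecovered = missingStubs(persisted, new Set(["reserve"]));
```

A check like this would flag a handler that disappeared because of a renamed step, a removed rollback option, or a divergent code path. It would not detect a handler whose logic changed between deploys while its step name stayed the same.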

Seen in

  • sources/2026-06-25-cloudflare-saga-rollbacks-for-workflows — canonical wiki instance. Cloudflare Workflows uses replay to rebuild rollback stubs after engine restart. "To recover, Workflows uses replay: a recovery mode where it can re-run the Workflow code without re-executing completed forward step bodies. [...] As those step.do() calls are encountered, their rollback options can register the callable stubs again."