How we built saga rollbacks for Cloudflare Workflows
Summary
Cloudflare shipped native saga rollback support for Workflows, allowing developers to declare per-step compensation logic directly as an option on step.do(). When a Workflow fails terminally, the engine automatically invokes registered rollback handlers in reverse step-start order. The post details the API design journey (fluent API → builder API → options object), the internal execution model (durable step history + callable RPC stubs + replay-based recovery), and the interaction between rollback and Workers RPC's promise pipelining semantics.
Key Takeaways
- Saga rollback is now a first-class `step.do()` option. Developers pass `{ rollback: async ({ output }) => { ... } }` as the third argument. No separate catch blocks, no manual ordering, no ad-hoc cleanup logic. (Section: intro + API design)
- Rollback executes in reverse step-start order, not completion order. For parallel steps where completion order is non-deterministic, the engine uses the persisted start order as the stable sequencing source of truth. (Section: ordering)
- The failing step itself may need rollback. A step that partially interacted with an external system before failing is still rollback-eligible. Handlers must handle `output === undefined` because the step may have failed before persisting a value. (Section: detail 1)
- Rollback only fires on terminal Workflow failure. If user code catches an error and the Workflow continues, no rollback triggers. Rollback starts only when the Workflow is about to fail terminally. (Section: detail 2)
- Fluent API (`step.do(...).rollback(...)`) was rejected because of promise pipelining. In Workers RPC, `.rollback()` chained on the returned Promise would look like a pipelined call on the step's output, conflating step metadata with step result. It would also complicate step timing: the engine would need to wait to see if `.rollback()` gets attached before starting the step. (Section: fluent API rejection)
- Builder API (`step.saga("name").do(...).rollback(...).run()`) was rejected for ceremony. Forgetting `.run()` would silently drop the step; the pattern introduces a new builder type that makes `step.do()` feel like a legacy API. (Section: builder API rejection)
- Under the hood: durable step history + callable RPC stubs. The engine records whether each step registered a rollback handler. The rollback function itself is kept as a Workers RPC stub: a callable reference that can outlive the immediate `step.do()` call. (Section: implementation)
- Recovery via replay. If the engine restarts and in-memory stubs are lost, Workflows uses its standard replay mechanism (re-running the Workflow code, reading persisted results instead of re-executing forward step bodies) to rebuild the callable rollback stubs for eligible steps. (Section: recovery after restart)
- Rollback handlers get the same durable machinery as forward steps. Each compensation runs through Workflows' normal step infrastructure: retries, timeouts, lifecycle events, logs. `rollbackConfig` lets you set retry limits, delay, backoff, and timeout per handler. (Section: rollbackConfig)
- If a rollback handler exhausts retries, the Workflow enters the Errored state and remaining handlers do not execute. This is the explicit stop condition for the reverse walk. (Section: implementation)
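The ordering and eligibility rules in these takeaways can be sketched with a small in-memory model. This is not Cloudflare's engine or API surface: `MiniSaga`, `StepOptions`, `run()`, and the step names are hypothetical, retries are omitted, and a throwing handler stands in for "exhausted retries". It only illustrates reverse step-start order, the failing step receiving `output === undefined`, and the walk stopping when a handler fails.

```typescript
// Hypothetical in-memory model of the rollback semantics summarized above.
// MiniSaga, StepOptions, and run() are illustrative names, not a real API.

type RollbackFn<T> = (ctx: { output: T | undefined }) => Promise<void>;

interface StepOptions<T> {
  rollback?: RollbackFn<T>;
}

interface StepRecord {
  name: string;
  output: unknown; // stays undefined if the step body threw
  rollback?: RollbackFn<any>;
}

class MiniSaga {
  // Records are appended when a step STARTS, so array order is start order.
  private history: StepRecord[] = [];

  async do<T>(name: string, body: () => Promise<T>, options: StepOptions<T> = {}): Promise<T> {
    const record: StepRecord = { name, output: undefined, rollback: options.rollback };
    this.history.push(record); // registered before the body runs
    const output = await body(); // a throw leaves record.output undefined
    record.output = output;
    return output;
  }

  // Runs the workflow; on an uncaught (terminal) error, compensates and rethrows.
  async run(workflow: (step: MiniSaga) => Promise<void>): Promise<void> {
    try {
      await workflow(this);
    } catch (err) {
      // Walk in reverse start order. A throwing handler aborts the loop,
      // which models "remaining handlers do not execute".
      for (const record of [...this.history].reverse()) {
        if (record.rollback) await record.rollback({ output: record.output });
      }
      throw err;
    }
  }
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const saga = new MiniSaga();
  try {
    await saga.run(async (step) => {
      await step.do("reserve-inventory", async () => "res-1", {
        rollback: async ({ output }) => { log.push(`release ${output}`); },
      });
      await step.do("charge-card", async () => "chg-1", {
        rollback: async ({ output }) => { log.push(`refund ${output}`); },
      });
      await step.do("ship", async (): Promise<string> => { throw new Error("carrier down"); }, {
        rollback: async ({ output }) => {
          // The failing step is still eligible; its output was never recorded.
          log.push(output === undefined ? "cancel shipment (no output)" : `cancel ${output}`);
        },
      });
    });
  } catch (_err) {
    log.push("workflow failed");
  }
  return log;
}
```

Calling `demo()` yields the compensations in the order ship, charge-card, reserve-inventory, followed by the terminal failure. Because `do()` pushes its record synchronously before awaiting the body, steps launched together with `Promise.all` would also be recorded in the order they were started, whatever order they finish in.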
Systems Extracted
- systems/cloudflare-workflows — the durable-execution engine receiving saga rollback as a new capability
- systems/cloudflare-workers — the runtime substrate; Workers RPC provides the stub/pipelining semantics that constrained API design
- systems/cloudflare-durable-objects — implied backing store for step history persistence
Concepts Extracted
- concepts/workflow-compensation-action: compensation elevated to a first-class engine primitive (extends Skipper's `@Compensate` to Cloudflare's `{ rollback }` option)
- concepts/durable-execution: Workflows' core property that makes rollback recoverable across restarts
- concepts/workflow-replay-from-checkpointed-actions — replay mechanism used to rebuild rollback stubs after crash
- concepts/idempotent-operations — stated requirement for rollback handlers ("use the payment provider's idempotency key")
- concepts/promise-pipelining — Workers RPC's Cap'n Proto-inherited pattern that made the fluent API semantically ambiguous
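Since rollback handlers run with retries, the idempotency requirement matters in practice: a handler may be invoked more than once for the same step. A minimal sketch of the idea follows, assuming a hypothetical in-memory `FakePaymentProvider` whose `refund()` accepts an idempotency key. It is not a real payment SDK, and the choice to skip the refund when `output` is `undefined` is a simplification for illustration.

```typescript
// Hypothetical in-memory payment provider. refund() and its idempotency-key
// parameter are illustrative and do not correspond to a real SDK.
class FakePaymentProvider {
  private refundsByKey = new Map<string, string>();
  refundsIssued = 0;

  async refund(chargeId: string, idempotencyKey: string): Promise<string> {
    const existing = this.refundsByKey.get(idempotencyKey);
    if (existing !== undefined) return existing; // retried call: reuse the earlier result
    this.refundsIssued++;
    const refundId = `rf-${this.refundsIssued}-${chargeId}`;
    this.refundsByKey.set(idempotencyKey, refundId);
    return refundId;
  }
}

// Builds a rollback handler that is safe to invoke more than once.
function makeRefundRollback(provider: FakePaymentProvider) {
  return async ({ output }: { output: string | undefined }): Promise<void> => {
    // Simplification: with no persisted charge id there is nothing to key a refund on.
    if (output === undefined) return;
    // Deriving the key from the charge id makes every retry map to the same refund.
    await provider.refund(output, `refund:${output}`);
  };
}

async function demoIdempotentRollback(): Promise<number> {
  const provider = new FakePaymentProvider();
  const rollback = makeRefundRollback(provider);
  await rollback({ output: "chg-1" });
  await rollback({ output: "chg-1" }); // simulated retry of the same compensation
  return provider.refundsIssued;
}
```

The second invocation finds the stored key and issues no new refund, so `demoIdempotentRollback()` resolves to a single issued refund.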
Patterns Extracted
- patterns/saga-over-long-transaction — the overarching pattern; Workflows is now a native implementation substrate
- patterns/saga-rollback-as-step-metadata — new: rollback declared as metadata on the durable step definition, not a separate handler or catch block
- patterns/compensation-stub-recovery-via-replay — new: replay the workflow code to rebuild callable compensation references after engine restart
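The stub-recovery pattern can be illustrated with a toy replay loop. Everything here is an assumption made for the sketch: the `PersistedStep` record shape, the `rebuildStubs` helper, and the choice to rethrow a persisted error so that replay follows the same path as the original run. The post does not describe the engine's internals at this level of detail.

```typescript
// Hypothetical sketch of rebuilding rollback stubs by replay. All names are illustrative.

type Rollback = (ctx: { output: any }) => Promise<void>;

interface PersistedStep {
  name: string;
  output: unknown;      // undefined when the step failed before persisting a value
  hasRollback: boolean; // durable flag: did this step register a handler?
  error?: string;       // present when the step failed
}

interface StepOptions<T> {
  rollback?: (ctx: { output: T | undefined }) => Promise<void>;
}

interface Step {
  do<T>(name: string, body: () => Promise<T>, options?: StepOptions<T>): Promise<T>;
}

// Replays `workflow` against persisted history. Forward bodies are never called;
// rollback closures are captured again as the code re-registers them.
async function rebuildStubs(
  workflow: (step: Step) => Promise<void>,
  persisted: PersistedStep[],
): Promise<Map<string, Rollback>> {
  const byName = new Map<string, PersistedStep>();
  for (const s of persisted) byName.set(s.name, s);
  const stubs = new Map<string, Rollback>();

  const replayStep: Step = {
    async do<T>(name: string, _body: () => Promise<T>, options?: StepOptions<T>): Promise<T> {
      const record = byName.get(name);
      if (!record) throw new Error(`no persisted record for step "${name}"`);
      if (record.hasRollback && options?.rollback) stubs.set(name, options.rollback);
      if (record.error !== undefined) throw new Error(record.error); // same failure as the original run
      return record.output as T; // persisted result instead of re-executing the body
    },
  };

  try {
    await workflow(replayStep);
  } catch (_err) {
    // Replay ends where the original run failed. A real engine would need to
    // distinguish this from a missing record; this sketch does not.
  }
  return stubs;
}

async function demoReplay(): Promise<{ bodiesExecuted: number; undone: string[] }> {
  let bodiesExecuted = 0;
  const undone: string[] = [];
  const workflow = async (step: Step): Promise<void> => {
    const reservation = await step.do("reserve", async () => { bodiesExecuted++; return "res-1"; }, {
      rollback: async ({ output }) => { undone.push(`release ${output}`); },
    });
    await step.do("charge", async () => { bodiesExecuted++; return `chg-for-${reservation}`; }, {
      rollback: async ({ output }) => {
        undone.push(output === undefined ? "skip refund (no charge id)" : `refund ${output}`);
      },
    });
  };
  const persisted: PersistedStep[] = [
    { name: "reserve", output: "res-1", hasRollback: true },
    { name: "charge", output: undefined, hasRollback: true, error: "card declined" },
  ];
  const stubs = await rebuildStubs(workflow, persisted);
  // Compensate in reverse start order using the rebuilt stubs.
  for (const record of [...persisted].reverse()) {
    const stub = stubs.get(record.name);
    if (stub) await stub({ output: record.output });
  }
  return { bodiesExecuted, undone };
}
```

In this model no forward step body runs during replay, yet both rollback closures become callable again, which is the property the pattern relies on.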
Operational Numbers
- Rollback retry config supports: configurable `retries.limit`, `retries.delay`, `retries.backoff` (exponential), and per-handler `timeout`.
- No throughput/scale numbers disclosed in this post (see prior Workflows V2 post for 50K concurrent instances, 300 new/sec).
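To make the retry fields concrete, the sketch below computes the wait before each retry from a config object. The field names follow the list above, but the `RollbackConfig` type, millisecond units, the `"constant"` and `"linear"` modes, and the doubling factor for exponential backoff are assumptions for illustration, not confirmed details of `rollbackConfig`.

```typescript
// Hypothetical shape: field names follow the summary (retries.limit, retries.delay,
// retries.backoff, timeout). Millisecond units and exact semantics are assumed.
interface RollbackConfig {
  retries: {
    limit: number; // maximum number of retries
    delay: number; // base delay in milliseconds (assumed unit)
    backoff: "constant" | "linear" | "exponential";
  };
  timeout: number; // per-attempt timeout in milliseconds (assumed unit)
}

// Returns the wait before each retry, assuming exponential backoff doubles per attempt.
function retryDelays(config: RollbackConfig): number[] {
  const { limit, delay, backoff } = config.retries;
  const delays: number[] = [];
  for (let attempt = 0; attempt < limit; attempt++) {
    if (backoff === "exponential") delays.push(delay * Math.pow(2, attempt));
    else if (backoff === "linear") delays.push(delay * (attempt + 1));
    else delays.push(delay);
  }
  return delays;
}

const refundRollbackConfig: RollbackConfig = {
  retries: { limit: 3, delay: 1000, backoff: "exponential" },
  timeout: 30000,
};

const refundDelays = retryDelays(refundRollbackConfig); // [1000, 2000, 4000]
```

Under these assumptions a handler with this config would be retried after 1, 2, and 4 seconds; once the limit is exhausted the reverse walk stops, as described in the takeaways.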
Caveats
- Sequential rollback only (parallel compensation is listed as future work).
- If a rollback handler fails after exhausting retries, remaining handlers are skipped — no partial-compensation-with-continuation model.
- The post does not disclose how the engine handles cases where replay fails to reach a step that registered rollback (e.g., non-deterministic Workflow code that takes a different branch on replay).
Source
- Original: https://blog.cloudflare.com/rollbacks-for-workflows/
- Raw markdown: `raw/cloudflare/2026-06-25-how-we-built-saga-rollbacks-for-cloudflare-workflows-19b02940.md`
Related
- systems/cloudflare-workflows
- systems/cloudflare-workers
- concepts/workflow-compensation-action
- concepts/durable-execution
- concepts/workflow-replay-from-checkpointed-actions
- concepts/promise-pipelining
- patterns/saga-over-long-transaction
- patterns/saga-rollback-as-step-metadata
- patterns/compensation-stub-recovery-via-replay