
CLOUDFLARE 2026-06-25


How we built saga rollbacks for Cloudflare Workflows

Summary

Cloudflare shipped native saga rollback support for Workflows, allowing developers to declare per-step compensation logic directly as an option on step.do(). When a Workflow fails terminally, the engine automatically invokes registered rollback handlers in reverse step-start order. The post details the API design journey (fluent API → builder API → options object), the internal execution model (durable step history + callable RPC stubs + replay-based recovery), and the interaction between rollback and Workers RPC's promise pipelining semantics.
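
The options-object shape and the reverse walk described above can be sketched with a small in-memory stand-in. This is an illustration under stated assumptions, not the Workflows runtime: MiniStep, runWithRollback, and checkoutExample are hypothetical names, there is no durability or retry logic, and only the call shape step.do(name, fn, { rollback }) is taken from the summary.

```typescript
// Hypothetical in-memory stand-in for a Workflow step object (not the Workflows runtime).
type RollbackHandler<T> = (ctx: { output: T | undefined }) => Promise<void>;

interface StepOptions<T> {
  rollback?: RollbackHandler<T>;
}

interface StartedStep {
  name: string;
  // Bound compensation, present only if the step registered a rollback handler.
  compensate: (() => Promise<void>) | undefined;
}

class MiniStep {
  // Entries are pushed when a step starts, so array order is step-start order.
  readonly started: StartedStep[] = [];

  async do<T>(name: string, forward: () => Promise<T>, options: StepOptions<T> = {}): Promise<T> {
    let output: T | undefined; // stays undefined if the forward body throws
    const rollback = options.rollback;
    this.started.push({
      name,
      compensate: rollback ? () => rollback({ output }) : undefined,
    });
    const result = await forward();
    output = result;
    return result;
  }
}

// Runs the workflow body; on an uncaught (terminal) error, compensates newest-first.
async function runWithRollback(body: (step: MiniStep) => Promise<void>): Promise<string[]> {
  const step = new MiniStep();
  const compensated: string[] = [];
  try {
    await body(step);
  } catch {
    for (const entry of [...step.started].reverse()) {
      if (entry.compensate) {
        await entry.compensate();
        compensated.push(entry.name);
      }
    }
  }
  return compensated;
}

// Usage: "ship" fails, so "charge" is compensated before "reserve".
async function checkoutExample(): Promise<string[]> {
  const log: string[] = [];
  await runWithRollback(async (step) => {
    await step.do("reserve", async () => "r-1", {
      rollback: async ({ output }) => { log.push(`release ${output}`); },
    });
    await step.do("charge", async () => "c-1", {
      rollback: async ({ output }) => { log.push(`refund ${output}`); },
    });
    await step.do("ship", async (): Promise<string> => { throw new Error("carrier unavailable"); });
  });
  return log;
}
```

Because the handler is passed as an option at call time, the stand-in knows about it before the forward body starts, which is the timing property the options-object design is meant to preserve. An error that the workflow body catches itself never reaches runWithRollback's catch block, so no compensation runs in that case.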

Key Takeaways

  1. Saga rollback is now a first-class step.do() option. Developers pass { rollback: async ({ output }) => { ... } } as the third argument. No separate catch blocks, no manual ordering, no ad-hoc cleanup logic. (Section: intro + API design)

  2. Rollback executes in reverse step-start order, not completion order. For parallel steps where completion order is non-deterministic, the engine uses the persisted start order as the stable sequencing source of truth. (Section: ordering)

  3. The failing step itself may need rollback. A step that partially interacted with an external system before failing is still rollback-eligible. Rollback handlers must therefore tolerate output === undefined, because the step may have failed before persisting a value. (Section: detail 1)

  4. Rollback fires only on terminal Workflow failure. If user code catches an error and the Workflow continues, no rollback is triggered; rollback starts only once the Workflow is about to fail terminally. (Section: detail 2)

  5. Fluent API (step.do(...).rollback(...)) was rejected because of promise pipelining. In Workers RPC, .rollback() chained on the returned Promise would look like a pipelined call on the step's output — conflating step metadata with step result. It would also complicate step timing: the engine would need to wait to see if .rollback() gets attached before starting the step. (Section: fluent API rejection)

  6. Builder API (step.saga("name").do(...).rollback(...).run()) was rejected for ceremony. Forgetting .run() would silently drop the step; the pattern introduces a new builder type that makes step.do() feel like a legacy API. (Section: builder API rejection)

  7. Under the hood: durable step history + callable RPC stubs. The engine records whether each step registered a rollback handler. The rollback function itself is kept as a Workers RPC stub — a callable reference that can outlive the immediate step.do() call. (Section: implementation)

  8. Recovery via replay. If the engine restarts and in-memory stubs are lost, Workflows uses its standard replay mechanism — re-running the Workflow code, reading persisted results instead of re-executing forward step bodies — to rebuild the callable rollback stubs for eligible steps. (Section: recovery after restart)

  9. Rollback handlers get the same durable machinery as forward steps. Each compensation runs through Workflows' normal step infrastructure: retries, timeouts, lifecycle events, logs. rollbackConfig lets you set retry limits, delay, backoff, and timeout per handler. (Section: rollbackConfig)

  10. If a rollback handler exhausts retries, the Workflow enters the Errored state and remaining handlers do not execute. This is the explicit stop condition for the reverse walk. (Section: implementation)
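
Takeaways 2, 3, and 10 together describe a reverse walk over persisted step history. The sketch below is one way to picture it, assuming hypothetical record and stub shapes (StepEntry, RollbackStub, reverseWalk, and the startSeq field are illustrative names; the source does not describe the engine's actual storage). A handler that throws stands in for a handler that has exhausted its retries.

```typescript
// Hypothetical record and stub shapes; the source does not describe the engine's actual storage.
interface StepEntry {
  name: string;
  startSeq: number;      // assigned and persisted when the step starts
  output: unknown;       // undefined if the step failed before persisting a value
  hasRollback: boolean;  // whether the step registered a rollback handler
}

type RollbackStub = (ctx: { output: unknown }) => Promise<void>;

interface WalkResult {
  state: "rolledBack" | "errored";
  ran: string[];
  skipped: string[];
}

async function reverseWalk(history: StepEntry[], stubs: Map<string, RollbackStub>): Promise<WalkResult> {
  // filter() returns a new array, so sorting does not mutate the history.
  const eligible = history
    .filter((entry) => entry.hasRollback)
    .sort((a, b) => b.startSeq - a.startSeq); // latest start first

  const ran: string[] = [];
  const skipped: string[] = [];
  let errored = false;
  for (const entry of eligible) {
    if (errored) {
      skipped.push(entry.name); // remaining handlers do not execute
      continue;
    }
    const stub = stubs.get(entry.name);
    try {
      if (!stub) throw new Error(`no stub for ${entry.name}`);
      await stub({ output: entry.output });
      ran.push(entry.name);
    } catch {
      // A throw here stands in for "handler exhausted its retries".
      errored = true;
    }
  }
  return { state: errored ? "errored" : "rolledBack", ran, skipped };
}

// Usage: a, b, c started in that order; c failed with nothing persisted.
async function parallelExample(): Promise<{ result: WalkResult; log: string[] }> {
  const history: StepEntry[] = [
    { name: "a", startSeq: 1, output: "a-out", hasRollback: true },
    { name: "b", startSeq: 2, output: "b-out", hasRollback: true },
    { name: "c", startSeq: 3, output: undefined, hasRollback: true },
  ];
  const log: string[] = [];
  const stubs = new Map<string, RollbackStub>();
  for (const entry of history) {
    stubs.set(entry.name, async ({ output }) => {
      log.push(
        output === undefined
          ? `${entry.name}: no output, check external state`
          : `${entry.name}: undo ${String(output)}`,
      );
    });
  }
  const result = await reverseWalk(history, stubs);
  return { result, log };
}
```

Sorting on the persisted start sequence rather than on completion time is what keeps the order stable for parallel steps, since the order in which they finish can differ between runs.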

Operational Numbers

  • rollbackConfig exposes retries.limit, retries.delay, retries.backoff (exponential), and a per-handler timeout.
  • No throughput/scale numbers disclosed in this post (see the prior Workflows V2 post for 50K concurrent instances and 300 new instances per second).
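
To get a feel for what these settings imply, the sketch below computes a retry schedule and a worst-case duration for a single handler. The field names come from the summary; everything else is an assumption made for illustration: values are plain milliseconds, exponential backoff doubles the base delay on each retry, limit counts retries after one initial attempt, and retryDelays, worstCaseMs, and refundConfig are hypothetical names.

```typescript
// Field names follow the summary; millisecond numbers and the formulas are assumptions.
interface RollbackConfig {
  retries: { limit: number; delay: number; backoff: "constant" | "exponential" };
  timeout: number; // per-attempt timeout, in ms
}

// Delay before each retry, assuming exponential backoff doubles the base delay each time.
function retryDelays(config: RollbackConfig): number[] {
  const { limit, delay, backoff } = config.retries;
  const delays: number[] = [];
  for (let retry = 0; retry < limit; retry++) {
    delays.push(backoff === "exponential" ? delay * Math.pow(2, retry) : delay);
  }
  return delays;
}

// Upper bound on how long one handler can hold up the reverse walk,
// assuming one initial attempt plus `limit` retries, each hitting the timeout.
function worstCaseMs(config: RollbackConfig): number {
  const waits = retryDelays(config).reduce((sum, d) => sum + d, 0);
  return (config.retries.limit + 1) * config.timeout + waits;
}

const refundConfig: RollbackConfig = {
  retries: { limit: 3, delay: 1000, backoff: "exponential" },
  timeout: 5000,
};

const schedule = retryDelays(refundConfig); // [1000, 2000, 4000]
const bound = worstCaseMs(refundConfig);    // 4 * 5000 + 7000 = 27000
```

Since compensation is sequential, a bound like this accumulates across handlers, which is worth keeping in mind when choosing limits for handlers that sit early in the reverse walk.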

Caveats

  • Sequential rollback only (parallel compensation is listed as future work).
  • If a rollback handler fails after exhausting retries, remaining handlers are skipped — no partial-compensation-with-continuation model.
  • The post does not disclose how the engine handles cases where replay fails to reach a step that registered rollback (e.g., non-deterministic Workflow code that takes a different branch on replay).
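
The last caveat is easier to see with a toy model of replay. The sketch below re-runs workflow code against persisted results, never calls forward step bodies, and re-registers rollback stubs as it goes. All names and shapes (PersistedStep, ReplayStep, rebuildStubs) are hypothetical. The missing list only surfaces steps that history marks as having a handler but that replay never reached; what the real engine does in that situation is not disclosed, and this sketch does not guess.

```typescript
// Hypothetical shapes; the source does not describe the engine's persisted format.
interface PersistedStep {
  output: unknown;
  failed: boolean;       // true for the step that ended the original run
  hasRollback: boolean;  // recorded when the step registered a handler
}

type Stub = (ctx: { output: unknown }) => Promise<void>;

interface ReplayOptions<T> {
  rollback?: (ctx: { output: T | undefined }) => Promise<void>;
}

class ReplayStep {
  readonly stubs = new Map<string, Stub>();
  private readonly history: Map<string, PersistedStep>;

  constructor(history: Map<string, PersistedStep>) {
    this.history = history;
  }

  // Same call shape as a forward step, but the forward body is never invoked.
  async do<T>(name: string, _forward: () => Promise<T>, options: ReplayOptions<T> = {}): Promise<T> {
    const persisted = this.history.get(name);
    if (!persisted) throw new Error(`replay reached unrecorded step "${name}"`);
    const rollback = options.rollback;
    if (rollback) {
      this.stubs.set(name, (ctx) => rollback({ output: ctx.output as T | undefined }));
    }
    if (persisted.failed) throw new Error(`step "${name}" failed in the original run`);
    return persisted.output as T;
  }
}

async function rebuildStubs(
  workflow: (step: ReplayStep) => Promise<void>,
  history: Map<string, PersistedStep>,
): Promise<{ stubs: Map<string, Stub>; missing: string[] }> {
  const step = new ReplayStep(history);
  try {
    await workflow(step);
  } catch {
    // Expected: replay ends where the original run failed.
  }
  // Steps recorded as having a handler that this replay never re-registered.
  const missing: string[] = [];
  history.forEach((entry, name) => {
    if (entry.hasRollback && !step.stubs.has(name)) missing.push(name);
  });
  return { stubs: step.stubs, missing };
}
```

With deterministic workflow code, missing comes back empty and every eligible step has a callable stub again. If the code takes a different branch on replay, for example by skipping a step behind a condition that changed, that step shows up in missing.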
