PATTERN Cited by 1 source
Saga over long-running transaction¶
Intent¶
Decompose a logically atomic multi-step workflow from a single long-running database transaction into a sequence of short local transactions connected by compensating actions, so that no single transaction exceeds the platform's hard transaction ceiling (e.g. PlanetScale's 20 s tx cap).
Context¶
Managed database platforms (PlanetScale / Vitess, Aurora Limitless under some configurations, multi-tenant shared clusters generally) enforce hard transaction timeouts as admission control: a single tenant's long-held transaction pins connection-pool slots and row-locks that every other tenant on the same node depends on. The platform's operational guarantees (tail latency, fail-over speed, lock-contention envelope) assume transactions are short. Applications that need to perform multi-step workflows — "charge the card, create the order, reserve inventory, send confirmation email" — have traditionally wrapped the whole thing in one transaction for atomicity. On a platform with a hard tx cap, that pattern breaks.
Forces¶
- Atomicity: the workflow needs to either complete fully or leave the system in a consistent pre-workflow state.
- Platform-imposed timeout: the platform will kill any transaction older than some bound (20 s on PlanetScale), regardless of whether the workflow has more work to do.
- Multi-step latency: the workflow includes slow steps (external API calls, user-visible confirmation UIs, batch processing) that can't be shortened below the timeout.
- Multi-row / multi-table invariants: the atomicity invariant spans multiple rows or tables, so optimistic locking on a single row is not sufficient.
Solution¶
Decompose the workflow into:
- N local transactions, each hitting only one service or one logical unit, each committing in milliseconds.
- N-1 compensating actions, one for each step except the last, each of which semantically undoes that step's effect if a later step fails.
- An orchestrator or choreography that advances through the sequence forward on success or executes compensations in reverse on failure.
Each individual transaction fits well inside the 20 s cap; the logical atomicity is preserved by the compensations rather than by holding locks.
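The decomposition above can be sketched as a minimal saga executor. This is an illustrative sketch, not any specific library's API; `Step` and `run_saga` are invented names.

```python
# Minimal saga executor: N short forward steps, compensations run in
# reverse order when a later step fails. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    forward: Callable[[dict], None]                       # short local transaction
    compensate: Optional[Callable[[dict], None]] = None   # semantic undo, if any

def run_saga(steps: list[Step], ctx: dict) -> bool:
    done: list[Step] = []
    for step in steps:
        try:
            step.forward(ctx)        # each call commits in milliseconds
            done.append(step)
        except Exception:
            # A later step failed: undo the committed steps, newest first.
            for s in reversed(done):
                if s.compensate:
                    s.compensate(ctx)
            return False
    return True
```

Note that each `forward` call is its own local transaction; atomicity is recovered by the reverse walk over `done`, not by holding locks across steps.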
Canonical reference¶
PlanetScale Support's ladder positions sagas as the recommended fix for complex multi-step workflows that can't fit into a single 20 s transaction (Source: sources/2026-04-21-planetscale-supports-notes-from-the-field):
For more complex workflows, consider adopting Sagas
The linked reference is Chris Richardson's Microservices Patterns chapter 4, which is the canonical systems-community articulation of the pattern. The same advice appears in almost every managed-database platform's documentation when hard tx timeouts exist.
Implementation shapes¶
Orchestration-based saga¶
One service (the orchestrator) owns the workflow state and drives each step explicitly:
OrderSaga:
1. PaymentService.charge(order)    → on fail: exit
2. OrderService.create(order)      → on fail: Payment.refund + exit
3. InventoryService.reserve(order) → on fail: Order.cancel + Payment.refund + exit
4. NotificationService.send(order) → on fail: (log, non-critical)
Each step is a short local transaction, each compensation is also a short local transaction, and the orchestrator persists its position in durable state so it can recover after crash. Temporal and Cadence are canonical orchestration substrates for this.
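A hedged sketch of the orchestration shape, with an in-memory dict standing in for the durable state row a substrate like Temporal or Cadence provides; all class and field names here are invented for illustration.

```python
# Orchestrator sketch: persists its position after every committed step so a
# restarted orchestrator resumes instead of re-running completed steps.
from typing import Callable, Dict, List

class OrderSagaOrchestrator:
    def __init__(self, steps: List[Callable], compensations: List[Callable],
                 store: Dict[str, int]):
        self.steps = steps                  # N short local transactions
        self.compensations = compensations  # undo for steps[0..N-2]
        self.store = store                  # stands in for a durable state row

    def run(self, order) -> bool:
        pos = self.store.get("position", 0)     # resume point after a crash
        for i in range(pos, len(self.steps)):
            try:
                self.steps[i](order)            # commits in milliseconds
            except Exception:
                for j in range(i - 1, -1, -1):  # compensate in reverse
                    self.compensations[j](order)
                self.store["position"] = 0
                return False
            self.store["position"] = i + 1      # persist progress
        return True
```

In a real deployment the `store` write and the step's own commit would need to be made atomic (e.g. same database, or an outbox); this sketch elides that.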
Choreography-based saga¶
No central orchestrator; each service listens for domain events and emits new ones, triggering the next step:
PaymentService.charged → OrderService subscribes, creates
OrderService.created → InventoryService subscribes, reserves
InventoryService.reserved → NotificationService subscribes, sends
Any *.failed event → upstream services subscribe + compensate
Chattier, harder to reason about, but no orchestrator single-point-of-failure. Canonical implementation substrate: Kafka / event streams + transactional outbox on each service.
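The choreography can be illustrated with a toy in-process event bus standing in for Kafka. All topic and service names below are made up, and a real deployment would also need the transactional outbox mentioned above.

```python
# Toy synchronous pub/sub standing in for Kafka topics.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, topic, handler):
        self.subs[topic].append(handler)
    def publish(self, topic, payload):
        for handler in list(self.subs[topic]):
            handler(payload)

def wire_services(bus, log, inventory_ok=True):
    # OrderService: reacts to payment.charged by creating the order
    bus.subscribe("payment.charged",
                  lambda o: (log.append("order.created"),
                             bus.publish("order.created", o)))
    # InventoryService: reacts to order.created; emits a failure event on trouble
    def reserve(o):
        if inventory_ok:
            log.append("inventory.reserved")
            bus.publish("inventory.reserved", o)
        else:
            bus.publish("inventory.failed", o)
    bus.subscribe("order.created", reserve)
    # NotificationService: reacts to inventory.reserved
    bus.subscribe("inventory.reserved", lambda o: log.append("email.sent"))
    # Compensations: upstream services subscribe to the failure event
    bus.subscribe("inventory.failed",
                  lambda o: (log.append("order.cancelled"),
                             log.append("payment.refunded")))
```

Note there is no central workflow state: each service only knows which events it consumes and emits, which is exactly what makes the end-to-end flow harder to see in one place.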
Trade-offs¶
- Weaker consistency than a 2PC distributed transaction. Between step 2 and step 3 an external observer can see the system in a state where step 2 is committed but step 3 isn't. Application must be designed for this.
- Compensating actions ≠ rollback. A compensation is a semantic undo, not a database rollback. Charging $100 and compensating with refund($100) is not identical to "no charge ever happened" — there are receipts, audit trails, and timing gaps. Some actions (sending email, shipping goods) have no compensating action and must be moved to the end of the saga.
- Increased code complexity. N forward steps + N-1 compensations + orchestration logic is substantially more code than one transaction.
- Testing discipline required. Every failure-injection path must be tested individually (payment fails; order creates, inventory fails; …). Microservices-patterns literature calls this combinatorial-failure testing.
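One way to mechanize that discipline is a failure-injection harness that fails each step in turn and asserts the exact compensation trail. This is a sketch assuming a simple forward/compensate executor; all names are illustrative.

```python
# Combinatorial failure-injection sketch: for each step index, run the saga
# with exactly that step failing and record what committed and what was undone.
def make_step(name, log, fail=False):
    def forward(ctx):
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    def compensate(ctx):
        log.append(f"undo:{name}")
    return forward, compensate

def run(steps, ctx):
    done = []
    for forward, compensate in steps:
        try:
            forward(ctx)
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo committed steps, newest first
                comp(ctx)
            return False
    return True

def failure_matrix(names):
    """Inject a failure at each position; return the observed log per injection."""
    results = {}
    for k in range(len(names)):
        log = []
        steps = [make_step(n, log, fail=(i == k)) for i, n in enumerate(names)]
        run(steps, {})
        results[k] = log
    return results
```

Each entry of the matrix is one of the failure paths the trade-off above says must be tested individually (payment fails; order creates then inventory fails; and so on).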
When to use¶
- Platform enforces hard transaction timeouts and the workflow provably exceeds them.
- Workflow spans multiple services that don't share a database.
- Workflow includes slow external API calls or human-in-the-loop steps that can't be shortened.
When not to use¶
- Workflow fits well inside the 20 s tx cap — keep the transaction. Simpler, atomic by construction.
- Workflow touches a single row and optimistic locking suffices — use optimistic locking instead, lower complexity.
- Workflow is actually analytics (large scan + aggregate + write-back) — it's not a transaction at all, it's an ETL job. Route to Airbyte / Stitch / an OLAP warehouse.
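For contrast, the simpler rung of the ladder, single-row optimistic locking, can be sketched as a compare-and-swap on a version column. SQLite and the table/column names are used purely for illustration.

```python
# Single-row optimistic locking: read the version, write back only if the
# version is unchanged, retry on conflict. No saga machinery needed.
import sqlite3

def optimistic_update(conn, acct_id, delta, max_retries=3):
    for _ in range(max_retries):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?",
            (acct_id,)).fetchone()
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + delta, acct_id, version))
        conn.commit()
        if cur.rowcount == 1:
            return True    # nobody changed the row underneath us
    return False           # persistent contention: caller decides what to do
```

Both the read and the conditional write are millisecond-scale statements, so this never comes near a transaction-timeout cap.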
Seen in¶
- sources/2026-04-21-planetscale-supports-notes-from-the-field — canonical wiki instance of sagas positioned explicitly as the fix for 20 s transaction-timeout breaches in complex multi-step workflows. PlanetScale Support's own ladder: shorten the tx → optimistic locking (simple case) → sagas (complex multi-step case) → ETL offload to Airbyte/Stitch (OLAP case). Links directly to Richardson's Microservices Patterns chapter 4 as the reference.