PATTERN Cited by 1 source
Saga over long-running transaction¶
Intent¶
Decompose a logically atomic multi-step workflow from a single long-running database transaction into a sequence of short local transactions connected by compensating actions, so that no single transaction exceeds the platform's hard transaction ceiling (e.g. PlanetScale's 20 s tx cap).
Context¶
Managed database platforms (PlanetScale / Vitess, Aurora Limitless under some configurations, multi-tenant shared clusters generally) enforce hard transaction timeouts as admission control: a single tenant's long-held transaction pins connection-pool slots and row-locks that every other tenant on the same node depends on. The platform's operational guarantees (tail latency, fail-over speed, lock-contention envelope) assume transactions are short. Applications that need to perform multi-step workflows — "charge the card, create the order, reserve inventory, send confirmation email" — have traditionally wrapped the whole thing in one transaction for atomicity. On a platform with a hard tx cap, that pattern breaks.
Forces¶
- Atomicity: the workflow needs to either complete fully or leave the system in a consistent pre-workflow state.
- Platform-imposed timeout: the platform will kill any transaction older than some bound (20 s on PlanetScale), regardless of whether the workflow has more work to do.
- Multi-step latency: the workflow includes slow steps (external API calls, user-visible confirmation UIs, batch processing) that can't be shortened below the timeout.
- Multi-row / multi-table invariants: the atomicity invariant spans multiple rows or tables, so optimistic locking on a single row is not sufficient.
Solution¶
Decompose the workflow into:
- N local transactions, each hitting only one service or one logical unit, each committing in milliseconds.
- N-1 compensating actions, one for each step except the last, each of which semantically undoes that step's effect if a later step fails.
- An orchestrator or choreography that advances through the sequence forward on success or executes compensations in reverse on failure.
Each individual transaction fits well inside the 20 s cap; the logical atomicity is preserved by the compensations rather than by holding locks.
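The decomposition above can be sketched as a minimal saga executor. This is an illustrative sketch, not any specific library's API; `Step` and `run_saga` are invented names.

```python
# Minimal saga executor: N short forward steps, compensations run in
# reverse order when a later step fails. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    forward: Callable[[dict], None]                       # short local transaction
    compensate: Optional[Callable[[dict], None]] = None   # semantic undo, if any

def run_saga(steps: list[Step], ctx: dict) -> bool:
    done: list[Step] = []
    for step in steps:
        try:
            step.forward(ctx)        # each call commits in milliseconds
            done.append(step)
        except Exception:
            # A later step failed: undo the committed steps, newest first.
            for s in reversed(done):
                if s.compensate:
                    s.compensate(ctx)
            return False
    return True
```

Note that each `forward` call is its own local transaction; atomicity is recovered by the reverse walk over `done`, not by holding locks across steps.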
Canonical reference¶
PlanetScale Support's ladder positions sagas as the recommended fix for complex multi-step workflows that can't fit into a single 20 s transaction (Source: sources/2026-04-21-planetscale-supports-notes-from-the-field):
For more complex workflows, consider adopting Sagas
The linked reference is Chris Richardson's Microservices Patterns chapter 4, which is the canonical systems-community articulation of the pattern. The same advice appears in almost every managed-database platform's documentation when hard tx timeouts exist.
Implementation shapes¶
Orchestration-based saga¶
One service (the orchestrator) owns the workflow state and drives each step explicitly:
OrderSaga:
1. PaymentService.charge(order)    → on fail: exit
2. OrderService.create(order)      → on fail: Payment.refund + exit
3. InventoryService.reserve(order) → on fail: Order.cancel + Payment.refund + exit
4. NotificationService.send(order) → on fail: (log, non-critical)
Each step is a short local transaction, each compensation is also a short local transaction, and the orchestrator persists its position in durable state so it can recover after crash. Temporal and Cadence are canonical orchestration substrates for this.
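A hedged sketch of the orchestration shape, with an in-memory dict standing in for the durable state row a substrate like Temporal or Cadence provides; all class and field names here are invented for illustration.

```python
# Orchestrator sketch: persists its position after every committed step so a
# restarted orchestrator resumes instead of re-running completed steps.
from typing import Callable, Dict, List

class OrderSagaOrchestrator:
    def __init__(self, steps: List[Callable], compensations: List[Callable],
                 store: Dict[str, int]):
        self.steps = steps                  # N short local transactions
        self.compensations = compensations  # undo for steps[0..N-2]
        self.store = store                  # stands in for a durable state row

    def run(self, order) -> bool:
        pos = self.store.get("position", 0)     # resume point after a crash
        for i in range(pos, len(self.steps)):
            try:
                self.steps[i](order)            # commits in milliseconds
            except Exception:
                for j in range(i - 1, -1, -1):  # compensate in reverse
                    self.compensations[j](order)
                self.store["position"] = 0
                return False
            self.store["position"] = i + 1      # persist progress
        return True
```

In a real deployment the `store` write and the step's own commit would need to be made atomic (e.g. same database, or an outbox); this sketch elides that.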
Choreography-based saga¶
No central orchestrator; each service listens for domain events and emits new ones, triggering the next step:
PaymentService.charged → OrderService subscribes, creates
OrderService.created → InventoryService subscribes, reserves
InventoryService.reserved → NotificationService subscribes, sends
Any *.failed event → upstream services subscribe + compensate
Chattier, harder to reason about, but no orchestrator single-point-of-failure. Canonical implementation substrate: Kafka / event streams + transactional outbox on each service.
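The choreography can be illustrated with a toy in-process event bus standing in for Kafka. All topic and service names below are made up, and a real deployment would also need the transactional outbox mentioned above.

```python
# Toy synchronous pub/sub standing in for Kafka topics.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, topic, handler):
        self.subs[topic].append(handler)
    def publish(self, topic, payload):
        for handler in list(self.subs[topic]):
            handler(payload)

def wire_services(bus, log, inventory_ok=True):
    # OrderService: reacts to payment.charged by creating the order
    bus.subscribe("payment.charged",
                  lambda o: (log.append("order.created"),
                             bus.publish("order.created", o)))
    # InventoryService: reacts to order.created; emits a failure event on trouble
    def reserve(o):
        if inventory_ok:
            log.append("inventory.reserved")
            bus.publish("inventory.reserved", o)
        else:
            bus.publish("inventory.failed", o)
    bus.subscribe("order.created", reserve)
    # NotificationService: reacts to inventory.reserved
    bus.subscribe("inventory.reserved", lambda o: log.append("email.sent"))
    # Compensations: upstream services subscribe to the failure event
    bus.subscribe("inventory.failed",
                  lambda o: (log.append("order.cancelled"),
                             log.append("payment.refunded")))
```

Note there is no central workflow state: each service only knows which events it consumes and emits, which is exactly what makes the end-to-end flow harder to see in one place.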
Trade-offs¶
- Weaker consistency than a 2PC distributed transaction. Between step 2 and step 3 an external observer can see the system in a state where step 2 is committed but step 3 isn't. Application must be designed for this.
- Compensating actions ≠ rollback. A compensation is a semantic undo, not a database rollback. Charging $100 and compensating with refund($100) is not identical to "no charge ever happened" — there are receipts, audit trails, and timing gaps. Some actions (sending email, shipping goods) have no compensating action and must be moved to the end of the saga.
- Increased code complexity. N forward steps + N-1 compensations + orchestration logic is substantially more code than one transaction.
- Testing discipline required. Every failure-injection path must be tested individually (payment fails; order creates, inventory fails; …). Microservices-patterns literature calls this combinatorial-failure testing.
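One way to mechanize that discipline is a failure-injection harness that fails each step in turn and asserts the exact compensation trail. This is a sketch assuming a simple forward/compensate executor; all names are illustrative.

```python
# Combinatorial failure-injection sketch: for each step index, run the saga
# with exactly that step failing and record what committed and what was undone.
def make_step(name, log, fail=False):
    def forward(ctx):
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    def compensate(ctx):
        log.append(f"undo:{name}")
    return forward, compensate

def run(steps, ctx):
    done = []
    for forward, compensate in steps:
        try:
            forward(ctx)
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo committed steps, newest first
                comp(ctx)
            return False
    return True

def failure_matrix(names):
    """Inject a failure at each position; return the observed log per injection."""
    results = {}
    for k in range(len(names)):
        log = []
        steps = [make_step(n, log, fail=(i == k)) for i, n in enumerate(names)]
        run(steps, {})
        results[k] = log
    return results
```

Each entry of the matrix is one of the failure paths the trade-off above says must be tested individually (payment fails; order creates then inventory fails; and so on).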
When to use¶
- Platform enforces hard transaction timeouts and the workflow provably exceeds them.
- Workflow spans multiple services that don't share a database.
- Workflow includes slow external API calls or human-in-the-loop steps that can't be shortened.
When not to use¶
- Workflow fits well inside the 20 s tx cap — keep the transaction. Simpler, atomic by construction.
- Workflow touches a single row and optimistic locking suffices — use optimistic locking instead, lower complexity.
- Workflow is actually analytics (large scan + aggregate + write-back) — it's not a transaction at all, it's an ETL job. Route to Airbyte / Stitch / an OLAP warehouse.
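For contrast, the simpler rung of the ladder, single-row optimistic locking, can be sketched as a compare-and-swap on a version column. SQLite and the table/column names are used purely for illustration.

```python
# Single-row optimistic locking: read the version, write back only if the
# version is unchanged, retry on conflict. No saga machinery needed.
import sqlite3

def optimistic_update(conn, acct_id, delta, max_retries=3):
    for _ in range(max_retries):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?",
            (acct_id,)).fetchone()
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + delta, acct_id, version))
        conn.commit()
        if cur.rowcount == 1:
            return True    # nobody changed the row underneath us
    return False           # persistent contention: caller decides what to do
```

Both the read and the conditional write are millisecond-scale statements, so this never comes near a transaction-timeout cap.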
Seen in¶
- sources/2026-04-21-planetscale-supports-notes-from-the-field — canonical wiki instance of sagas positioned explicitly as the fix for 20 s transaction-timeout breaches in complex multi-step workflows. PlanetScale Support's own ladder: shorten the tx → optimistic locking (simple case) → sagas (complex multi-step case) → ETL offload to Airbyte/Stitch (OLAP case). Links directly to Richardson's Microservices Patterns chapter 4 as the reference.