PATTERN Cited by 1 source
Transactional Outbox¶
Definition¶
The Transactional Outbox pattern decouples a durable state change from publishing the corresponding event by writing both atomically into the same data store — the primary row plus an outbox entry in one transaction — and then having a separate message relay asynchronously read the outbox and publish the event to the message bus.
This avoids the dual-write problem: if a service writes to its database and then publishes to a message bus in two separate operations, it can succeed at one and fail at the other, leaving the two systems inconsistent. With an outbox, the database transaction is the only commit point; the event is guaranteed to eventually be published because it's already durably persisted.
Canonical four-step shape¶
Zalando's post walks through the pattern as four canonical steps (Source: sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication):
- Change entry: the service receives the synchronous request and begins the data-store transaction.
- Populate outbox (step 1.5 in Zalando's diagram): as part of the same transaction, an outbox entry is written describing the change. Either both commit or neither does.
- Consume outbox: a message relay reads new outbox entries. Realisations differ in how they do this, using polling, push-based CDC, or (the sweet spot) a native change stream on the same store.
- Publish event (step 2.5): the relay transforms the outbox entry into a domain event and publishes it to the bus. Only after successful publication is the entry marked consumed.
    Client ─▶ Service ─ tx begin ─▶ Table row + outbox entry ─ tx commit ─▶ 200 OK
                                                │
                                                ▼
                                      Message relay (async)
                                                │
                                                ▼
                                           Message bus
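The first two steps can be sketched with any store that supports multi-statement transactions. The example below is a minimal illustration using Python's built-in `sqlite3`, not Zalando's implementation; the `orders` and `outbox` tables, their columns, and the `create_order` function are hypothetical names chosen for the sketch.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a business table and an outbox table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def create_order(conn: sqlite3.Connection, order_id: str, status: str, details: dict) -> None:
    """Write the order row and its outbox entry in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the row and the outbox entry are stored together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, status))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"id": order_id, "status": status, **details})),
        )
```

If building the outbox entry fails after the order row has been inserted, the whole transaction rolls back, which is the "either both commit or neither does" property from the second step.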
Why it's the right default for async event emission¶
- Removes a hard dependency from the critical path. The service's synchronous availability is now A(database), not A(database) × A(bus). Canonicalised in concepts/availability-multiplication-of-dependencies.
- No lost events. Events are durably stored before any attempt to publish; relay failure is recoverable.
- No ghost events. If the DB transaction aborts, the outbox entry doesn't exist; nothing ever gets published for a change that didn't commit.
- Consumers see eventual consistency, not inconsistency. The event arrives later than the DB commit, but it always arrives.
- Easy to add new sinks. A second relay can consume the same outbox and publish to a different sink without touching the service.
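The availability argument is plain multiplication of serial dependencies. The short sketch below works through it with the 99.9% figures from Zalando's post; the function and variable names are illustrative only.

```python
import math


def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every listed dependency to be up."""
    return math.prod(availabilities)


db = 0.999   # availability of the database
bus = 0.999  # availability of the message bus

# Dual write: the synchronous path needs both the database and the bus.
dual_write = serial_availability(db, bus)

# Outbox: the synchronous path needs only the database.
with_outbox = serial_availability(db)

print(f"dual write: {dual_write:.4%}, outbox: {with_outbox:.4%}")
```

With these inputs the dual-write path comes out at roughly 99.8%, while the outbox path stays at the database's own 99.9%.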
Realisations¶
| Realisation | Outbox storage | Relay mechanism |
|---|---|---|
| Polling table | Dedicated outbox table in same DB | Cron / worker polls, marks consumed |
| Debezium / CDC on outbox table | Dedicated outbox table | CDC connector (e.g. Kafka Connect) |
| Native change stream on primary table | The primary table is the outbox | Service-native stream consumer |
| Message-bus transactional write (rare) | DB + bus in 2PC | 2PC coordinator |
The native-stream realisation is the cleanest: no separate outbox table, no dual-write risk even in relay code, no cleanup job. Zalando's DynamoDB-native realisation uses DynamoDB Streams + AWS Lambda and canonicalises as patterns/dynamodb-streams-plus-lambda-outbox-relay.
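As a rough sketch of the native-stream relay, the handler below processes a DynamoDB Streams batch in AWS Lambda. It is not Zalando's code. It assumes the stream view type includes new images, that the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled, and that `publish` is a hypothetical callable that sends one event to the bus.

```python
from typing import Any, Callable, Dict, List


def make_handler(publish: Callable[[Dict[str, Any]], None]):
    """Build a Lambda handler that relays DynamoDB Streams records to `publish` (hypothetical)."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, List[Dict[str, str]]]:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            # Only inserts and updates become events in this sketch; deletes are skipped.
            if record.get("eventName") not in ("INSERT", "MODIFY"):
                continue
            change = record["dynamodb"]
            try:
                # Items are passed through in DynamoDB's typed JSON form, e.g. {"S": "o-1"}.
                publish({"keys": change["Keys"], "new_image": change.get("NewImage")})
            except Exception:
                # Report the first failing record and stop, so it and the records
                # after it are retried in order.
                failures.append({"itemIdentifier": change["SequenceNumber"]})
                break
        return {"batchItemFailures": failures}

    return handler
```

Stopping at the first failure keeps the retry aligned with stream order, at the cost of re-delivering any later records in the batch.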
Zalando also ships a Postgres variant as a platform offering managed via a central Kubernetes operator — named but not architecturally detailed in the post. The polling-or-CDC shape is the typical Postgres realisation (Debezium + Kafka Connect or a custom poller).
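For the polling realisation, one relay pass might look like the sketch below. It uses `sqlite3` as a stand-in for the service database; the `outbox` schema, the `relay_once` function, and the `publish` callable are all assumptions made for illustration, and the post does not describe Zalando's Postgres relay at this level.

```python
import sqlite3
from typing import Callable

# Hypothetical outbox schema; a real table would usually also carry an event type and timestamps.
OUTBOX_SCHEMA = (
    "CREATE TABLE outbox ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL,"
    " published INTEGER NOT NULL DEFAULT 0)"
)


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[int, str], None],
    batch_size: int = 100,
) -> int:
    """Publish pending outbox entries in sequence order; return how many were published."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for seq, payload in rows:
        try:
            publish(seq, payload)
        except Exception:
            # Leave the entry unpublished and stop; the next poll retries from here.
            break
        # Mark consumed only after the bus accepted the event. A crash between
        # publish and this update causes a duplicate on the next poll.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

A worker or cron job would call `relay_once` on a fixed interval. Passing `seq` to `publish` gives consumers a stable identifier they can use for de-duplication.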
Fallback: DLQ for transient bus failures¶
Even with the outbox decoupled, the relay → bus publish step can fail. Robust designs add a dead-letter queue and a periodic re-drain:
- Relay publish with exponential-backoff retries.
- On exhaustion, event → DLQ.
- Periodic cron / worker drains the DLQ, re-publishing until the bus accepts.
Canonicalised as patterns/sqs-dlq-plus-cron-requeue in Zalando's implementation (Lambda retries → SQS DLQ → Kubernetes CronJob requeue).
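The three steps above can be sketched in-process as follows. This is a simplified illustration: a `deque` stands in for the SQS dead-letter queue, `drain_dlq` stands in for the periodic requeue job, and `publish` is a hypothetical callable that raises when the bus rejects an event.

```python
import time
from collections import deque
from typing import Any, Callable, Deque


def publish_with_retry(
    event: Any,
    publish: Callable[[Any], None],
    dlq: Deque[Any],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to publish with exponential backoff; park the event in the DLQ on exhaustion."""
    for attempt in range(max_attempts):
        try:
            publish(event)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    dlq.append(event)
    return False


def drain_dlq(dlq: Deque[Any], publish: Callable[[Any], None]) -> int:
    """One periodic pass over the DLQ; events that still fail go back on the queue."""
    requeued = 0
    for _ in range(len(dlq)):
        event = dlq.popleft()
        try:
            publish(event)
            requeued += 1
        except Exception:
            dlq.append(event)
    return requeued
```

Events that take the DLQ detour are published after newer events that succeeded directly, which is where the reordering discussed under Trade-offs comes from.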
Trade-offs¶
- Eventual consistency for consumers. The synchronous caller sees a 200 as soon as the DB transaction commits; downstream event consumers see the event some time later (relay latency + bus latency + any DLQ detour).
- At-least-once + possible out-of-order. Retries and DLQ requeue both produce duplicates; DLQ requeue also breaks global ordering. Consumers must be idempotent and tolerate reordering — see concepts/at-least-once-delivery.
- Outbox growth. If the relay falls behind, the outbox grows. Dedicated-table realisations need cleanup; native-stream realisations rely on the stream retention window (fixed at 24 hours for DynamoDB Streams), after which unread change records are no longer available from the stream.
- Relay cost. Polling realisations incur continuous DB load even when no changes are happening; stream-based realisations only run per change but still have per-invocation cost (Lambda fees etc.).
- Cannot be used when the caller needs the event before proceeding. By construction the event is async. Sagas and other multi-step workflows need additional coordination.
- Schema coupling between outbox and event. The relay has to know how to transform outbox rows into published events; any schema change requires a migration across both.
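One common way for a consumer to cope with duplicates and reordering is to apply an event only if it is newer than what has already been applied for that key. The sketch below assumes each event carries the full new state plus a per-key, monotonically increasing `version`; neither field is described in Zalando's post, and the class name is hypothetical.

```python
from typing import Dict


class OrderProjection:
    """Consumer-side view that tolerates duplicate and out-of-order events."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self._version: Dict[str, int] = {}

    def apply(self, order_id: str, version: int, status: str) -> bool:
        """Apply the event if it is newer than the last one applied for this order."""
        if version <= self._version.get(order_id, 0):
            # Duplicate or stale event (for example one requeued from a DLQ): ignore it.
            return False
        self._version[order_id] = version
        self.status[order_id] = status
        return True
```

This only works for state-carrying events. Events that describe deltas need a different scheme, such as de-duplicating on event ID and buffering until gaps are filled.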
When to use¶
- Service owns a database and needs to emit events on every state change.
- The database is the source of truth; the event bus is a secondary sink.
- Downstream consumers can tolerate eventual, duplicated, possibly reordered delivery.
- Availability of the service is an SLO.
When NOT to use¶
- Request–response integration where the caller must see the event's effect before the response is returned — use a synchronous API, not events.
- Tight-latency event emission where "some time later" is not tolerable. Sub-second SLAs on event delivery under load are achievable with outbox but require careful sizing.
- No persistent store on the service. If the service is stateless, there's nothing to transactionally attach the outbox to.
- Strict ordering required alongside a DLQ. DLQ requeue breaks global ordering; per-key ordering can be preserved by some schemes, but only with careful partitioning.
Seen in¶
- sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication — The canonical wiki worked example. Zalando Payments's Order Store decouples its synchronous REST path from Nakadi event emission using DynamoDB + DynamoDB Streams (outbox is the primary table) + Lambda relay + SQS DLQ + Kubernetes CronJob fallback. The post walks through the four canonical steps with diagrams and motivates the redesign with the explicit 99.9% × 99.9% = 99.8% availability arithmetic. Zalando generalises the pattern as a platform offering across the company; a Postgres variant managed by a central Kubernetes operator already ships internally for teams whose primary store is Postgres.
Related¶
- patterns/dynamodb-streams-plus-lambda-outbox-relay — the DynamoDB-native realisation.
- patterns/sqs-dlq-plus-cron-requeue — the retry / fallback layer typical in production outbox implementations.
- concepts/availability-multiplication-of-dependencies — the problem this pattern solves.
- concepts/event-driven-architecture — the aggregate shape.
- concepts/eventual-consistency, concepts/at-least-once-delivery — the semantic costs.
- concepts/dynamodb-streams, concepts/change-data-capture — common relay substrates.
- systems/dynamodb, systems/aws-lambda, systems/nakadi
- companies/zalando