ZALANDO 2022-02-02

Zalando — Utilizing Amazon DynamoDB and AWS Lambda for Asynchronous Event Publication

Summary

Zalando Payments' Order Store service stores payment-related order data in a DynamoDB table and needs to emit data-change events to Nakadi (Zalando's Kafka-backed event bus) whenever that data is mutated. The naïve coupled design — write to DynamoDB, then POST to Nakadi in the same request — makes request availability the product of both downstream availabilities (99.9% × 99.9% = 99.8%). Zalando decouples event publication from the synchronous write path using the Transactional Outbox pattern, implemented concretely via DynamoDB Streams + AWS Lambda as a managed message relay, with an AWS SQS dead-letter queue + Kubernetes CronJob as the retry fallback for transient Nakadi failures. The result: the service's availability now depends only on DynamoDB (one dependency, not two); event publication becomes eventually consistent; Zalando is generalising the pattern as a platform offering to all teams (they already ship a PostgreSQL version via a central Kubernetes operator).

Key takeaways

  1. Dependency count multiplies unavailability. With DynamoDB and Nakadi each at 99.9%, a service that depends on both synchronously caps at 99.9% × 99.9% = 99.8%. Removing the Nakadi dependency from the critical path restores the ceiling to 99.9%. Canonicalised as concepts/availability-multiplication-of-dependencies.
  2. Transactional Outbox decouples durable change from event emission. The service commits state to the data store and an outbox entry in one transaction; a separate message relay reads the outbox and publishes the corresponding event asynchronously. Zalando's implementation formalises the pattern with four steps: (1) change entry, (1.5) populate outbox in the same transaction, (2) message relay consumes outbox, (2.5) relay publishes event and marks entry consumed.
  3. DynamoDB Streams + AWS Lambda is the cloud-native relay. DynamoDB Streams is a built-in change-data-capture feed — for every added/updated/deleted item, a dataset containing the old image and new image is emitted (concepts/dynamodb-streams). An AWS Lambda function subscribed to the stream acts as the message relay: it assembles the Nakadi event (full post-change item + JSON patch diff between old and new image) and publishes it. The outbox and the primary table are the same DynamoDB table — the stream is the outbox. Canonicalised as patterns/dynamodb-streams-plus-lambda-outbox-relay.
  4. SQS DLQ + Kubernetes CronJob handles Nakadi transient failures. When the Lambda's retries (with exponential backoff) are exhausted and publication still fails, the event is sent to the Lambda's attached SQS dead-letter queue. A Kubernetes CronJob runs the same Python publication code on an interval, draining the DLQ until Nakadi accepts the event, at which point the entry is removed from SQS. Canonicalised as patterns/sqs-dlq-plus-cron-requeue.
  5. Language choice: Python over Java for the relay. Zalando chose Python for the Lambda implementation citing lightweight-ness vs Java — implicit nod to cold-start cost on serverless runtimes and to minimising per-invocation overhead for a stream-driven relay with unpredictable traffic.
  6. Events are not ordered. The post explicitly notes: "we do not guarantee that the events are published in the correct order" — a consequence of retries and DLQ re-publication. Consumers must tolerate out-of-order deliveries (or assemble their own order via sequence numbers / timestamps). This is the trade-off accepted in exchange for the availability gain.
  7. Outbox as platform primitive. Payments built this in-team first, and then "we are working with our infrastructure teams to offer an implementation of this pattern to all teams at Zalando". Zalando already ships a PostgreSQL Transactional Outbox as a platform offering, managed centrally via a Kubernetes operator. Treating the outbox pattern as a platform product (not per-team code) is the expected next step once one team has validated it.
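
The relay described in takeaways 2–3 can be sketched as a Lambda handler. This is a hypothetical illustration, not Zalando's code: the record shape loosely follows the DynamoDB Streams event format (with images shown de-serialised for brevity, rather than DynamoDB's typed attribute values), and `publish_to_nakadi` is a placeholder for the real Nakadi POST.

```python
# Hypothetical sketch of the DynamoDB Streams -> Nakadi relay Lambda.
# Images are shown as plain dicts for brevity; a real stream record
# carries typed attribute values ({"S": "..."}).

def diff_patch(old, new):
    """Build a minimal JSON-patch-style diff between old and new images."""
    ops = []
    for key in new:
        if key not in old:
            ops.append({"op": "add", "path": f"/{key}", "value": new[key]})
        elif old[key] != new[key]:
            ops.append({"op": "replace", "path": f"/{key}", "value": new[key]})
    for key in old:
        if key not in new:
            ops.append({"op": "remove", "path": f"/{key}"})
    return ops

def assemble_event(record):
    """One stream record -> the event payload described in the post:
    the full post-change item plus a diff between old and new images."""
    old = record["dynamodb"].get("OldImage", {})
    new = record["dynamodb"].get("NewImage", {})
    return {
        "event_type": record["eventName"],  # INSERT / MODIFY / REMOVE
        "data": new,                        # full post-change item
        "diff": diff_patch(old, new),       # what actually changed
    }

def publish_to_nakadi(event):
    pass  # placeholder: real code POSTs to the Nakadi event-type endpoint

def handler(event, context=None):
    # Entry point wired to the DynamoDB stream trigger.
    events = [assemble_event(r) for r in event["Records"]]
    for e in events:
        publish_to_nakadi(e)
    return events
```

Because the stream already records every committed change, there is no separate outbox table to populate or mark consumed — steps 1.5 and 2.5 of the canonical pattern collapse into DynamoDB's own change feed.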

Operational / architectural numbers

  • Pre-decoupling availability ceiling: 99.9% × 99.9% = 99.8% (one of many 0.1%-worse SLOs created by cascading dependencies).
  • Post-decoupling availability ceiling: 99.9% (single DynamoDB dependency; Nakadi unavailability no longer blocks writes).
  • Dependencies on sync path: reduced from 2 → 1.
  • Steps in outbox flow: 4 canonical (change entry, populate outbox, consume outbox, publish event).
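
The availability arithmetic above follows from treating each synchronous dependency as an independent failure source — the request succeeds only if every dependency does, so availabilities multiply:

```python
# Availability of a request that must succeed against every synchronous
# dependency is the product of the individual availabilities.
def serial_availability(*deps):
    result = 1.0
    for a in deps:
        result *= a
    return result

coupled = serial_availability(0.999, 0.999)  # DynamoDB and Nakadi in-path
decoupled = serial_availability(0.999)       # DynamoDB only
```

Removing Nakadi from the critical path is what lifts the ceiling from ~99.8% back to 99.9%.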

Systems / concepts / patterns extracted

Systems

  • Amazon DynamoDB — primary data store for Order Store's payment data; the same table serves as the outbox via its stream.
  • DynamoDB Streams — change data capture feed, configured to expose both old image and new image of each mutated item.
  • AWS Lambda — Python message relay triggered by each DynamoDB Stream record; assembles and publishes the Nakadi event.
  • Nakadi — Zalando's central event bus (Kafka-backed); the publication target for data-change events.
  • AWS SQS — attached as the Lambda's dead-letter queue; holds events whose publication retries have exhausted.
  • Kubernetes — runs the CronJob that re-processes DLQ-parked events.
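
The DLQ-draining CronJob boils down to a receive/publish/delete loop. A hedged sketch, with the SQS and Nakadi calls injected as plain functions (`receive`, `publish`, `delete` are stand-ins for `boto3` SQS calls and the Nakadi client, not names from the post):

```python
# Hypothetical sketch of the Kubernetes CronJob that drains the SQS DLQ:
# receive parked events, retry publication to Nakadi, and delete a message
# from the queue only once Nakadi has accepted the event.
def drain_dlq(receive, publish, delete):
    """receive() -> iterable of (receipt_handle, event);
    publish(event) -> True if Nakadi accepted it;
    delete(receipt_handle) removes the message from SQS."""
    drained = 0
    for handle, event in receive():
        if publish(event):   # Nakadi accepted the event
            delete(handle)   # safe to remove from the DLQ
            drained += 1
        # on failure the message stays in SQS for the next cron run
    return drained
```

Deleting only after a successful publish gives at-least-once delivery: a crash between publish and delete re-publishes the event on the next run, which is consistent with the no-ordering / duplicate-tolerant consumer contract described below.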

Concepts

  • concepts/availability-multiplication-of-dependencies — synchronous availability is the product of the availabilities of all in-path dependencies.
  • concepts/dynamodb-streams — DynamoDB's built-in change-data-capture feed, exposing the old and new image of each mutated item.

Patterns

  • patterns/dynamodb-streams-plus-lambda-outbox-relay — the primary table's stream is the outbox; a subscribed Lambda is the message relay.
  • patterns/sqs-dlq-plus-cron-requeue — an SQS dead-letter queue drained by a Kubernetes CronJob until publication succeeds.

Caveats & trade-offs

  • No ordering guarantee. Out-of-order deliveries are explicitly accepted. Consumers that need strict per-key ordering need to reconstruct order themselves (e.g., from per-item monotonic sequence numbers in the payload).
  • Eventual consistency. A synchronous REST caller sees a 200 as soon as DynamoDB commits; downstream consumers see the event some time later (Lambda invocation + Nakadi publish, possibly plus an SQS DLQ + CronJob cycle). Callers that need the event before proceeding cannot use this pattern as-is.
  • Single-region, single-AZ failure mode implicit. The article is focused on dependency decoupling, not multi-region DR — that's a layer above this design.
  • Cost of per-change Lambda + SQS. The relay runs once per change; for very high-write workloads the Lambda invocation fees compound. Not discussed in the article; worth evaluating per team.
  • Python cold-start tolerance. Choosing Python citing lightweight-ness is sensible for this workload but may not generalise — JVM services with heavy dependencies on classpath warmup would weigh differently; see cold-start.
  • Out-of-order events compound with DLQ requeue. Every DLQ-requeued event arrives out of order with respect to events that succeeded on the first try, compounding the reordering surface area. Consumers must be robust to this.
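
A consumer that needs per-key ordering despite the caveats above can buffer out-of-order events and release them by sequence number. A minimal sketch, assuming the payload carries a per-item monotonic `seq` and a partition `key` (field names are illustrative, not from the post):

```python
# Hedged sketch: a consumer that tolerates out-of-order delivery by
# buffering events per key and releasing them in sequence-number order.
from collections import defaultdict

class Reorderer:
    def __init__(self):
        self.next_seq = defaultdict(int)  # next expected seq per key
        self.pending = defaultdict(dict)  # buffered out-of-order events

    def accept(self, event):
        """Buffer one event; return the events now deliverable in order."""
        key, seq = event["key"], event["seq"]
        self.pending[key][seq] = event
        released = []
        while self.next_seq[key] in self.pending[key]:
            released.append(self.pending[key].pop(self.next_seq[key]))
            self.next_seq[key] += 1
        return released
```

In production this buffer would also need a bound or timeout (a permanently missing sequence number would otherwise stall the key forever) — the trade-off the post accepts by pushing ordering onto consumers.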
