
PATTERN

SQS DLQ + cron requeue

Definition

SQS DLQ + cron requeue is a two-layer retry pattern for best-effort event publication: an AWS Lambda (or equivalent consumer) first performs in-process retries with exponential backoff; if the bus is still refusing the event once the retry ladder is exhausted, the event is parked in an SQS dead-letter queue; a Kubernetes CronJob (or equivalent long-running worker) periodically drains the DLQ, re-running the same publication code until the target bus accepts.

The effect: zero event loss for publication targets that can have transient outages or rate-limit episodes, bounded latency impact on the happy path (retries don't block the stream), and durable persistence of stuck events via SQS.

Shape

Primary relay (Lambda / consumer)
   │ publish → target bus
   │ retry (exp. backoff)
   │ retry...
   │ retry...
   │ retries exhausted
   ▼
SQS dead-letter queue  ◀──── durable fallback storage
   │ (idle while bus is unhealthy)
   ▼
Kubernetes CronJob (every N min)
   │ read message
   │ publish → target bus
   │ on success → delete from SQS
   │ on failure → leave in SQS for next cron tick
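
A minimal sketch of the primary relay half, assuming the publication logic lives in one shared module (called publisher.py here; every name, retry count, and delay is illustrative, not from the post):

```python
# publisher.py -- hypothetical shared module; publish_to_bus is a stand-in
# for the real client (in Zalando's relay, an HTTP publish to Nakadi).
import random
import time


class BusUnavailableError(Exception):
    """The target bus refused the event (timeout, 5xx, rate limit, ...)."""


def publish_to_bus(event: dict) -> None:
    """Placeholder: replace with the real bus client call."""
    raise BusUnavailableError("bus is down")


def publish_with_retries(event: dict, max_attempts: int = 4, base_delay: float = 0.5) -> None:
    """In-process retry ladder: exponential backoff, then raise.

    The final raise is the hand-off point: a failed Lambda invocation
    routes the event to the attached SQS dead-letter queue.
    """
    for attempt in range(max_attempts):
        try:
            publish_to_bus(event)
            return
        except BusUnavailableError:
            if attempt == max_attempts - 1:
                raise  # ladder exhausted -> let the runtime park the event in SQS
            # exponential backoff with a little jitter: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def lambda_handler(event, context):
    """Fast path: invoked per DynamoDB Streams batch."""
    for record in event.get("Records", []):
        publish_with_retries(record)
```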

Why two layers

The in-process retry loop handles brief transient failures — momentary bus latency spikes, jittery network, leader elections — without paying the DLQ-round-trip cost or losing place in the stream. The DLQ + cron handles longer outages — minutes to hours of target-bus unhappiness — where continuing to retry in-process would either block the stream (and grow iterator age on DynamoDB Streams) or drop the event entirely.

The cron interval becomes the knob: shorter interval = faster recovery after the bus heals; longer interval = lower steady-state cost. Zalando does not cite a specific interval in the post but describes the mechanism explicitly.
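
The drain side, as a hedged boto3 sketch (queue URL, batch size, and timeouts are assumptions): receive a batch, re-run the same publish function, delete on success, and otherwise let the visibility timeout lapse so the next tick retries.

```python
# cron_drain.py -- hypothetical CronJob entrypoint; imports the same
# publish function the Lambda uses (see the publisher.py sketch above).
import json

import boto3

from publisher import BusUnavailableError, publish_with_retries

QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/relay-dlq"  # placeholder


def drain_dlq() -> None:
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=5,       # long poll
            VisibilityTimeout=120,   # must exceed worst-case publish time (see Trade-offs)
        )
        messages = resp.get("Messages", [])
        if not messages:
            return  # drained, or everything in flight failed -> exit until next tick
        for msg in messages:
            try:
                publish_with_retries(json.loads(msg["Body"]))
            except BusUnavailableError:
                continue  # leave in SQS; it reappears after the visibility timeout
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    drain_dlq()
```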

The "same code on two substrates" property

A subtle win: the CronJob runs the same publication code as the Lambda. From Zalando's post:

"In order to retry sending the events in the queue in intervals we created a Kubernetes cronjob. The cronjob simply runs the Python code that is also run by the AWS Lambda and tries to publish the events to Nakadi again." (Source: sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication)

This matters because:

  • No drift between fast path and slow path. A bug fixed in the Lambda is automatically fixed in the cron drain: same code, two triggers.
  • Single schema knowledge. The event assembly logic lives in one place.
  • Testing surface is smaller. Unit-test the publish function once and both paths are covered (see the test sketch below).

This design choice is one of the pattern's load-bearing advantages — a common anti-pattern is to hand-roll the DLQ drain in a different runtime or language and let the two paths diverge.
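
One way to see the smaller testing surface: a single pytest sketch against the hypothetical publisher module above covers the retry ladder that both triggers share.

```python
# test_publisher.py -- illustrative only; exercises the shared retry ladder.
import pytest

import publisher


def test_retries_then_hands_off_to_dlq(monkeypatch):
    calls = []

    def always_down(event):
        calls.append(event)
        raise publisher.BusUnavailableError("still down")

    monkeypatch.setattr(publisher, "publish_to_bus", always_down)
    monkeypatch.setattr(publisher.time, "sleep", lambda _s: None)  # skip backoff waits

    with pytest.raises(publisher.BusUnavailableError):
        publisher.publish_with_retries({"id": 1}, max_attempts=3)

    assert len(calls) == 3  # full ladder ran before the hand-off raise
```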

Built-in DLQ coupling

AWS Lambda's event-source integration provides a DLQ slot out of the box: "when creating a new AWS Lambda function it already comes with an AWS SQS queue attached as a dead letter queue." No hand-wiring of the durability layer — the primitive is built into the runtime's failure semantics.
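
For illustration only: the post says the queue comes attached, but where the slot is wired explicitly, the attachment for asynchronous invocation failures looks like the boto3 sketch below (function name and queue ARN are placeholders; stream event-source mappings use an on-failure destination instead).

```python
# Hedged sketch: point a function's dead-letter config at an SQS queue.
import boto3

lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="order-store-relay",  # placeholder function name
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:eu-central-1:123456789012:relay-dlq",  # placeholder ARN
    },
)
```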

Trade-offs

  • At-least-once, with duplication on both layers. The Lambda's retries may succeed after an ack was lost, producing duplicates. The cron drain re-publishes events that may have already been accepted on an earlier pass. Consumers must be idempotent. See concepts/at-least-once-delivery.
  • Ordering is broken. DLQ-drained events arrive out of order relative to events that succeeded first try. Per-key consumers needing strict order must reconstruct it from payload sequence numbers.
  • Cron interval bounds recovery latency. A stuck event sits in SQS until the next cron tick + successful publish — N minutes minimum. Higher cron frequency = better recovery, more cost.
  • SQS visibility-timeout sizing. The cron worker must complete publication within the visibility timeout, or the message re-delivers while in flight → extra duplicates.
  • Parallel drain can amplify load. If the cron worker is multi-replica, ensure deduplication so the bus isn't hit with concurrent re-publications of the same event.
  • Poison-pill events sit forever. An event the bus will never accept (bad schema, oversized payload) loops in the DLQ until humans drain it. A secondary monitor on DLQ depth + per-message age is advisable (a minimal sketch follows this list).
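
A minimal monitor sketch for the poison-pill bullet above, using the ApproximateAgeOfOldestMessage metric SQS publishes to CloudWatch (queue name and threshold are assumptions):

```python
# dlq_monitor.py -- hypothetical; flag messages stuck beyond the retry horizon.
import datetime

import boto3

QUEUE_NAME = "relay-dlq"     # placeholder queue name
MAX_AGE_SECONDS = 6 * 3600   # older than this -> likely a poison pill


def oldest_message_age() -> float:
    """Max ApproximateAgeOfOldestMessage over the last 10 minutes, in seconds."""
    cw = boto3.client("cloudwatch")
    now = datetime.datetime.now(datetime.timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - datetime.timedelta(minutes=10),
        EndTime=now,
        Period=600,
        Statistics=["Maximum"],
    )
    return max((p["Maximum"] for p in stats["Datapoints"]), default=0.0)


if __name__ == "__main__":
    if oldest_message_age() > MAX_AGE_SECONDS:
        print("DLQ message stuck beyond retry horizon -- investigate for poison pill")
```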

When to use

  • The primary path is a serverless or stream-driven consumer publishing to a best-effort bus with transient unavailability.
  • Zero event loss is required, but sub-minute delivery latency is not.
  • Out-of-order delivery is acceptable.
  • The target bus's failure modes are transient, not persistent (schema rejections should be handled differently).

When NOT to use

  • Strict ordering required. DLQ requeue fundamentally breaks ordering.
  • Sub-minute latency SLO on event delivery. Cron-driven drain adds interval-sized delay.
  • Target bus can reject forever. Poison-pill events will loop indefinitely; need a separate discard path.
  • Primary path is sync request/response. DLQ+cron is for async; sync callers can't wait for a cron cycle.

Seen in

  • sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication. Canonical worked example. The Zalando Payments Order Store relay publishes to Nakadi; in-process Lambda retries with exponential backoff handle transient failures, an attached SQS DLQ stores events that exhaust retries, and a Kubernetes CronJob runs the same Python publication code on an interval, draining the DLQ until Nakadi accepts. The post explicitly names the same-code-on-two-substrates property and acknowledges the resulting out-of-order delivery as an accepted trade-off.