
PATTERN

SQS DLQ + cron requeue

Definition

SQS DLQ + cron requeue is a two-layer retry pattern for best-effort event publication: an AWS Lambda (or equivalent consumer) first performs in-process retries with exponential backoff; if the bus is still refusing the event once the retry ladder is exhausted, the event is parked in an SQS dead-letter queue; a Kubernetes CronJob (or equivalent long-running worker) periodically drains the DLQ, re-running the same publication code until the target bus accepts.

The effect: zero event loss for publication targets that can have transient outages or rate-limit episodes, bounded latency impact on the happy path (retries don't block the stream), and durable persistence of stuck events via SQS.

Shape

Primary relay (Lambda / consumer)
   │ publish → target bus
   │ retry (exp. backoff)
   │ retry...
   │ retry...
   │ retries exhausted
   ▼
SQS dead-letter queue  ◀──── durable fallback storage
   │ (idle while bus is unhealthy)
   ▼
Kubernetes CronJob (every N min)
   │ read message
   │ publish → target bus
   │ on success → delete from SQS
   │ on failure → leave in SQS for next cron tick
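
A minimal sketch of the primary relay half, assuming the publication logic lives in one shared module (called publisher.py here; every name, retry count, and delay is illustrative, not from the post):

```python
# publisher.py -- hypothetical shared module; publish_to_bus is a stand-in
# for the real client (in Zalando's relay, an HTTP publish to Nakadi).
import random
import time


class BusUnavailableError(Exception):
    """The target bus refused the event (timeout, 5xx, rate limit, ...)."""


def publish_to_bus(event: dict) -> None:
    """Placeholder: replace with the real bus client call."""
    raise BusUnavailableError("bus is down")


def publish_with_retries(event: dict, max_attempts: int = 4, base_delay: float = 0.5) -> None:
    """In-process retry ladder: exponential backoff, then raise.

    The final raise is the hand-off point: a failed Lambda invocation
    routes the event to the attached SQS dead-letter queue.
    """
    for attempt in range(max_attempts):
        try:
            publish_to_bus(event)
            return
        except BusUnavailableError:
            if attempt == max_attempts - 1:
                raise  # ladder exhausted -> let the runtime park the event in SQS
            # exponential backoff with a little jitter: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def lambda_handler(event, context):
    """Fast path: invoked per DynamoDB Streams batch."""
    for record in event.get("Records", []):
        publish_with_retries(record)
```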

Why two layers

The in-process retry loop handles brief transient failures — momentary bus latency spikes, jittery network, leader elections — without paying the DLQ-round-trip cost or losing place in the stream. The DLQ + cron handles longer outages — minutes to hours of target-bus unhappiness — where continuing to retry in-process would either block the stream (and grow iterator age on DynamoDB Streams) or drop the event entirely.

The cron interval becomes the knob: shorter interval = faster recovery after the bus heals; longer interval = lower steady-state cost. Zalando does not cite a specific interval in the post but describes the mechanism explicitly.
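
The drain side, as a hedged boto3 sketch (queue URL, batch size, and timeouts are assumptions): receive a batch, re-run the same publish function, delete on success, and otherwise let the visibility timeout lapse so the next tick retries.

```python
# cron_drain.py -- hypothetical CronJob entrypoint; imports the same
# publish function the Lambda uses (see the publisher.py sketch above).
import json

import boto3

from publisher import BusUnavailableError, publish_with_retries

QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/relay-dlq"  # placeholder


def drain_dlq() -> None:
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=5,       # long poll
            VisibilityTimeout=120,   # must exceed worst-case publish time (see Trade-offs)
        )
        messages = resp.get("Messages", [])
        if not messages:
            return  # drained, or everything in flight failed -> exit until next tick
        for msg in messages:
            try:
                publish_with_retries(json.loads(msg["Body"]))
            except BusUnavailableError:
                continue  # leave in SQS; it reappears after the visibility timeout
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    drain_dlq()
```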

The "same code on two substrates" property

A subtle win: the CronJob runs the same publication code as the Lambda. From Zalando's post:

"In order to retry sending the events in the queue in intervals we created a Kubernetes cronjob. The cronjob simply runs the Python code that is also run by the AWS Lambda and tries to publish the events to Nakadi again." (Source: sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication)

This matters because:

  • No drift between fast path and slow path. A bug fixed in the Lambda is automatically fixed in the cron drain: same code, two triggers.
  • Single schema knowledge. The event assembly logic lives in one place.
  • Testing surface is smaller. Unit-test the publish function once and both paths are covered (see the test sketch below).

This design choice is one of the pattern's load-bearing advantages — a common anti-pattern is to hand-roll the DLQ drain in a different runtime or language and let the two paths diverge.
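
One way to see the smaller testing surface: a single pytest sketch against the hypothetical publisher module above covers the retry ladder that both triggers share.

```python
# test_publisher.py -- illustrative only; exercises the shared retry ladder.
import pytest

import publisher


def test_retries_then_hands_off_to_dlq(monkeypatch):
    calls = []

    def always_down(event):
        calls.append(event)
        raise publisher.BusUnavailableError("still down")

    monkeypatch.setattr(publisher, "publish_to_bus", always_down)
    monkeypatch.setattr(publisher.time, "sleep", lambda _s: None)  # skip backoff waits

    with pytest.raises(publisher.BusUnavailableError):
        publisher.publish_with_retries({"id": 1}, max_attempts=3)

    assert len(calls) == 3  # full ladder ran before the hand-off raise
```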

Built-in DLQ coupling

AWS Lambda's event-source integration provides a DLQ slot out of the box: "when creating a new AWS Lambda function it already comes with an AWS SQS queue attached as a dead letter queue." No hand-wiring of the durability layer — the primitive is built into the runtime's failure semantics.
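
For illustration only: the post says the queue comes attached, but where the slot is wired explicitly, the attachment for asynchronous invocation failures looks like the boto3 sketch below (function name and queue ARN are placeholders; stream event-source mappings use an on-failure destination instead).

```python
# Hedged sketch: point a function's dead-letter config at an SQS queue.
import boto3

lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="order-store-relay",  # placeholder function name
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:eu-central-1:123456789012:relay-dlq",  # placeholder ARN
    },
)
```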

Trade-offs

  • At-least-once, with duplication on both layers. The Lambda's retries may succeed after an ack was lost, producing duplicates. The cron drain re-publishes events that may have already been accepted on an earlier pass. Consumers must be idempotent. See concepts/at-least-once-delivery.
  • Ordering is broken. DLQ-drained events arrive out of order relative to events that succeeded first try. Per-key consumers needing strict order must reconstruct it from payload sequence numbers.
  • Cron interval bounds recovery latency. A stuck event sits in SQS until the next cron tick + successful publish — N minutes minimum. Higher cron frequency = better recovery, more cost.
  • SQS visibility-timeout sizing. The cron worker must complete publication within the visibility timeout, or the message re-delivers while in flight → extra duplicates.
  • Parallel drain can amplify load. If the cron worker is multi-replica, ensure deduplication so the bus isn't hit with concurrent re-publications of the same event.
  • Poison-pill events sit forever. An event the bus will never accept (bad schema, oversized payload) loops in the DLQ until humans drain it. A secondary monitor on DLQ depth + per-message age is advisable (a minimal sketch follows this list).
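
A minimal monitor sketch for the poison-pill bullet above, using the ApproximateAgeOfOldestMessage metric SQS publishes to CloudWatch (queue name and threshold are assumptions):

```python
# dlq_monitor.py -- hypothetical; flag messages stuck beyond the retry horizon.
import datetime

import boto3

QUEUE_NAME = "relay-dlq"     # placeholder queue name
MAX_AGE_SECONDS = 6 * 3600   # older than this -> likely a poison pill


def oldest_message_age() -> float:
    """Max ApproximateAgeOfOldestMessage over the last 10 minutes, in seconds."""
    cw = boto3.client("cloudwatch")
    now = datetime.datetime.now(datetime.timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - datetime.timedelta(minutes=10),
        EndTime=now,
        Period=600,
        Statistics=["Maximum"],
    )
    return max((p["Maximum"] for p in stats["Datapoints"]), default=0.0)


if __name__ == "__main__":
    if oldest_message_age() > MAX_AGE_SECONDS:
        print("DLQ message stuck beyond retry horizon -- investigate for poison pill")
```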

When to use

  • The primary path is a serverless or stream-driven consumer publishing to a best-effort bus with transient unavailability.
  • Zero event loss is required, but sub-minute delivery latency is not.
  • Out-of-order delivery is acceptable.
  • The target bus's failure modes are transient, not persistent (schema rejections should be handled differently).

When NOT to use

  • Strict ordering required. DLQ requeue fundamentally breaks ordering.
  • Sub-minute latency SLO on event delivery. Cron-driven drain adds interval-sized delay.
  • Target bus can reject forever. Poison-pill events will loop indefinitely; need a separate discard path.
  • Primary path is sync request/response. DLQ+cron is for async; sync callers can't wait for a cron cycle.

Seen in

  • sources/2022-02-02-zalando-utilizing-amazon-dynamodb-and-aws-lambda-for-asynchronous-event-publication. Canonical worked example. The Zalando Payments Order Store relay publishes to Nakadi; in-process Lambda retries with exponential backoff handle transient failures, an attached SQS DLQ stores events that exhaust retries, and a Kubernetes CronJob runs the same Python publication code on an interval, draining the DLQ until Nakadi accepts. The post explicitly names the same-code-on-two-substrates property and acknowledges the resulting out-of-order delivery as an accepted trade-off.