PATTERN Cited by 1 source

Centralized DLQ

Intent

Reduce the per-account polling cost of individual dead-letter queues in a multi-account serverless architecture by routing all failure events to a single centralized recovery queue.

Problem

In an account-per-tenant model with thousands of accounts, each account's individual DLQ must be polled continuously, and every poll is billed even when the queue is empty. At scale, the aggregate polling cost of idle DLQs reintroduces the same "idle costs at scale" problem that scale-to-zero was meant to solve.

"Polling individual Dead Letter Queues (DLQs) in every account reintroduced the same polling cost issues." (Source: sources/2026-06-29-aws-lessons-learned-from-scaling-to-1-million-lambda-functions)
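
To make the scaling effect concrete, the sketch below estimates how many empty receive requests continuous polling generates per month. It is a back-of-the-envelope model, not a figure from the source: the 20-second long-poll wait, the single poller per queue, and the price per million requests are all illustrative assumptions that should be replaced with real values.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, assumed for simplicity


def monthly_poll_requests(queues: int, wait_seconds: int = 20,
                          pollers_per_queue: int = 1) -> int:
    """Receive requests per month if every queue is polled back to back.

    wait_seconds and pollers_per_queue are assumptions; an idle queue
    answers each long poll only after the full wait elapses.
    """
    polls_per_poller = SECONDS_PER_MONTH // wait_seconds
    return queues * pollers_per_queue * polls_per_poller


def monthly_poll_cost(queues: int, price_per_million: float = 0.40,
                      **poll_options: int) -> float:
    """Cost of those requests; price_per_million is a placeholder rate."""
    requests = monthly_poll_requests(queues, **poll_options)
    return requests / 1_000_000 * price_per_million


if __name__ == "__main__":
    for n in (1, 1_000, 10_000):
        print(n, monthly_poll_requests(n), round(monthly_poll_cost(n), 2))
```

Under these assumptions the request count grows linearly with the number of accounts, while a single central queue stays at the one-queue figure no matter how many tenants exist.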

Solution

Route failures from all tenant accounts to a single centralized DLQ. Process failures in one place rather than polling thousands of individual queues.

Structure

  1. Failure event sources: Lambda asynchronous invocation failures and EventBridge delivery failures.
  2. Central routing: Failed events are forwarded (via EventBridge cross-account rules or Lambda destinations) to a centralized recovery queue in a management/operations account.
  3. Recovery processor: A single service processes the central DLQ, correlating events back to tenant accounts via embedded account IDs.
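
The correlation step in the recovery processor can be sketched as follows. This is a minimal illustration, not the source's implementation: it assumes each message body is JSON that either carries a top-level `account` field (as EventBridge events do) or a `requestContext.functionArn` whose fifth ARN segment is the account ID (as in Lambda destination records). The function names `tenant_account_id` and `group_by_tenant` are hypothetical.

```python
import json
from collections import defaultdict


def tenant_account_id(event: dict) -> str:
    """Return the 12-digit account ID that a failure event belongs to."""
    # Assumed shape 1: EventBridge-style event with a top-level "account".
    account = event.get("account")
    if account is None:
        # Assumed shape 2: Lambda destination record with the function ARN,
        # arn:partition:service:region:account-id:resource
        arn = event.get("requestContext", {}).get("functionArn", "")
        parts = arn.split(":")
        account = parts[4] if len(parts) > 4 else None
    if not (isinstance(account, str) and len(account) == 12 and account.isdigit()):
        raise ValueError("failure event carries no usable account ID")
    return account


def group_by_tenant(message_bodies):
    """Split raw message bodies into per-account lists plus unroutable ones."""
    grouped = defaultdict(list)
    unroutable = []
    for body in message_bodies:
        try:
            event = json.loads(body)
            grouped[tenant_account_id(event)].append(event)
        except (ValueError, AttributeError):
            # Malformed JSON or an unexpected shape: never guess a tenant.
            unroutable.append(body)
    return dict(grouped), unroutable
```

Messages that cannot be attributed to an account are set aside rather than assigned a default tenant, so a malformed event never ends up in another tenant's recovery path.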

Isolation trade-off

This pattern deliberately weakens tenant isolation at the failure-recovery layer:

  • Events from different tenants converge in a single queue.
  • The tenant boundary is preserved via the AWS account ID embedded in the event payload.
  • Requires "extreme discipline" to ensure the recovery processor doesn't accidentally cross tenant boundaries.

The SQS usage model moves from a silo (one queue per tenant) to a bridged model where the account ID serves as the logical tenant identifier.
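
One way to turn that discipline into code is to make every recovery action carry an explicit tenant scope and fail closed on a mismatch. The sketch below is an assumption about how such a guard could look, not something described in the source; `TenantScope`, `TenantBoundaryError`, and the `send` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


class TenantBoundaryError(Exception):
    """Raised when a recovery action would touch another tenant's account."""


@dataclass(frozen=True)
class TenantScope:
    account_id: str  # the only account this recovery action may touch

    def redrive(self, event_account_id: str, payload: dict,
                send: Callable[[str, dict], None]) -> None:
        """Forward payload to the scoped account, or refuse on a mismatch."""
        if event_account_id != self.account_id:
            raise TenantBoundaryError(
                f"event for {event_account_id} used in scope {self.account_id}"
            )
        # send is a caller-supplied function (hypothetical), for example one
        # that re-invokes the tenant's function or re-publishes the event.
        send(self.account_id, payload)
```

Because the scope is frozen and the check runs before `send`, an event from one account cannot be redriven under another account's scope without raising.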

When to use

  • Multi-account serverless architectures at scale (hundreds of accounts or more).
  • When per-account DLQ polling costs exceed the value of per-tenant failure isolation.
  • When failure rates are low (most DLQs are empty most of the time).

When NOT to use

  • Compliance requirements mandate strict physical data isolation (e.g., regulated industries).
  • Failure rates are high enough that the central DLQ becomes a bottleneck.
  • Tenant data in the event payload requires encryption-at-rest with per-tenant keys.
