PATTERN Cited by 1 source

Centralized DLQ¶

Intent¶

Reduce the per-account polling cost of individual dead-letter queues in a multi-account serverless architecture by routing all failure events to a single centralized recovery queue.

Problem¶

In an account-per-tenant model with thousands of accounts, each account's individual DLQ requires continuous polling, which generates cost even when the queue is empty. At scale, the aggregate polling cost of idle DLQs reintroduces the same "idle costs at scale" problem that scale-to-zero was meant to solve.
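
To make the scaling argument concrete, the sketch below counts the receive requests needed to keep one long-poll loop running against every queue, even when every queue is empty. It is a back-of-the-envelope model, not the source's accounting: the 20-second wait time, the 30-day month, the 5,000-account fleet, and the price per million requests are all illustrative assumptions that should be replaced with real values.

```python
def idle_poll_requests_per_month(queues: int, wait_seconds: int = 20, days: int = 30) -> int:
    """Receive calls issued when each queue is long-polled back to back.

    Assumes one poller per queue and that every poll returns empty after
    waiting the full `wait_seconds` (the idle case described above).
    """
    seconds_per_month = days * 24 * 60 * 60
    polls_per_queue = seconds_per_month // wait_seconds
    return queues * polls_per_queue


def monthly_cost(requests: int, price_per_million: float) -> float:
    """Request cost for a given price per million requests (an input, not a quoted rate)."""
    return requests / 1_000_000 * price_per_million


# Hypothetical fleet: 5,000 tenant accounts, one DLQ each, versus one central queue.
per_account_requests = idle_poll_requests_per_month(queues=5_000)  # 648_000_000
central_requests = idle_poll_requests_per_month(queues=1)          # 129_600

ASSUMED_PRICE_PER_MILLION = 0.40  # illustrative figure only; check current pricing
per_account_cost = monthly_cost(per_account_requests, ASSUMED_PRICE_PER_MILLION)
central_cost = monthly_cost(central_requests, ASSUMED_PRICE_PER_MILLION)
```

Under these assumptions the idle request count grows linearly with the number of accounts, while the centralized queue stays at the cost of a single poller regardless of fleet size.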

"Polling individual Dead Letter Queues (DLQs) in every account reintroduced the same polling cost issues." (Source: sources/2026-06-29-aws-lessons-learned-from-scaling-to-1-million-lambda-functions)

Solution¶

Route failures from all tenant accounts to a single centralized DLQ. Process failures in one place rather than polling thousands of individual queues.

Structure¶

Failure event source: Lambda async invocation failures, EventBridge delivery failures.
Central routing: Failed events are forwarded (via EventBridge cross-account rules or Lambda destinations) to a centralized recovery queue in a management/operations account.
Recovery processor: A single service processes the central DLQ, correlating events back to tenant accounts via embedded account IDs.
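
The correlation step performed by the recovery processor can be sketched as follows. This is a minimal illustration, not the source's implementation. It assumes each message body is JSON in one of two shapes: an EventBridge event envelope, which carries a top-level `account` field, or a Lambda on-failure destination record, whose `requestContext.functionArn` contains the account ID. The function names `tenant_account_id` and `group_by_tenant` and the sample account IDs are hypothetical.

```python
import json
from collections import defaultdict


def tenant_account_id(record: dict) -> str:
    """Extract the 12-digit AWS account ID that identifies the tenant."""
    account = record.get("account")  # EventBridge envelope field
    if account is None:
        # Lambda destination record: arn:aws:lambda:<region>:<account>:function:<name>
        arn = record.get("requestContext", {}).get("functionArn", "")
        parts = arn.split(":")
        account = parts[4] if len(parts) > 4 else None
    valid = (
        isinstance(account, str)
        and len(account) == 12
        and account.isascii()
        and account.isdigit()
    )
    if not valid:
        # Fail closed: a record without a trustworthy tenant ID is never processed.
        raise ValueError("failure record has no valid account ID")
    return account


def group_by_tenant(message_bodies: list[str]) -> dict[str, list[dict]]:
    """Group raw queue message bodies by the tenant account they belong to."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for body in message_bodies:
        record = json.loads(body)
        grouped[tenant_account_id(record)].append(record)
    return dict(grouped)


# Example with made-up account IDs and payloads.
bodies = [
    json.dumps({"account": "111122223333", "detail-type": "demo.failure", "detail": {}}),
    json.dumps({
        "requestContext": {
            "functionArn": "arn:aws:lambda:eu-west-1:444455556666:function:demo"
        },
        "requestPayload": {},
    }),
]
by_tenant = group_by_tenant(bodies)
```

Rejecting records with a missing or malformed account ID, rather than guessing, keeps an unattributable failure from being replayed against the wrong tenant.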

Isolation trade-off¶

This pattern deliberately weakens tenant isolation at the failure-recovery layer:
- Events from different tenants converge in a single queue.
- The tenant boundary is preserved via the AWS account ID embedded in the event payload.
- Requires "extreme discipline" to ensure the recovery processor doesn't accidentally cross tenant boundaries.

The SQS usage model moves from a silo (one queue per tenant) to a bridged model where the account ID serves as the logical tenant identifier.
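
One way to turn that discipline into something the code enforces is to bind every recovery operation to a single account ID and refuse any record that belongs to another tenant. The sketch below is an assumption about how such a guard could look, not the source's design. `TenantScope`, `CrossTenantAccessError`, and `replay_for_tenant` are hypothetical names, and records are assumed to carry the tenant's account ID under an `account` key and an identifier under `id`.

```python
from dataclasses import dataclass


class CrossTenantAccessError(Exception):
    """Raised when a record is handled under another tenant's scope."""


@dataclass(frozen=True)
class TenantScope:
    """Immutable handle for the one tenant a recovery step may touch."""
    account_id: str

    def require(self, record: dict) -> dict:
        # The embedded account ID is the only tenant boundary left in the
        # shared queue, so every access is checked against it.
        if record.get("account") != self.account_id:
            raise CrossTenantAccessError(
                f"record for account {record.get('account')!r} "
                f"used in scope {self.account_id!r}"
            )
        return record


def replay_for_tenant(scope: TenantScope, records: list[dict]) -> list[str]:
    """Return the IDs of records cleared for replay within one tenant's scope."""
    cleared = []
    for record in records:
        scope.require(record)  # raises before any cross-tenant work happens
        cleared.append(record["id"])
    return cleared


# Example with made-up values.
scope = TenantScope("111122223333")
cleared_ids = replay_for_tenant(scope, [{"account": "111122223333", "id": "evt-1"}])
```

Because the scope is frozen and checked on every record, a batch that mixes tenants fails loudly instead of silently replaying one tenant's event in another tenant's context.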

When to use¶

Multi-account serverless architectures at scale (hundreds of accounts or more).
When per-account DLQ polling costs exceed the value of per-tenant failure isolation.
When failure rates are low (most DLQs are empty most of the time).

When NOT to use¶

Compliance requirements mandate strict physical data isolation (e.g., regulated industries).
Failure rates are high enough that the central DLQ becomes a bottleneck.
Tenant data in the event payload requires encryption-at-rest with per-tenant keys.

Seen in¶

sources/2026-06-29-aws-lessons-learned-from-scaling-to-1-million-lambda-functions: ProGlove's centralized DLQ approach, which replaced per-account polling at the scale of 1M Lambda functions.