PATTERN Cited by 1 source
Centralized DLQ¶
Intent¶
Reduce the per-account polling cost of individual dead-letter queues in a multi-account serverless architecture by routing all failure events to a single centralized recovery queue.
Problem¶
In an account-per-tenant model with thousands of accounts, each account's individual DLQ requires continuous polling — generating cost even when empty. At scale, the aggregate polling cost of idle DLQs reintroduces the same "idle costs at scale" problem that scale-to-zero was meant to solve.
"Polling individual Dead Letter Queues (DLQs) in every account reintroduced the same polling cost issues." (Source: sources/2026-06-29-aws-lessons-learned-from-scaling-to-1-million-lambda-functions)
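To make the idle cost concrete, the arithmetic can be sketched roughly as follows. The request price, the 20-second long-poll wait, and the poller count are illustrative assumptions, not figures from the source; substitute current SQS pricing and your actual poller configuration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def empty_receives_per_month(wait_seconds: int = 20, pollers: int = 1, days: int = 30) -> int:
    """ReceiveMessage calls issued against one idle queue over `days` days.

    A long poll on an empty queue returns after `wait_seconds`, so each
    poller issues one request per wait interval.
    """
    if not 1 <= wait_seconds <= 20:
        raise ValueError("SQS long polling uses wait times of 1 to 20 seconds")
    return (SECONDS_PER_DAY * days // wait_seconds) * pollers


def idle_polling_cost(queues: int, price_per_million: float = 0.40, **poll_kwargs) -> float:
    """Monthly request cost of polling `queues` idle DLQs.

    `price_per_million` is an assumed list price per million requests;
    free-tier allowances are ignored.
    """
    requests = empty_receives_per_month(**poll_kwargs) * queues
    return requests / 1_000_000 * price_per_million


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, "queues:", round(idle_polling_cost(n), 2), "per month")
```

The cost grows linearly with the number of queues and with the number of pollers per queue, while a single central queue pays the idle polling cost only once.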
Solution¶
Route failures from all tenant accounts to a single centralized DLQ. Process failures in one place rather than polling thousands of individual queues.
Structure¶
- Failure event source: Lambda async invocation failures, EventBridge delivery failures.
- Central routing: Failed events are forwarded (via EventBridge cross-account rules or Lambda destinations) to a centralized recovery queue in a management/operations account.
- Recovery processor: A single service processes the central DLQ, correlating events back to tenant accounts via embedded account IDs.
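The correlation step in the recovery processor might look like the minimal sketch below. It assumes each message body is JSON in one of two shapes: an EventBridge event (which carries a top-level `account` field) or a Lambda destination failure record (whose `requestContext.functionArn` contains the account ID). The function names are hypothetical and the source does not describe ProGlove's actual implementation.

```python
import json
from collections import defaultdict


def tenant_account_id(record: dict) -> str:
    """Extract the originating AWS account ID from a failure record."""
    # EventBridge events carry the source account at the top level.
    account = record.get("account")
    if account is None:
        # Lambda destination records carry the function ARN:
        # arn:aws:lambda:<region>:<account-id>:function:<name>
        arn = record.get("requestContext", {}).get("functionArn", "")
        parts = arn.split(":")
        account = parts[4] if len(parts) > 4 else None
    valid = (
        isinstance(account, str)
        and len(account) == 12
        and account.isascii()
        and account.isdigit()
    )
    if not valid:
        # Refuse to guess: an unattributable event must not be processed.
        raise ValueError("record has no valid 12-digit AWS account ID")
    return account


def group_by_tenant(message_bodies):
    """Group raw message bodies from the central DLQ by tenant account ID."""
    grouped = defaultdict(list)
    for body in message_bodies:
        record = json.loads(body)
        grouped[tenant_account_id(record)].append(record)
    return dict(grouped)
```

Rejecting records without a valid account ID, rather than assigning them a default tenant, keeps unattributable failures out of every tenant's recovery path.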
Isolation trade-off¶
This pattern deliberately weakens tenant isolation at the failure-recovery layer:
- Events from different tenants converge in a single queue.
- The tenant boundary is preserved via the AWS account ID embedded in the event payload.
- Requires "extreme discipline" to ensure the recovery processor doesn't accidentally cross tenant boundaries.
The SQS usage model moves from a silo (one queue per tenant) to a bridged model where the account ID serves as the logical tenant identifier.
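One way to encode that discipline is a guard that refuses any recovery action whose target resource lives in a different account than the failed event. The sketch below is an assumption about how such a check could look, with hypothetical names (`TenantScope`, `replay_target`); it relies only on the standard ARN layout `arn:partition:service:region:account-id:resource`.

```python
from dataclasses import dataclass


class TenantBoundaryError(Exception):
    """Raised when a recovery action would touch another tenant's account."""


@dataclass(frozen=True)
class TenantScope:
    """Pins all recovery actions for one event to a single tenant account."""

    account_id: str

    def check_arn(self, arn: str) -> str:
        # Split into at most 6 fields so colons inside the resource part survive.
        parts = arn.split(":", 5)
        if len(parts) < 6 or parts[0] != "arn":
            raise ValueError(f"not an ARN: {arn!r}")
        if parts[4] != self.account_id:
            raise TenantBoundaryError(
                f"event belongs to account {self.account_id}, "
                f"target belongs to {parts[4] or '<none>'}"
            )
        return arn


def replay_target(event_account_id: str, target_arn: str) -> str:
    """Return the target ARN only if it is in the event's own account."""
    return TenantScope(event_account_id).check_arn(target_arn)
```

ARNs without an account field are rejected as well, which errs on the side of blocking a replay rather than risking a cross-tenant write.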
When to use¶
- Multi-account serverless architectures at scale (hundreds of accounts or more).
- When per-account DLQ polling costs exceed the value of per-tenant failure isolation.
- When failure rates are low (most DLQs are empty most of the time).
When NOT to use¶
- Compliance requirements mandate strict physical data isolation (e.g., regulated industries).
- Failure rates are high enough that the central DLQ becomes a bottleneck.
- Tenant data in the event payload requires encryption at rest with per-tenant keys.
Seen in¶
- sources/2026-06-29-aws-lessons-learned-from-scaling-to-1-million-lambda-functions — ProGlove's centralized DLQ approach replacing per-account polling at 1M Lambda function scale.