Lessons learned from scaling to 1 million Lambda functions
Summary
ProGlove, a wearable barcode scanning SaaS company, describes how it scaled a fully serverless, multi-account AWS platform from zero to over one million Lambda functions across thousands of tenant accounts. The article walks through six growth phases and the lessons they revealed about true scale-to-zero economics, quota isolation, self-DDoS from synchronized schedules, observability cost amplification, deployment tooling ceilings (CloudFormation StackSets), and the architectural rethinking required when "idle" resources still cost money at scale. Key insight: at extreme multi-tenant scale, efficiency must scale faster than growth, and operational concerns shift from capacity to per-unit cost control.
Key takeaways
- Account-per-tenant isolation eliminates noisy-neighbor quota exhaustion. Each AWS account gets its own Lambda concurrency limit, API Gateway throttles, and service quotas, so a single tenant's burst cannot cascade across the fleet. (Phase 2)
- Synchronized schedules cause self-DDoS. When thousands of functions use `rate(5 minutes)` aligned to the same clock second, the aggregate burst resembles a coordinated attack on internal APIs. Fix: a standardized library enforcing jitter, randomized offsets, and staggered execution. Rule of thumb: "Never do the same thing at the same time everywhere." (Phase 3)
- Observability costs can exceed compute costs. Forwarding CloudWatch logs and metrics to a third-party platform cost about $3 per account per month, which nearly doubled the total cloud bill across thousands of accounts. After aggressive optimization (priority-based data routing, reduced monitoring for idle accounts), the cost fell to ~$0.70 per account. (Phase 3–4)
- SQS polling is anti-scale-to-zero. Lambda continuously polls SQS even when no messages exist, generating costs at scale. ProGlove removed SQS from the EventBridge → Lambda path, relying on `AsyncEventsDropped` and `ConcurrentExecutions` metrics for safety, plus a centralized DLQ for failure recovery, trading individual queue resilience for fleet-wide cost efficiency. (Phase 4)
- Centralized DLQ introduces an isolation trade-off. Routing failures from all tenants to a single recovery queue requires "extreme discipline" to preserve data isolation, because the tenant boundary is the AWS account ID embedded in the event. ProGlove moved from a silo model to a bridged model here. (Phase 4)
- CloudFormation StackSets hit a ceiling at extreme scale. At 1M+ functions, StackSets produced occasional errors that compounded, plus performance bottlenecks. Rather than building a custom replacement, ProGlove engaged the CloudFormation team to influence the roadmap, and built a Step Functions–based deployment-tracking service to aggregate events and retry failures. (Phase 5)
- Automated account provisioning: <15 min from request to ready. Step Functions orchestrates the full lifecycle: create account → apply SCPs → bootstrap IAM → trigger initial StackSet deployment → ready. Near-zero incremental cost per provisioning run. (Phase 2)
- Mono-repo for 20+ microservices enforces consistency. It provides a single CI/CD chain, uniform security scanning across >1M functions, and coordinated runtime upgrades. The choice was made to reduce governance cost at scale, not for developer convenience. (Phase 6)
- "Almost-zero" is the real floor. Even with aggressive scale-to-zero, monitoring (CloudWatch Alarms) and observability tooling create a non-zero idle cost, which ProGlove optimized to <$1/month per inactive account. (Phase 6)
- Efficiency > capacity. The concluding architectural principle: "The only way to stay ahead is to make sure efficiency scales faster than growth." (Conclusion)
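
The jitter idea behind the self-DDoS fix can be sketched in a few lines. This is a minimal illustration, not ProGlove's library: the names `schedule_offset_seconds` and `spread` are hypothetical, and deriving the offset from a hash of the account ID and function name is an assumption (the source only says the library enforces jitter, randomized offsets, and staggered execution).

```python
import hashlib
from typing import Iterable, List


def schedule_offset_seconds(account_id: str, function_name: str,
                            period_seconds: int = 300) -> int:
    """Map (account, function) to a stable offset in [0, period_seconds).

    Assumption: a hash-derived offset stands in for whatever jitter
    scheme the real library uses.
    """
    key = f"{account_id}:{function_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % period_seconds


def spread(account_ids: Iterable[str], function_name: str,
           period_seconds: int = 300, buckets: int = 5) -> List[int]:
    """Count how many accounts land in each equal slice of the period."""
    counts = [0] * buckets
    width = period_seconds / buckets
    for account_id in account_ids:
        offset = schedule_offset_seconds(account_id, function_name, period_seconds)
        counts[int(offset // width)] += 1
    return counts


# Hypothetical fleet of 1,000 accounts running the same 5-minute job.
accounts = [f"{n:012d}" for n in range(1000)]
counts = spread(accounts, "sync-job")
print(counts)  # one count per 60-second slice of the period
```

Because the offset comes from stable identifiers, each function fires at the same point in every period while the fleet as a whole is spread across it. How the offset is applied (a start delay, a shifted schedule expression) is left open here.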
Architecture highlights
- Microservice structure: 5–15 Lambda functions per service, coordinated by Step Functions, with EventBridge for routing and DynamoDB as the primary store, all packaged in a single CloudFormation stack.
- Deployment: CloudFormation StackSets from central management account → parallel multi-account updates.
- Account factory: Step Functions in management account → Organizations API → SCP application → IAM bootstrap → StackSet target registration.
- Observability: CloudWatch → third-party platform (cross-account forwarding); optimized via priority-based data segregation.
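
The account factory's ordered lifecycle can be modeled as a simple step pipeline. The sketch below is an assumption-laden illustration rather than the actual Step Functions definition: it makes no AWS calls, and `ProvisioningRun`, `run_account_factory`, and the step labels are hypothetical names chosen to mirror the lifecycle listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Step labels mirror the lifecycle in the text; they are not AWS API names.
STEPS = [
    "create_account",
    "apply_scps",
    "bootstrap_iam",
    "trigger_initial_stackset_deployment",
]


@dataclass
class ProvisioningRun:
    tenant: str
    completed: List[str] = field(default_factory=list)
    status: str = "PENDING"


Handler = Callable[[ProvisioningRun], None]


def run_account_factory(tenant: str, handlers: Dict[str, Handler]) -> ProvisioningRun:
    """Run each step in order; stop at the first failure and record where."""
    run = ProvisioningRun(tenant=tenant)
    for step in STEPS:
        try:
            handlers[step](run)
        except Exception:
            run.status = f"FAILED_AT:{step}"
            return run
        run.completed.append(step)
    run.status = "READY"
    return run


def noop(run: ProvisioningRun) -> None:
    """Placeholder handler; a real one would call the relevant AWS service."""
    return None


ok_run = run_account_factory("tenant-a", {step: noop for step in STEPS})
print(ok_run.status, ok_run.completed)
```

Recording the failing step makes a run resumable or retryable from a known point, which is the kind of bookkeeping an orchestrator provides.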
Operational numbers
| Metric | Value |
|---|---|
| Lambda functions in production | >1,000,000 |
| Tenant AWS accounts | thousands |
| Microservices | 20+ (mono-repo) |
| Functions per microservice | 5–15 |
| Account provisioning time | <15 minutes |
| Observability cost (before) | ~$3/account/month |
| Observability cost (after) | ~$0.70/account/month |
| Idle account cost (after) | <$1/month |
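
To see why the per-account figures matter, the short calculation below multiplies them out across a fleet. The per-account costs come from the table; the fleet size of 5,000 accounts is a hypothetical stand-in, since the source only says "thousands". Amounts are kept in integer cents to avoid floating-point rounding.

```python
CENTS_BEFORE = 300  # ~$3.00 per account per month (from the table)
CENTS_AFTER = 70    # ~$0.70 per account per month (from the table)
ASSUMED_ACCOUNTS = 5_000  # hypothetical fleet size, not from the source


def fleet_cost_dollars(accounts: int, cents_per_account: int) -> float:
    """Monthly fleet-wide cost in dollars."""
    return accounts * cents_per_account / 100


def reduction_percent(before_cents: int, after_cents: int) -> float:
    """Relative per-account saving, as a percentage."""
    return 100 * (before_cents - after_cents) / before_cents


before = fleet_cost_dollars(ASSUMED_ACCOUNTS, CENTS_BEFORE)
after = fleet_cost_dollars(ASSUMED_ACCOUNTS, CENTS_AFTER)

print(f"before: ${before:,.0f}/month")  # before: $15,000/month
print(f"after:  ${after:,.0f}/month")   # after:  $3,500/month
print(f"reduction: {reduction_percent(CENTS_BEFORE, CENTS_AFTER):.1f}%")  # reduction: 76.7%
```

The percentage is independent of fleet size, while the absolute saving grows linearly with the number of accounts.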
Caveats
- The article is written from a customer (ProGlove) perspective, not an AWS internal architecture post. It represents a case study of serverless at scale, not Lambda's own design.
- Specific optimizations (removing SQS, centralized DLQ) trade resilience for cost and are not universally applicable.
- Numbers reflect ProGlove's specific workload (IoT barcode scanning, bursty event-driven); YMMV for continuous-compute workloads.
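
The isolation discipline that a centralized DLQ demands can be illustrated with a small partitioning step. This is a sketch under assumptions, not ProGlove's recovery code: `partition_by_tenant` is a hypothetical helper, and it relies on the top-level `account` field that EventBridge events carry (a 12-digit AWS account ID) as the tenant boundary.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]


def is_account_id(value: Any) -> bool:
    """AWS account IDs are 12-digit strings."""
    return (
        isinstance(value, str)
        and len(value) == 12
        and value.isascii()
        and value.isdigit()
    )


def partition_by_tenant(failed_events: List[Event]) -> Tuple[Dict[str, List[Event]], List[Event]]:
    """Group failed events by account ID; set aside any without a valid one.

    Events with a missing or malformed account ID are never guessed into a
    tenant bucket, so one tenant's data cannot be replayed into another's.
    """
    by_account: Dict[str, List[Event]] = defaultdict(list)
    rejected: List[Event] = []
    for event in failed_events:
        account_id = event.get("account")
        if is_account_id(account_id):
            by_account[account_id].append(event)
        else:
            rejected.append(event)
    return dict(by_account), rejected


# Hypothetical events pulled from the shared recovery queue.
events = [
    {"account": "111111111111", "detail": {"scan": "a"}},
    {"account": "222222222222", "detail": {"scan": "b"}},
    {"account": "111111111111", "detail": {"scan": "c"}},
    {"detail": {"scan": "orphan"}},
]
grouped, rejected = partition_by_tenant(events)
print({account: len(items) for account, items in grouped.items()}, len(rejected))
```

Rejecting unattributable events instead of assigning them a default tenant is the conservative choice when a single queue holds failures from every account.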
Source
- Original: https://aws.amazon.com/blogs/architecture/lessons-learned-from-scaling-to-1-million-lambda-functions/
- Raw markdown: `raw/aws/2026-06-29-lessons-learned-from-scaling-to-1-million-lambda-functions-fcef8b74.md`
Related
- sources/2026-02-25-aws-6000-accounts-three-people-one-platform: same company (ProGlove), earlier post focused on multi-account architecture
- concepts/scale-to-zero: central thesis of the article
- concepts/noisy-neighbor: quota isolation eliminates this failure mode
- concepts/thundering-herd: self-DDoS is a thundering-herd variant
- patterns/account-per-tenant: the foundational isolation pattern
- patterns/request-scattering: solution to synchronized-schedule DDoS
- patterns/centralized-dlq: cost-optimized failure recovery