AWS 2026-06-29

Lessons learned from scaling to 1 million Lambda functions

Summary

ProGlove, a wearable barcode scanning SaaS company, shares their journey scaling a fully serverless, multi-account AWS platform from zero to over one million Lambda functions across thousands of tenant accounts. The article covers six growth phases revealing lessons about true scale-to-zero economics, quota isolation, self-DDoS from synchronized schedules, observability cost amplification, deployment tooling ceilings (CloudFormation StackSets), and the architectural rethinking required when "idle" resources still cost money at scale. Key insight: at extreme multi-tenant scale, efficiency must scale faster than growth — operational concerns shift from capacity to per-unit cost control.

Key takeaways

Account-per-tenant isolation eliminates noisy-neighbor quota exhaustion. Each AWS account gets its own Lambda concurrency limit, API Gateway throttles, and service quotas — a single tenant's burst cannot cascade across the fleet. (Phase 2)
Synchronized schedules cause self-DDoS. When thousands of functions use rate(5 minutes) aligned to the same clock second, the aggregate burst resembles a coordinated attack on internal APIs. Fix: a standardized library enforcing jitter, randomized offsets, and staggered execution. Rule of thumb: "Never do the same thing at the same time everywhere." (Phase 3)
Observability costs can exceed compute costs. Forwarding CloudWatch logs and metrics to a third-party platform cost about $3/account/month, which nearly doubled the total cloud bill across thousands of accounts. After aggressive optimization (priority-based data routing, reduced monitoring for idle accounts), the cost fell to ~$0.70/account/month. (Phase 3–4)
SQS polling is anti-scale-to-zero. Lambda polls SQS continuously even when a queue is empty, so idle queues still generate request charges that add up across thousands of accounts. ProGlove removed SQS from the EventBridge → Lambda path, relying on the AsyncEventsDropped and ConcurrentExecutions metrics for safety plus a centralized DLQ for failure recovery. The trade: per-queue resilience for fleet-wide cost efficiency. (Phase 4)
Centralized DLQ introduces an isolation trade-off. Routing failures from all tenants to a single recovery queue requires "extreme discipline" to preserve data isolation, since the tenant boundary there is the AWS account ID embedded in the event. For failure handling, this moves the platform from a silo model to a bridge model. (Phase 4)
CloudFormation StackSets hit a ceiling at extreme scale. At 1M+ functions, StackSets produced occasional errors that compounded, plus performance bottlenecks. Rather than building a custom replacement, ProGlove engaged the CloudFormation team to influence the roadmap, and built a Step Functions–based deployment-tracking service to aggregate events and retry failures. (Phase 5)
Automated account provisioning: <15 min from request to ready. Step Functions orchestrates the full lifecycle: create account → apply SCPs → bootstrap IAM → trigger initial StackSet deployment → ready. Near-zero incremental cost per provisioning run. (Phase 2)
Mono-repo for 20+ microservices enforces consistency. Single CI/CD chain, uniform security scanning across >1M functions, coordinated runtime upgrades. The choice was made to reduce governance cost at scale, not for developer convenience. (Phase 6)
"Almost-zero" is the real floor. Even with aggressive scale-to-zero, monitoring (CloudWatch Alarms) and observability tooling create a non-zero idle cost. Optimized to <$1/month per inactive account. (Phase 6)
Efficiency > capacity. The concluding architectural principle: "The only way to stay ahead is to make sure efficiency scales faster than growth." (Conclusion)
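
The synchronized-schedule takeaway can be illustrated with a minimal sketch. It is not ProGlove's library: it swaps in one simple technique, a deterministic per-account offset derived from a hash, and the function names, the task name `sync-config`, and the 6,000-account fleet are all hypothetical.

```python
import hashlib

PERIOD_SECONDS = 300  # matches the rate(5 minutes) schedule mentioned above


def schedule_offset(account_id: str, task: str, period: int = PERIOD_SECONDS) -> int:
    """Return a stable offset in [0, period) for one account/task pair."""
    digest = hashlib.sha256(f"{account_id}:{task}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period


def peak_per_second(account_ids, task: str, period: int = PERIOD_SECONDS) -> int:
    """Largest number of accounts that fire in the same second of a period."""
    buckets = {}
    for account_id in account_ids:
        offset = schedule_offset(account_id, task, period)
        buckets[offset] = buckets.get(offset, 0) + 1
    return max(buckets.values())


# Hypothetical fleet of 6,000 twelve-digit account IDs (an assumption).
fleet = [f"{n:012d}" for n in range(6000)]
aligned_peak = len(fleet)  # without offsets, every account fires together
staggered_peak = peak_per_second(fleet, "sync-config")
print(aligned_peak, staggered_peak)
```

Hashing the account ID keeps each account's offset stable across deployments, so the load on shared internal APIs stays spread out without any coordination between accounts. How the offset is applied (for example a delayed invocation or a shifted cron expression) is left open here.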

Architecture highlights

Microservice structure: each service has 5–15 Lambda functions coordinated by Step Functions, with EventBridge for routing and DynamoDB as the primary store, all packaged in a single CloudFormation stack.
Deployment: CloudFormation StackSets from central management account → parallel multi-account updates.
Account factory: Step Functions in management account → Organizations API → SCP application → IAM bootstrap → StackSet target registration.
Observability: CloudWatch → third-party platform (cross-account forwarding); optimized via priority-based data segregation.
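
The account factory flow above (create account, apply SCPs, bootstrap IAM, register the StackSet target) could be expressed as a linear Step Functions state machine. The sketch below builds such a definition in Amazon States Language. It assumes each step is a Lambda task, which the article does not confirm; the state names, the placeholder function ARNs, and the retry settings are hypothetical.

```python
import json

# Step names are illustrative; the source only names the stages informally.
STEPS = ["CreateAccount", "ApplySCPs", "BootstrapIAM", "RegisterStackSetTarget"]


def placeholder_arn(step: str) -> str:
    """Hypothetical Lambda ARN for a step (account ID and region are made up)."""
    return f"arn:aws:lambda:eu-west-1:111111111111:function:factory-{step}"


def build_chain(steps, arn_for=placeholder_arn) -> dict:
    """Build a linear state machine: each Task retries, then hands off to the next."""
    states = {}
    for index, name in enumerate(steps):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": arn_for(name), "Payload.$": "$"},
            "OutputPath": "$.Payload",
            # Assumed retry policy: three attempts with exponential backoff.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
        }
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"Comment": "Account factory sketch", "StartAt": steps[0], "States": states}


definition = build_chain(STEPS)
print(json.dumps(definition, indent=2))
```

Keeping the flow linear with per-step retries means a transient failure in one stage does not restart the whole provisioning run.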

Operational numbers

Metric	Value
Lambda functions in production	>1,000,000
Tenant AWS accounts	thousands
Microservices	20+ (mono-repo)
Functions per microservice	5–15
Account provisioning time	<15 minutes
Observability cost (before)	~$3/account/month
Observability cost (after)	~$0.70/account/month
Idle account cost (after)	<$1/month
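
As a rough consistency check on the table, the sketch below derives how many accounts the function count implies and what the observability optimization saves. It assumes every account runs each of 20 microservices exactly once, and the 6,000-account fleet is borrowed from the title of the related earlier post; neither assumption is stated in the article, and the table's ">" and "+" figures make these estimates only.

```python
import math


def account_range(total_functions: int, services: int, funcs_per_service: tuple) -> tuple:
    """Fewest and most accounts implied if every account runs every service once."""
    low, high = funcs_per_service
    fewest = math.ceil(total_functions / (services * high))
    most = math.ceil(total_functions / (services * low))
    return fewest, most


def observability_savings(accounts: int, before_cents: int = 300, after_cents: int = 70) -> tuple:
    """Monthly savings in dollars and the fractional cost reduction."""
    saved_cents = (before_cents - after_cents) * accounts
    reduction = (before_cents - after_cents) / before_cents
    return saved_cents / 100, reduction


fewest, most = account_range(1_000_000, 20, (5, 15))
saved, reduction = observability_savings(6000)  # 6,000 accounts is an assumption
print(f"accounts implied: {fewest} to {most}")
print(f"monthly savings: ${saved:,.0f} ({reduction:.0%} lower)")
```

At 100 to 300 functions per account, one million functions implies roughly 3,300 to 10,000 accounts, which fits the "thousands" in the table. Dropping from $3.00 to $0.70 per account is a reduction of about 77%.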

Caveats

The article is written from a customer (ProGlove) perspective, not an AWS internal architecture post. It represents a case study of serverless at scale, not Lambda's own design.
Specific optimizations (removing SQS, centralized DLQ) trade resilience for cost — not universally applicable.
Numbers reflect ProGlove's specific workload (IoT barcode scanning, bursty event-driven); YMMV for continuous-compute workloads.

Source

sources/2026-02-25-aws-6000-accounts-three-people-one-platform — same company (ProGlove), earlier post focused on multi-account architecture
concepts/scale-to-zero — central thesis of the article
concepts/noisy-neighbor — quota isolation eliminates this failure mode
concepts/thundering-herd — self-DDoS is a thundering-herd variant
patterns/account-per-tenant — the foundational isolation pattern
patterns/request-scattering — solution to synchronized-schedule DDoS
patterns/centralized-dlq — cost-optimized failure recovery