
AWS 2026-06-29


Lessons learned from scaling to 1 million Lambda functions

Summary

ProGlove, a wearable barcode scanning SaaS company, describes how it scaled a fully serverless, multi-account AWS platform from zero to over one million Lambda functions across thousands of tenant accounts. The article walks through six growth phases and the lessons each one taught: true scale-to-zero economics, quota isolation, self-DDoS from synchronized schedules, observability cost amplification, deployment tooling ceilings (CloudFormation StackSets), and the architectural rethinking required when "idle" resources still cost money at scale. Key insight: at extreme multi-tenant scale, efficiency must scale faster than growth, and the main operational concern shifts from capacity to per-unit cost control.

Key takeaways

  1. Account-per-tenant isolation eliminates noisy-neighbor quota exhaustion. Each AWS account gets its own Lambda concurrency limit, API Gateway throttles, and service quotas — a single tenant's burst cannot cascade across the fleet. (Phase 2)

  2. Synchronized schedules cause self-DDoS. When thousands of functions use rate(5 minutes) aligned to the same clock second, the aggregate burst resembles a coordinated attack on internal APIs. Fix: a standardized library enforcing jitter, randomized offsets, and staggered execution. Rule of thumb: "Never do the same thing at the same time everywhere." (Phase 3)

  3. Observability costs can exceed compute costs. Forwarding CloudWatch logs and metrics to a third-party platform cost about $3 per account per month, which nearly doubled the total cloud bill once the fleet reached thousands of accounts. After aggressive optimization (priority-based data routing, reduced monitoring for idle accounts), the cost fell to ~$0.70 per account per month. (Phase 3–4)

  4. SQS polling is anti-scale-to-zero. Lambda continuously polls SQS even when no messages exist, generating costs at scale. ProGlove removed SQS from the EventBridge → Lambda path, relying on AsyncEventsDropped and ConcurrentExecutions metrics for safety, plus a centralized DLQ for failure recovery — trading individual queue resilience for fleet-wide cost efficiency. (Phase 4)

  5. Centralized DLQ introduces an isolation trade-off. Routing failures from all tenants to a single recovery queue requires "extreme discipline" to preserve data isolation: the only tenant boundary left is the AWS account ID embedded in the event. For failure handling, this moved the platform from a silo model to a bridged model. (Phase 4)

  6. CloudFormation StackSets hit a ceiling at extreme scale. At 1M+ functions, StackSets produced occasional errors that compounded, plus performance bottlenecks. Rather than building a custom replacement, ProGlove engaged the CloudFormation team to influence the roadmap, and built a Step Functions–based deployment-tracking service to aggregate events and retry failures. (Phase 5)

  7. Automated account provisioning: <15 min from request to ready. Step Functions orchestrates the full lifecycle: create account → apply SCPs → bootstrap IAM → trigger initial StackSet deployment → ready. Near-zero incremental cost per provisioning run. (Phase 2)

  8. Mono-repo for 20+ microservices enforces consistency. Single CI/CD chain, uniform security scanning across >1M functions, coordinated runtime upgrades. The choice was made to reduce governance cost at scale, not for developer convenience. (Phase 6)

  9. "Almost-zero" is the real floor. Even with aggressive scale-to-zero, monitoring (CloudWatch Alarms) and observability tooling create a non-zero idle cost. Optimized to <$1/month per inactive account. (Phase 6)

  10. Efficiency > capacity. The concluding architectural principle: "The only way to stay ahead is to make sure efficiency scales faster than growth." (Conclusion)

Architecture highlights

  • Microservice structure: 5–15 Lambda functions per service, coordinated by Step Functions, EventBridge for routing, DynamoDB as primary store, packaged in a single CloudFormation stack.
  • Deployment: CloudFormation StackSets from central management account → parallel multi-account updates.
  • Account factory: Step Functions in management account → Organizations API → SCP application → IAM bootstrap → StackSet target registration.
  • Observability: CloudWatch → third-party platform (cross-account forwarding); optimized via priority-based data segregation.
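
The article does not detail how priority-based data segregation works, so the following is only a sketch of one possible rule set. The record shape, the level names, and the rule that idle accounts forward errors only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    FORWARD = "forward"        # send to the third-party platform
    KEEP_LOCAL = "keep_local"  # leave in CloudWatch only


@dataclass(frozen=True)
class LogRecord:
    account_id: str
    level: str
    message: str


# Assumed priority tiers; the article does not list the real ones.
ALWAYS_FORWARD = {"ERROR", "CRITICAL"}
FORWARD_IF_ACTIVE = {"WARNING"}


def route(record: LogRecord, account_is_idle: bool) -> Destination:
    """Decide where a record goes based on its priority and account activity."""
    level = record.level.upper()
    if level in ALWAYS_FORWARD:
        return Destination.FORWARD
    if level in FORWARD_IF_ACTIVE and not account_is_idle:
        return Destination.FORWARD
    # Everything else stays local, so idle accounts forward almost nothing.
    return Destination.KEEP_LOCAL
```

The intent is that the per-account forwarding cost tracks how much high-priority data an account produces rather than its total log volume.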

Operational numbers

| Metric | Value |
| --- | --- |
| Lambda functions in production | >1,000,000 |
| Tenant AWS accounts | thousands |
| Microservices | 20+ (mono-repo) |
| Functions per microservice | 5–15 |
| Account provisioning time | <15 minutes |
| Observability cost (before) | ~$3/account/month |
| Observability cost (after) | ~$0.70/account/month |
| Idle account cost (after) | <$1/month |
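
To see why a few dollars per account matters, the per-account observability figures can be multiplied out over a fleet. The article only says "thousands" of accounts, so the fleet size of 5,000 below is a hypothetical value chosen for illustration. The per-account costs are the approximate figures reported above.

```python
def monthly_cost(accounts: int, cost_per_account: float) -> float:
    """Fleet-wide monthly cost for a flat per-account charge."""
    return accounts * cost_per_account


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the per-account cost dropped."""
    return (before - after) / before


ACCOUNTS = 5_000   # hypothetical fleet size; the source says only "thousands"
COST_BEFORE = 3.00  # ~$3/account/month (from the article)
COST_AFTER = 0.70   # ~$0.70/account/month (from the article)

if __name__ == "__main__":
    print(f"before: ${monthly_cost(ACCOUNTS, COST_BEFORE):,.2f}/month")
    print(f"after:  ${monthly_cost(ACCOUNTS, COST_AFTER):,.2f}/month")
    print(f"reduction: {relative_reduction(COST_BEFORE, COST_AFTER):.1%}")
```

The reduction of roughly 77% does not depend on the assumed fleet size, because both costs scale linearly with the number of accounts.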

Caveats

  • The article is written from a customer (ProGlove) perspective, not an AWS internal architecture post. It represents a case study of serverless at scale, not Lambda's own design.
  • Specific optimizations (removing SQS, centralized DLQ) trade resilience for cost — not universally applicable.
  • Numbers reflect ProGlove's specific workload (IoT barcode scanning, bursty and event-driven); results may differ for continuous-compute workloads.
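
The isolation discipline that a centralized DLQ demands can be sketched as a partitioning step that runs before any replay. This is an illustration, not ProGlove's recovery code. It assumes failed events keep the standard EventBridge envelope, where the top-level `account` field holds the 12-digit AWS account ID. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Any


def partition_by_tenant(
    events: list[dict[str, Any]],
) -> tuple[dict[str, list[dict[str, Any]]], list[dict[str, Any]]]:
    """Group failed events by account ID and set aside any without a valid one.

    Events with a missing or malformed account ID are never guessed into a
    tenant bucket; they are returned separately for manual inspection.
    """
    by_account: dict[str, list[dict[str, Any]]] = defaultdict(list)
    rejected: list[dict[str, Any]] = []
    for event in events:
        account = event.get("account")
        is_valid = (
            isinstance(account, str)
            and len(account) == 12
            and account.isascii()
            and account.isdigit()
        )
        if is_valid:
            by_account[account].append(event)
        else:
            rejected.append(event)
    return dict(by_account), rejected
```

Replaying each bucket only into its own account keeps the account ID as the tenant boundary even though the queue itself is shared.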
