6,000 AWS accounts, three people, one platform: Lessons learned (AWS Architecture Blog, 2026-02-25)¶
Summary¶
ProGlove (smart-wearable barcode scanners for frontline workers) runs its Insight SaaS platform on AWS in an account-per-tenant model: every tenant gets a dedicated AWS account, and the full set of microservices is deployed into that account. At the time of writing, that means ~6,000 production AWS accounts operated by a three-person platform team, translating into >120,000 deployed service instances and ~1,000,000 Lambda functions in production. The post is the trade-off retrospective: why the team chose the most extreme end of the multi-account spectrum, what operational mechanisms (AWS Organizations + SCPs + StackSets + Step-Functions-orchestrated account creation + central telemetry + serverless-first service mix) make it feasible with a constant-size ops team, and where the model still forces custom tooling because native patterns haven't caught up.
Key takeaways¶
- Account boundary = cheapest strong isolation AWS offers. "At scale, the AWS account boundary is the easiest way to implement isolation. Accounts are fully isolated containers for compute, storage, networking and more, with no shared scope unless you explicitly configure it." ProGlove took the multi-account strategy "to its logical extreme: every tenant gets their own AWS account." Named benefits: blast-radius containment, strong isolation (no shared storage/compute/perms), simplified developer mental model (one service instance belongs to exactly one tenant — no multi-tenancy code paths), per-tenant customization (toggle premium features / migrate individual tenants independently), and transparent cost attribution via AWS Cost Explorer on linked accounts (Source: sources/2026-02-25-aws-6000-accounts-three-people-one-platform). (concepts/account-per-tenant-isolation, concepts/blast-radius)
- Team-size-constant scaling is the core claim. "Managing thousands of AWS accounts with three people might sound impossible. But with the right architectural choices, every new workload adds only marginal operational load while the platform absorbs the exponential scale. The team size stays constant, and efficiency grows with every account added." The entire post is framed as: shifting complexity from application development to platform development trades automation investment for linear-to-sublinear ops-team growth. (patterns/platform-engineering-investment)
- Account lifecycle: creation automated (Step Functions), retirement manual (scripts). "Account creation is a fully automated process using AWS Step Functions, but the retirement and closure of accounts are performed manually through regularly run scripts." Explicit signal that not every workflow has to be automated equally — the criterion is overhead introduced, not dogma. (patterns/automate-account-lifecycle, systems/aws-step-functions)
- CI/CD at thousands-of-accounts scale via CodePipeline + StackSets. Single monorepo → single CodePipeline execution → single StackSet update operation in a central account → parallel updates propagated across all target tenant accounts. Named failure modes: partial rollouts (retry/rollback must be designed + tested), pipeline duration (large-scale updates take significant time to propagate), tooling maturity (StackSets "powerful but still evolving, and operational edge cases are possible"). Monorepo is load-bearing for enforcing a single version of shared libraries / Lambda layers across thousands of accounts. (patterns/fan-out-stackset-deployment, systems/aws-stacksets)
- Per-resource-billed services are the enemy of account-per-tenant; serverless is the enabler. "Some AWS services are billed per provisioned resource and independent of utilization as opposed to fully scaling to zero when not used." Concrete example: the smallest EC2 instance ≈ USD $3/month, which becomes $3,000/month when deployed into 1,000 accounts. "Services that scale linearly with the number of accounts should be avoided where possible." In contrast, Lambda and DynamoDB scale to zero and bill per-invocation / per-request — the per-unit price "can seem higher" but absorbs the operational overhead and idle-resource wastage that would otherwise multiply by account count. ProGlove's ~1M Lambda-function count / 120k service-instance count only works because of this. (concepts/scale-to-zero, concepts/fine-grained-billing)
- Observability across thousands of accounts demands central aggregation without re-coupling the accounts. "Observability tooling should be centralized, but without reintroducing the very risks that accounts are meant to isolate." ProGlove forwards logs and metrics to a central third-party observability application where multi-alerts are defined once and applied across tenant accounts individually; "engineers interact with a single view, while underlying telemetry still originates from isolated accounts." Key prescriptions: don't replicate per-account alarms blindly — use streaming + aggregation; tag everything consistently (the account's source AWS ID is included in every metric/log so tenant drill-down is cheap); consider AWS Organizations tag policies to enforce the scheme. AWS's own primitive for this — CloudWatch Observability Access Manager — is called out as having "greatly improved cross-account observability features today than when we started" but is the later arrival, not the one ProGlove built against. (patterns/central-telemetry-aggregation)
- Per-account quotas are a distributed quota-management problem. "AWS service limits are enforced per account. In a shared-account model, you monitor a single set of quotas. In an account-per-tenant setup, quota management becomes distributed and harder to predict." Concrete example: the Lambda concurrent-execution quota is per-account; for a tenant under heavier load, "is likely for the corresponding account to experience throttling errors of Lambda functions." Proactive quota requests plus a single-pane-of-glass view of quota usage, adapted as necessary, are essential. (concepts/per-account-quotas)
- Cross-account identity is load-bearing for the ops team. Developers, operations teams, and platform services all operate across accounts daily. This "requires a robust identity model with IAM roles and cross-account trust policies." Explicit warning: avoid long-lived credentials — "these introduce a major security threat and monitoring effort if deployed into many accounts." (systems/aws-iam)
- Baseline guardrails = SCPs + IAM management. The operational-investment checklist: account management (automate everything from creation to decommissioning), baseline guardrails (enforce compliance + security controls via SCPs and strict IAM management), developer training, CI/CD investment, observability discipline. SCPs (concepts/service-control-policy) and AWS Organizations are the substrate.
- Well-Architected Framework items drop out at this level of isolation. "When conducting an AWS Well-Architected Framework review together with AWS, we found that many items from the operational excellence as well as the security pillar didn't even apply to our setup anymore. This made completing those review sections quick and straightforward." Secondary signal that the architectural benefit is large enough to show up in compliance mechanics.
- Tooling/reference-architecture gap for SaaS-tenant-level multi-account. "Although multi-account strategies are common at the enterprise level, adopting them at the SaaS tenant level is less common. Patterns, tooling, and reference architectures are still evolving, which means building custom solutions becomes necessary. Make sure to research available resources and consult AWS so you don't reinvent the wheel." The frankness is the signal: at ProGlove's scale this model is still bleeding-edge in the SaaS-on-AWS idiom.
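The automated account creation from the lifecycle takeaway can be sketched minimally. This is a hypothetical shape for one Step Functions task, assuming boto3 and AWS Organizations — the post shows no code, and the account-naming, email, and tagging scheme here are invented:

```python
# Hypothetical sketch only: one Step Functions task builds the AWS
# Organizations CreateAccount input for a tenant. Names are illustrative,
# not ProGlove's actual workflow.

def create_account_request(tenant_id: str, email_domain: str) -> dict:
    """Build CreateAccount parameters for one tenant account."""
    return {
        "AccountName": f"tenant-{tenant_id}",
        # AWS Organizations requires a unique email address per account.
        "Email": f"aws+{tenant_id}@{email_domain}",
        "IamUserAccessToBilling": "DENY",
        "Tags": [{"Key": "tenant", "Value": tenant_id}],
    }

# The workflow would pass this to
#   boto3.client("organizations").create_account(**params)
# and poll describe_create_account_status until SUCCEEDED before
# deploying the service stack into the new account.
params = create_account_request("acme-42", "example.com")
```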
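The StackSet fan-out from the CI/CD takeaway is naturally parameterized via operation preferences, which is also where the partial-rollout failure mode lives. A hedged sketch — the OU ID, region, and percentages are illustrative assumptions, not ProGlove's values:

```python
# Sketch of a StackSet update targeted at a tenant OU, with operation
# preferences bounding parallelism and failure tolerance. Values invented.

def stackset_update_params(stackset_name: str, tenant_ou_ids: list) -> dict:
    """Build the parameters for one fleet-wide StackSet update."""
    return {
        "StackSetName": stackset_name,
        "DeploymentTargets": {"OrganizationalUnitIds": tenant_ou_ids},
        "Regions": ["eu-central-1"],
        "OperationPreferences": {
            # Touch at most 10% of tenant accounts at once...
            "MaxConcurrentPercentage": 10,
            # ...and abort once more than 1% of accounts fail, leaving a
            # partial rollout that retry/rollback tooling must handle.
            "FailureTolerancePercentage": 1,
        },
    }

# In the pipeline this would feed
#   boto3.client("cloudformation").update_stack_set(**params)
params = stackset_update_params("tenant-services", ["ou-tenants-example"])
```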
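The tag-everything prescription from the observability takeaway amounts to a forwarding-time enrichment step: every record leaving a tenant account carries its source AWS account ID. A minimal illustration with invented field names:

```python
# Illustration of the tagging discipline: records forwarded to the central
# observability sink carry source-account context so per-tenant drill-down
# stays cheap. Field names here are assumptions.

def enrich(record: dict, account_id: str, tenant_id: str) -> dict:
    """Attach source-account context before forwarding."""
    return {
        **record,
        "aws_account_id": account_id,  # source account, as the post prescribes
        "tenant": tenant_id,
    }

event = enrich({"level": "ERROR", "msg": "Lambda throttled"},
               account_id="111122223333", tenant_id="acme-42")
```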
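The "single pane of glass" for distributed quotas reduces to aggregating per-account usage and flagging accounts approaching their limit. The data shape and 80% threshold below are assumptions; real numbers would come from CloudWatch / Service Quotas in each account:

```python
# Sketch of the quota single-pane view: given per-account Lambda concurrency
# usage, return the accounts likely to hit throttling soon.

def quota_hotspots(usage_by_account: dict, quota: int, threshold: float = 0.8):
    """Return account IDs whose usage is at or above threshold * quota."""
    return sorted(acct for acct, used in usage_by_account.items()
                  if used >= quota * threshold)

# One account at 95% of a 1,000-concurrency quota, one comfortably idle.
hot = quota_hotspots({"111122223333": 950, "444455556666": 120}, quota=1000)
```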
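The cross-account identity takeaway can be made concrete as the trust policy a tenant-account role might carry: trust the central operations account, access via short-lived STS credentials. The post shows no policy shapes; the account ID and conditions here are placeholders:

```python
# Generic sketch, not ProGlove's actual model: each tenant account hosts a
# role trusting a central operations account, assumed via STS rather than
# long-lived credentials.

OPS_ACCOUNT_ID = "111122223333"  # hypothetical central operations account

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{OPS_ACCOUNT_ID}:root"},
        "Action": "sts:AssumeRole",
        # A production policy would typically narrow this further with an
        # ExternalId or aws:PrincipalTag condition.
    }],
}

# Ops tooling would then call
#   boto3.client("sts").assume_role(RoleArn=..., RoleSessionName=...)
# per tenant account to obtain temporary credentials.
```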
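The post names SCPs as the guardrail substrate but gives no examples. As a generic illustration (not ProGlove's policy), a baseline SCP often combines two common controls: deny leaving the organization, and deny activity outside a home region except for global services:

```python
# Illustrative baseline SCP, invented for this note: two common guardrails
# applied org-wide to tenant accounts.

baseline_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLeaveOrganization",
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        },
        {
            "Sid": "DenyOutsideHomeRegion",
            "Effect": "Deny",
            # Exempt global services that don't resolve to a region.
            "NotAction": ["iam:*", "organizations:*", "sts:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": "eu-central-1"}
            },
        },
    ],
}
```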
Numbers¶
- ~6,000 AWS accounts in production (implied by title; body says "thousands").
- 3-person platform team.
- >120,000 deployed service instances.
- ~1,000,000 Lambda functions in production.
- Smallest EC2 instance ≈ USD $3/month — ~$3,000/month at 1,000-account deployment (concrete cost-multiplier illustration).
- Microservice count per tenant account: not disclosed beyond "the full set of microservices that the tenant requires."
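The EC2-vs-serverless cost multiplier from the numbers above, worked as arithmetic. The EC2 figure is the post's; the Lambda per-request price (~USD 0.20 per million requests) is public list pricing, included only for contrast:

```python
# Per-provisioned-resource charges multiply by account count; scale-to-zero
# services cost nothing for idle tenants regardless of deployed function count.
accounts = 1_000
ec2_smallest_usd_per_month = 3
idle_ec2_fleet_cost = accounts * ec2_smallest_usd_per_month  # USD/month, even if idle

requests_from_idle_tenant = 0  # scale-to-zero: no traffic, no charge
lambda_idle_cost = accounts * requests_from_idle_tenant * (0.20 / 1_000_000)
```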
Architecture artifacts¶
Three diagrams referenced in the post (hosted at cloudfront.net, not reproduced here):
- Multi-account hierarchy — Root + Audit + Monitoring + Deployment + Tenant accounts, with services deployed into Tenant accounts.
- Account lifecycle management — automated provisioning via Step Functions + CloudFormation, manual retirement via scripts.
- StackSet deployment topology — central Infrastructure account → CodePipeline → StackSet → many Tenant accounts in parallel.
Caveats / what's not covered¶
- No latency / throughput numbers for the control plane (account provisioning time, StackSet propagation time for fleet-wide update).
- No failure-mode / incident retrospective (partial-rollout recovery is described as a concern, not exemplified with a specific incident).
- No disclosure of total platform-engineering cost vs. a shared-account baseline; the core "team size stays constant" claim isn't quantitatively compared.
- No concrete SCP examples, no IAM-role cross-account-trust policy shape.
- Third-party observability vendor is unnamed; the post indicates one is used but treats it as implementation detail.
- Account limit per AWS Organization is not discussed; ProGlove runs ~6,000 accounts (the default Organizations quota is 10 accounts, and raised quotas typically reach many thousands for a mature AWS Organizations customer, but the exact operational ceiling isn't called out).
- Retirement script content, invocation frequency, and the "regularly run" cadence are not specified.
- Monorepo structure (language, build system, deploy targets per service) is mentioned as a mechanism but not detailed.
- Post is a two-author AWS + customer collaboration (Julius Blank, ProGlove); the narrative is architectural-lessons format, not a postmortem. The AWS Architecture Blog is Tier 1, but the post format is prescriptive-retrospective rather than an AWS-service-team disclosure.
Relationship to other wiki sources¶
- sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization — Convera's multi-tenant SaaS chose the opposite end of the spectrum: shared accounts + per-tenant policy stores in AVP (see patterns/per-tenant-policy-store, concepts/tenant-isolation). Convera enforces tenant isolation at five authorization layers inside a single account; ProGlove enforces at the account boundary and lets authorization simplify because each service instance already runs in a single-tenant account. Canonical pair of data points for the tenant-isolation-mechanism spectrum.
- sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty — the same Organizations / IAM primitives but at the partition boundary instead of the tenant boundary; illustrates Organizations-as-fabric at two different radii.
- sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years — the ProGlove story is one of the most extreme instances of Lambda's scale-to-zero / per-invocation billing design tenets paying off: 1,000,000 Lambda functions in production is only viable because idle functions cost nothing.
- sources/2026-02-04-aws-amazon-key-eventbridge-event-driven-architecture — Amazon Key's patterns/single-bus-multi-account is the complementary EventBridge shape for cross-account traffic within a single organization; ProGlove's story is the broader "one-Organization-many-tenant-accounts" skeleton inside which such single-bus-multi-account buses live.