PATTERN Cited by 1 source
Hybrid multi-tenant architecture¶
Pattern¶
Provide tenant isolation at the compute cluster level inside shared AWS accounts — each tenant gets a dedicated ECS cluster loading only its tenant's in-memory state, while all tenants in a tier share account-level resources (VPC, ALB, IAM roles, PrivateLink endpoints to downstream services). Dependencies are pre-wired at tier creation, not at tenant onboarding, which reduces new-tenant onboarding from months to days while preserving strong per-tenant runtime isolation for stateful services.
Canonicalised on the wiki by the AWS Architecture Blog's 2026-05-12 ad-serving-platform post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services):
- Before: per-tenant AWS account with dedicated ALB + ECS. 18 clients × 4 regions = 72 targets; 52-day onboarding; 3% CPU / 19% memory / 98% wait; noisy-neighbor still present when tenants accidentally shared infrastructure.
- After: tier → cell → infra group three-level hierarchy. Route 53 weighted routing to the tier endpoint; per-tenant ECS clusters within shared-account infra groups; PrivateLink endpoints shared at tier level. 52-day onboarding → 7 days (−86%); 80% fewer infrastructure setup steps per tenant; up to 100 tenants per AWS account.
Problem the pattern solves¶
Stateful services with in-memory tenant state have a structural isolation requirement:
- Task-per-tenant doesn't work — co-located tasks share kernel page cache and can OOM each other.
- Shared-cluster doesn't work — shared heaps mean one tenant's growth triggers GC pauses across tenants.
- Account-per-tenant works but is slow — 30–60 day onboarding is a business constraint.
The hybrid pattern finds the smallest grain that structurally contains the in-memory-state blast radius (the ECS cluster) and pushes everything else (VPC, IAM, PrivateLink) to shared tier-level infrastructure where it can be amortised across tenants.
Minimum components¶
- Tier — top-level grouping of tenants by traffic / isolation / SLA profile (e.g., High TPS, Standard TPS, Low TPS). A tier owns a shared DNS endpoint, IAM role set, and PrivateLink endpoints to downstream services. See patterns/tier-cell-infra-group-hierarchy.
- Cell — an AWS account inside the tier. Cells are the horizontal-scaling unit at account level (add cells to scale beyond per-account AWS quotas).
- Infra group — a VPC + ALB + ECS-cluster set inside a cell. Infra groups are the horizontal-scaling unit within a cell (add infra groups to scale beyond per-ALB target-group quotas).
- Per-tenant ECS cluster — inside each infra group's VPC; dedicated to one tenant. See patterns/dedicated-ecs-cluster-per-tenant.
- ALB with per-tenant listener rules — path-based or header-based routing to tenant target groups. See patterns/alb-path-routing-per-tenant.
- Route 53 weighted DNS to tier endpoint — distributes traffic across ALBs (across infra groups, across cells). Tier endpoint stays stable as tier grows. See patterns/weighted-dns-traffic-shifting.
- Shared PrivateLink endpoints at tier level — pre-established at tier creation so tenants inherit connectivity automatically. See patterns/shared-privatelink-at-tier-level.
- Tier-level IAM roles — assigned to ECS task definitions so tenants inherit downstream-service permissions without per-tenant role creation.
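The component list above can be read as a small containment model. The following is a minimal sketch, not the platform's actual code: the class names, the `onboard` helper, and the cluster-naming scheme are hypothetical, and the 50-tenant ceiling is borrowed from the capacity figures on this page.

```python
from dataclasses import dataclass, field

# Assumed ceiling, taken from this page's capacity figures (~50 tenants per infra group).
MAX_TENANTS_PER_INFRA_GROUP = 50


@dataclass
class InfraGroup:
    """One VPC + ALB plus the dedicated per-tenant ECS clusters it hosts."""
    name: str
    tenant_clusters: dict[str, str] = field(default_factory=dict)  # tenant_id -> cluster name

    def has_room(self) -> bool:
        return len(self.tenant_clusters) < MAX_TENANTS_PER_INFRA_GROUP


@dataclass
class Cell:
    """One AWS account inside a tier."""
    account_id: str
    infra_groups: list[InfraGroup] = field(default_factory=list)


@dataclass
class Tier:
    """Owns the shared DNS endpoint, IAM role, and PrivateLink endpoints."""
    name: str
    dns_endpoint: str
    task_role_arn: str
    privatelink_endpoints: list[str]
    cells: list[Cell] = field(default_factory=list)

    def onboard(self, tenant_id: str) -> InfraGroup:
        """Place a tenant in the first infra group with room.

        Only a cluster entry is created; DNS, IAM, and PrivateLink are
        inherited from the tier and are not touched here.
        """
        for cell in self.cells:
            for group in cell.infra_groups:
                if group.has_room():
                    # Hypothetical naming scheme for the dedicated cluster.
                    group.tenant_clusters[tenant_id] = f"{self.name}-{tenant_id}"
                    return group
        raise RuntimeError("tier is full: add an infra group or a cell first")
```

The point of the sketch is that onboarding mutates only the innermost level of the hierarchy, which is where the per-tenant work is meant to stay.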
Canonical architecture (AWS 2026-05-12)¶
              Internet / upstream callers
                           │ HTTPS
                           ▼
        ┌─────────────────────────────────────┐
        │      Route 53 weighted record       │
        │    tier-1.us-east-1.example.com     │
        └─────────────────────────────────────┘
              │                        │
    (weight)  │                        │ (weight)
              ▼                        ▼
    ┌───────────────────┐    ┌───────────────────┐
    │ Cell 1            │    │ Cell 2            │
    │ (AWS account A)   │    │ (AWS account B)   │
    │                   │    │                   │
    │ ┌──────────────┐  │    │ ┌──────────────┐  │
    │ │ Infra group 1│  │    │ │ Infra group 1│  │
    │ │ VPC + ALB    │  │    │ │ VPC + ALB    │  │
    │ │ 50 tenants   │  │    │ │ 50 tenants   │  │
    │ └──────────────┘  │    │ └──────────────┘  │
    │ ┌──────────────┐  │    │ ┌──────────────┐  │
    │ │ Infra group 2│  │    │ │ Infra group 2│  │
    │ │ VPC + ALB    │  │    │ │ VPC + ALB    │  │
    │ │ 50 tenants   │  │    │ │ 50 tenants   │  │
    │ └──────────────┘  │    │ └──────────────┘  │
    │                   │    │                   │
    │ Tier PrivateLink  │    │ Tier PrivateLink  │
    │ endpoints → DS1   │    │ endpoints → DS1   │
    │           → DS2   │    │           → DS2   │
    └───────────────────┘    └───────────────────┘
              │                        │
              └────────┬───────────────┘
                       ▼
         ┌────────────────────────────┐
         │ Downstream service VPCs    │
         │ (per-tier endpoint service)│
         └────────────────────────────┘
Capacity math¶
From the AWS canonical instance:
| Quota | Value | Source |
|---|---|---|
| ALB target groups | 100 per LB | AWS quota |
| Target groups per listener rule | 5 | AWS quota |
| Listener rules (capacity-usable) | 20 per ALB | derived |
| Tenants per infra group | ~50 | (20 rules × 5 target groups) ÷ 2 target groups per tenant on average |
| ECS clusters per tenant | up to 5 | design choice |
| ECS clusters per infra group | up to 100 | derived |
| Infra groups per cell | 3–4 before account limits | heuristic |
| Route 53 weighted records | 10,000 per zone | AWS quota |
| ECS tasks per service | 5,000 (per tenant!) | AWS quota |
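The derived rows in the table can be reproduced with a few lines of arithmetic. This is a sketch using the table's own figures as defaults; the function names are hypothetical, and the average of two target groups per tenant is the table's stated assumption rather than a measured value.

```python
import math


def tenants_per_infra_group(usable_listener_rules: int = 20,
                            target_groups_per_rule: int = 5,
                            avg_target_groups_per_tenant: float = 2.0) -> int:
    """Tenants one ALB can front before its target groups run out."""
    # 20 rules x 5 target groups = 100, which matches the per-ALB target-group quota.
    usable_target_groups = usable_listener_rules * target_groups_per_rule
    return int(usable_target_groups // avg_target_groups_per_tenant)


def infra_groups_needed(tenant_count: int, tenants_per_group: int) -> int:
    """Infra groups required for a tenant count, rounding up."""
    return math.ceil(tenant_count / tenants_per_group)


per_group = tenants_per_infra_group()
print(per_group)                             # 50
print(infra_groups_needed(120, per_group))   # 3
```

If tenants average more target groups each (for example, blue/green plus a canary), the ceiling drops proportionally, so the ~50 figure should be treated as workload-specific.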
Scaling rules (from the post)¶
- Vertical first — when a single tenant's traffic grows and its infra group is still below the 50-tenant limit, increase ECS task CPU / memory or switch to larger EC2 instance types. "Minutes vs. hours and doesn't require Route 53 changes."
- Add infra group — when approaching the 50-tenant ceiling or when multiple tenants need capacity simultaneously. Within the same cell (AWS account). Add a new Route 53 weighted record.
- Add cell — when approaching AWS account-level limits (ENIs, VPC endpoints). "Typically after 3–4 infra groups per cell." New AWS account + identical tier stack; register its ALBs in Route 53 with weighted records alongside existing cells.
| Trigger | Action | Unit added |
|---|---|---|
| ALB target-group limit (~50 tenants per infra group) | Add infra group | VPC + ALB + ECS clusters |
| AWS account-level limits (ENIs, VPC endpoints) | Add cell | AWS account |
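The three scaling rules can be sketched as a single decision function. Several inputs here are assumptions rather than anything the post specifies: the 90% headroom threshold is illustrative, and the infra-group count stands in as a proxy for the real account-level signals (ENIs, VPC endpoints), following the post's "typically after 3–4 infra groups per cell" heuristic.

```python
from enum import Enum


class ScaleAction(Enum):
    VERTICAL = "resize ECS tasks or instance types"
    ADD_INFRA_GROUP = "add infra group (VPC + ALB + ECS clusters)"
    ADD_CELL = "add cell (new AWS account)"


def next_scale_action(tenants_in_group: int,
                      infra_groups_in_cell: int,
                      tenants_needing_capacity: int = 1,
                      tenant_ceiling: int = 50,
                      max_groups_per_cell: int = 4,
                      headroom: float = 0.9) -> ScaleAction:
    """Pick the cheapest scaling lever that still applies."""
    near_ceiling = tenants_in_group >= tenant_ceiling * headroom
    # Vertical first: one tenant growing, infra group not close to its ceiling.
    if not near_ceiling and tenants_needing_capacity <= 1:
        return ScaleAction.VERTICAL
    # Proxy for account-level limits: the cell already holds its maximum infra groups.
    if infra_groups_in_cell >= max_groups_per_cell:
        return ScaleAction.ADD_CELL
    return ScaleAction.ADD_INFRA_GROUP


print(next_scale_action(20, 1).name)   # VERTICAL
print(next_scale_action(48, 2).name)   # ADD_INFRA_GROUP
print(next_scale_action(48, 4).name)   # ADD_CELL
```

A production version would read actual quota utilisation instead of counting infra groups; the ordering of the checks is the part that mirrors the rules above.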
The tier endpoint (tier-1.us-east-1.example.com) remains
stable across every growth event. Tenants don't update DNS
as the tier scales.
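One way to keep the endpoint stable is to publish every ALB under the same record name with its own weight. The sketch below only builds the change batch as plain data, in the shape accepted by the Route 53 `ChangeResourceRecordSets` API (for example via boto3's `change_resource_record_sets`); it makes no AWS calls. The helper name, the ALB DNS names, and the `ZEXAMPLE...` zone IDs are placeholders, not real values.

```python
def weighted_alias_changes(tier_endpoint: str, albs: list[dict]) -> dict:
    """Build a ChangeBatch with one weighted alias record per ALB.

    Every record shares the tier endpoint name; SetIdentifier tells them apart.
    """
    changes = []
    for alb in albs:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": tier_endpoint,
                "Type": "A",
                "SetIdentifier": alb["id"],        # must be unique per record
                "Weight": alb["weight"],           # Route 53 accepts 0-255
                "AliasTarget": {
                    "HostedZoneId": alb["canonical_zone_id"],
                    "DNSName": alb["dns_name"],
                    "EvaluateTargetHealth": True,
                },
            },
        })
    return {"Comment": f"weighted records for {tier_endpoint}", "Changes": changes}


# Placeholder ALBs: one per infra group, across two cells.
batch = weighted_alias_changes("tier-1.us-east-1.example.com", [
    {"id": "cell1-ig1", "weight": 50,
     "canonical_zone_id": "ZEXAMPLEALB1", "dns_name": "ig1.cell1.elb.example.com"},
    {"id": "cell2-ig1", "weight": 50,
     "canonical_zone_id": "ZEXAMPLEALB2", "dns_name": "ig1.cell2.elb.example.com"},
])
```

Adding an infra group or a cell then means appending one more entry to the list; the record name that tenants resolve never changes.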
Observability shape¶
Two altitudes, both dimensioned:
- Tenant-level — memory usage per ECS service (70% warn / 85% critical), TargetResponseTime per ALB target group (100–200 ms baseline for stateful; alert on 2× for 5 min), HTTPCode_Target_5XX_Count per target group.
- Tier-level — ALB ActiveConnectionCount, ProcessedBytes, Route 53 health-check status, ECS cluster CPU / memory reservation.
Single CloudWatch log group per tier with structured log fields
(tenant_id, tier_id, region) in every entry. Per-tenant
log streams via prefix; tenant-aware CloudWatch Logs Insights
queries for cross-tenant error-rate analysis.
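The thresholds and log fields above can be sketched in a few helper functions. This is illustrative only: the function names are hypothetical, the latency check assumes the caller passes samples that together cover the five-minute window, and the extra `level` field in the usage line is an assumption, not a field the post names.

```python
import json


def memory_severity(used_pct: float) -> str:
    """Map per-service memory usage to the 70% warn / 85% critical bands."""
    if used_pct >= 85:
        return "critical"
    if used_pct >= 70:
        return "warn"
    return "ok"


def latency_breached(samples_ms: list[float], baseline_ms: float,
                     factor: float = 2.0) -> bool:
    """True when every sample in the window is at or above factor x baseline.

    The window is assumed to span the 5 minutes named in the alert rule.
    """
    return bool(samples_ms) and all(s >= factor * baseline_ms for s in samples_ms)


def log_entry(tenant_id: str, tier_id: str, region: str,
              message: str, **extra) -> str:
    """Serialise one structured log line carrying the three required fields."""
    record = {"tenant_id": tenant_id, "tier_id": tier_id,
              "region": region, "message": message}
    record.update(extra)
    return json.dumps(record, sort_keys=True)


print(memory_severity(72.5))                          # warn
print(latency_breached([310, 320, 400], 150))         # True
print(log_entry("acme", "tier-1", "us-east-1", "cache warm", level="INFO"))
```

Emitting `tenant_id` on every line is what makes a single tier-wide log group workable, since per-tenant views become a filter rather than a separate pipeline.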
Measured payoff (AWS canonical instance)¶
- Tenant onboarding time: 52 days → 7 days (−86%)
- Infrastructure setup steps per tenant: −80%
- Engineering effort per onboarding: −80%
- Feature release time: 2–3 days → 1 day
- Tenant capacity: up to 100 per AWS account with cluster-level isolation
Load-bearing design decisions¶
- Cluster-level isolation, not task-level. The in-memory-state property makes smaller grains unsafe.
- Pre-wire dependencies at tier level. The primary source of the 80% setup-step reduction. See concepts/pre-integration-at-tier-creation.
- Weighted DNS at the tier endpoint. Absorbs both horizontal-scaling levers (infra-group and cell) without client changes.
- Shared IAM roles at tier level, not per-tenant. New tenants receive tier permissions automatically; eliminates per-tenant IAM setup.
- Tier-based SLA segmentation. High / Standard / Low TPS tiers absorb SLA heterogeneity without per-tenant custom infrastructure.
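The shared-IAM decision can be illustrated by how a per-tenant task definition might be assembled. The sketch below builds a plain dict using field names from the ECS `RegisterTaskDefinition` API (`family`, `taskRoleArn`, `containerDefinitions`); it is a deliberately incomplete fragment with no CPU, memory, or execution-role settings. The helper name, the environment variable names, the image URI, and the role ARN are hypothetical.

```python
def tenant_task_definition(tier: dict, tenant_id: str, image: str) -> dict:
    """Assemble a task-definition fragment that reuses the tier's task role."""
    return {
        "family": f"{tier['name']}-{tenant_id}",
        # Tier-level role: identical for every tenant in the tier,
        # so onboarding creates no IAM resources.
        "taskRoleArn": tier["task_role_arn"],
        "containerDefinitions": [{
            "name": "app",
            "image": image,
            "essential": True,
            "environment": [
                {"name": "TENANT_ID", "value": tenant_id},
                {"name": "TIER_ID", "value": tier["name"]},
            ],
        }],
    }


tier = {"name": "tier-1",
        "task_role_arn": "arn:aws:iam::111111111111:role/tier-1-task"}
acme = tenant_task_definition(tier, "acme", "example.com/ad-server:1.0")
globex = tenant_task_definition(tier, "globex", "example.com/ad-server:1.0")
```

The two definitions differ only in `family` and the tenant environment variable; the role ARN is the same object of trust for both, which is exactly what the heterogeneous-IAM anti-pattern would break.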
When to use¶
- Stateful services with in-memory tenant state at throughput tiers where fetch-on-request is infeasible (millions of requests per second).
- Tens to low-thousands of tenants where account-per-tenant onboarding tax is unacceptable but task-per-tenant isolation is too weak.
- Moderate SLA heterogeneity across tenants — tier structure absorbs this without per-tenant customisation.
- Onboarding time is a business constraint (concurrent customer events, revenue-per-onboarded-customer, contract SLAs).
When not to use¶
- Compliance / regulatory isolation requirements — account-per-tenant (or cross-AWS-organization) may be mandatory.
- Stateless services — overkill; JWT-claim or row-level tenant isolation is sufficient.
- Very large tenants (one tenant saturates a whole AWS account's quotas) — needs account-per-tenant.
- Small tenant counts (<10) — account-per-tenant onboarding overhead amortises acceptably.
- Teams without platform-engineering capacity — running 100 ECS clusters per infra group demands discipline in deploy automation, monitoring, and rollback.
Anti-patterns¶
- Per-tenant PrivateLink / IAM role setup at onboarding — defeats the pre-integration payoff.
- Shared ECS cluster across tenants in a single infra group — re-introduces heap-sharing and noisy-neighbor exposure.
- Per-tenant Route 53 records — ties tenants to specific ALBs, preventing transparent horizontal scaling.
- Heterogeneous IAM policies across tenants in the same tier — forces per-tenant IAM setup, eliminating the tier-level amortisation.
- Tier-promotion by in-place reconfiguration — tenants should be moved to a different tier via DNS re-weighting and cache re-warm, not by rewriting tier-level shared dependencies.
Caveats¶
- ALB-level blast radius. One ALB outage affects up to 50 tenants in the infra group. Production tiers must invest in ALB availability engineering (e.g., split infra groups across AZs; Route 53 health-check-driven failover between cells).
- Pre-integration forfeits per-tenant dependency flexibility. Tenants needing a different downstream service must be promoted to a different tier rather than customised in place.
- Shared remote cache is the data-layer assumption. If the stateful service uses per-tenant databases, the data-layer sharing property breaks down and the pattern's cost model changes.
- Route 53 TTL-bounded rebalancing. Traffic shifts across cells propagate at DNS-TTL speed; rapid rebalancing during outages needs supplementary mechanisms.
- Capacity math assumes near-uniform tenants. Highly unequal tenants need explicit tenant-to-cell assignment rather than pure weighted-round-robin.
- The post doesn't disclose cost per ECS cluster vs one shared cluster. Cluster-per-tenant has non-trivial EC2 / Auto Scaling Group idle cost the post doesn't quantify.
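For the unequal-tenants caveat, explicit assignment could be as simple as a greedy largest-first placement onto the least-loaded cell. The post does not describe a placement algorithm, so this is a swapped-in textbook heuristic, and the function name and load units are hypothetical. It also ignores the per-infra-group tenant ceiling, which a real placer would have to enforce.

```python
import heapq


def assign_tenants(tenant_load: dict[str, float],
                   cells: list[str]) -> dict[str, list[str]]:
    """Place the heaviest tenants first, each on the currently lightest cell."""
    heap = [(0.0, cell) for cell in cells]   # (accumulated load, cell name)
    heapq.heapify(heap)
    placement: dict[str, list[str]] = {cell: [] for cell in cells}
    # Sort by load descending, then by name so ties are deterministic.
    for tenant, load in sorted(tenant_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, cell = heapq.heappop(heap)
        placement[cell].append(tenant)
        heapq.heappush(heap, (total + load, cell))
    return placement


loads = {"a": 90.0, "b": 10.0, "c": 10.0, "d": 10.0}
print(assign_tenants(loads, ["cell-1", "cell-2"]))
# {'cell-1': ['a'], 'cell-2': ['b', 'c', 'd']}
```

Here the one large tenant ends up alone in its cell, whereas equal DNS weights across cells would have sent it the same traffic share as the three small tenants combined.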
Relationship to neighbouring patterns¶
- patterns/tier-cell-infra-group-hierarchy — the three-level scaling structure.
- patterns/dedicated-ecs-cluster-per-tenant — the isolation mechanism.
- patterns/alb-path-routing-per-tenant — the routing mechanism.
- patterns/shared-privatelink-at-tier-level — the dependency-sharing mechanism.
- patterns/weighted-dns-traffic-shifting — the traffic-distribution primitive, extended for horizontal-scale-out within a tier (not just migration-phase weighting).
- patterns/configuration-driven-tenant-onboarding — the outcome this pattern enables.
- patterns/cell-based-architecture-for-blast-radius-reduction — the Well-Architected cell-based pattern applied at the AWS account level; hybrid multi-tenant composes with it at the account (cell) grain.
Seen in¶
- sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki anchor. AWS Architecture Blog's disclosure of the stateful ad-serving platform migration. 52d → 7d onboarding, 80% setup-step reduction, 100 tenants/account capacity, tier / cell / infra-group three-level hierarchy, pre-wiring-at-tier-creation as the load-bearing decision.
Related¶
- concepts/hybrid-multi-tenant-architecture
- concepts/cluster-level-tenant-isolation
- concepts/in-memory-tenant-state
- concepts/pre-integration-at-tier-creation
- concepts/tenant-onboarding-time
- concepts/cell-based-architecture
- concepts/noisy-neighbor
- patterns/tier-cell-infra-group-hierarchy
- patterns/dedicated-ecs-cluster-per-tenant
- patterns/alb-path-routing-per-tenant
- patterns/shared-privatelink-at-tier-level
- patterns/weighted-dns-traffic-shifting
- patterns/configuration-driven-tenant-onboarding
- systems/amazon-ecs
- systems/aws-alb
- systems/amazon-route53
- systems/aws-privatelink
- systems/aws-iam
- systems/aws-cloudwatch