CONCEPT Cited by 1 source
Hybrid multi-tenant architecture¶
Definition¶
Hybrid multi-tenant architecture names the class of SaaS deployment shapes that mix isolated and shared infrastructure at different layers of the stack rather than committing to a purely shared-everything or purely isolated-per-tenant topology. The "hybrid" adjective refers to isolation grain, not hybrid cloud.
The canonical instance on this wiki (AWS Architecture Blog, 2026-05-12, ad-serving stateful service) places the isolation boundary at the ECS cluster level inside a shared AWS account: each tenant gets its own dedicated ECS cluster (isolated compute + in-memory heap), while all tenants in a tier share the account's VPC, IAM roles, and PrivateLink endpoints to downstream services, with their ALB listener rules co-located on a shared ALB per infra group.
Verbatim from the post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services):
"We designed a hybrid multi-tenant architecture that provides cluster-level isolation within shared accounts."
Isolation-grain spectrum¶
Tenant isolation admits a spectrum of grains, each with different cost / onboarding-time / isolation-strength trade-offs:
| Grain | Boundary | Shared layer | Canonical wiki instance |
|---|---|---|---|
| Account-per-tenant | AWS account | Organisation-level billing + SCPs | concepts/account-per-tenant-isolation (ProGlove — 6,000 accounts) |
| Cluster-per-tenant in shared account | ECS cluster | Account, VPC, IAM, PrivateLink, ALB | This concept (AWS 2026-05-12 ad-serving) |
| Task-per-tenant in shared cluster | ECS task / pod | Cluster scheduler, compute nodes | Generic K8s tenant-per-namespace |
| Row-per-tenant in shared database | Database row (tenant_id column) | Database schema, connection pool | concepts/tenant-isolation (Convera + Datadog) |
The hybrid architecture splits the difference between account-per-tenant and task-per-tenant: it provides enough isolation for stateful workloads that hold per-tenant state in memory, without the 52-day-per-tenant onboarding overhead of account-per-tenant.
Why "hybrid" is the right word¶
The architecture is simultaneously:
- Isolated at the compute layer (ECS cluster, in-memory state, noisy-neighbor containment).
- Shared at the infrastructure layer (VPC, ALB, IAM, PrivateLink, downstream-service connectivity).
- Shared at the data layer (a shared remote cache owned by the tier, not per-tenant data stores).
The result is neither a purely shared nor a purely isolated architecture: the isolation boundary is drawn only where a structural property demands it. In this case, in-memory tenant state forces cluster-level isolation to prevent noisy-neighbor OOM cascades.
The structural property the shape preserves¶
Each tenant's ECS cluster loads only that tenant's data into memory at startup. Two tenants never share a Java heap, a JVM, or an EC2 instance's RAM. Verbatim:
"Because each ECS cluster is single-tenant, in-memory data loaded at startup belongs exclusively to that tenant with no shared heap between tenants."
This is the property that makes shared-cluster and shared-task grains unavailable for stateful-in-memory services — one tenant's large dataset can trigger OOM that affects its neighbors. See concepts/in-memory-tenant-state for the full driver analysis.
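A minimal sketch of this containment argument, under a deliberately simplified model: each compute boundary has one fixed memory budget, and every tenant placed on it loads its whole dataset at startup. The `Cluster` class, the `oom_victims` helper, and all sizes are hypothetical illustrations and do not come from the post.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical compute boundary with a fixed memory budget in MiB."""
    name: str
    memory_mib: int


def oom_victims(placement: dict[str, Cluster], dataset_mib: dict[str, int]) -> set[str]:
    """Return the tenants whose cluster exceeds its memory budget at startup.

    Every tenant on an over-budget cluster counts as a victim, because an
    out-of-memory failure hits the shared memory, not only the tenant
    whose dataset grew.
    """
    load: dict[str, int] = {}
    for tenant, cluster in placement.items():
        load[cluster.name] = load.get(cluster.name, 0) + dataset_mib[tenant]
    return {t for t, c in placement.items() if load[c.name] > c.memory_mib}


# Illustrative dataset sizes: tenant "a" is far larger than its neighbors.
datasets = {"a": 9000, "b": 1000, "c": 1000}

# Shared grain: all tenants land in one memory budget.
shared = Cluster("shared", 8192)
shared_placement = {t: shared for t in datasets}

# Cluster-per-tenant grain: each tenant gets its own budget.
isolated_placement = {t: Cluster(f"cluster-{t}", 8192) for t in datasets}

print(sorted(oom_victims(shared_placement, datasets)))
print(sorted(oom_victims(isolated_placement, datasets)))
```

With these numbers the shared placement fails for all three tenants, while the isolated placement fails only for tenant `a`, whose own dataset exceeds its own budget. The neighbors are unaffected, which is the containment property described above.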
The 3-level hierarchy in the canonical instance¶
The hybrid shape is organised with three nested scaling levers (see patterns/tier-cell-infra-group-hierarchy):
- Tier — top-level grouping of tenants by traffic profile (High TPS, Standard TPS, Low TPS). Tier is where shared dependencies are pre-wired.
- Cell — AWS account; the unit of horizontal scale-out at the account level.
- Infra group — the VPC + ALB + ECS-cluster set inside a cell; the unit of horizontal scale-out within an account.
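The three levels above can be modelled as nested containers with a capacity at each level. The following is a sketch only: the class names, the first-fit placement policy, and the capacity parameters are assumptions made for illustration, and the post does not describe its placement logic in this form.

```python
from dataclasses import dataclass, field


@dataclass
class InfraGroup:
    """VPC + ALB + ECS-cluster set; holds one ECS cluster per tenant."""
    max_tenants: int
    tenants: list[str] = field(default_factory=list)


@dataclass
class Cell:
    """One AWS account; holds a bounded number of infra groups."""
    max_infra_groups: int
    infra_groups: list[InfraGroup] = field(default_factory=list)


@dataclass
class Tier:
    """Top-level grouping by traffic profile; grows by adding cells."""
    name: str
    tenants_per_infra_group: int  # hypothetical per-ALB capacity
    infra_groups_per_cell: int    # hypothetical per-account capacity
    cells: list[Cell] = field(default_factory=list)

    def place(self, tenant: str) -> tuple[int, int]:
        """First-fit placement; returns (cell index, infra-group index)."""
        for ci, cell in enumerate(self.cells):
            # Lever 1: fill an existing infra group that still has room.
            for gi, group in enumerate(cell.infra_groups):
                if len(group.tenants) < group.max_tenants:
                    group.tenants.append(tenant)
                    return ci, gi
            # Lever 2: add an infra group inside the same account.
            if len(cell.infra_groups) < cell.max_infra_groups:
                cell.infra_groups.append(
                    InfraGroup(self.tenants_per_infra_group, [tenant]))
                return ci, len(cell.infra_groups) - 1
        # Lever 3: every account is full, so add a new cell.
        group = InfraGroup(self.tenants_per_infra_group, [tenant])
        self.cells.append(Cell(self.infra_groups_per_cell, [group]))
        return len(self.cells) - 1, 0


tier = Tier("standard-tps", tenants_per_infra_group=2, infra_groups_per_cell=2)
placements = [tier.place(f"tenant-{i}") for i in range(5)]
print(placements)
```

With the toy capacities of two tenants per infra group and two infra groups per cell, the fifth tenant spills into a second cell. In a real deployment the two capacity parameters would be derived from the AWS limits that apply at each level.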
"As you scale from 10 to 100 to 1,000 tenants, you will reach different AWS limits at different scales. Application Load Balancer target group limits constrain how many tenants fit in a single load balancer. AWS account limits on Elastic Network Interfaces (ENIs) and VPC endpoints constrain how many load balancers fit in a single account."
What makes the pattern work economically¶
The architecture depends on pre-wiring shared dependencies at tier creation, not at tenant onboarding. Once a tier exists with PrivateLink endpoints, IAM roles, and downstream-service integrations established, adding a tenant is a configuration change rather than an infrastructure-provisioning exercise. See concepts/pre-integration-at-tier-creation and patterns/shared-privatelink-at-tier-level.
The measured outcome (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services):
- Tenant onboarding time: 52 days → 7 days (−86%)
- Infrastructure setup steps per tenant: −80%
- Engineering effort per onboarding: −80%
- Feature release time: 2–3 days → 1 day
- Tenant capacity: up to 100 tenants per AWS account with cluster-level isolation
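The percentage for onboarding time can be re-derived from the two endpoints. The helper below is a plain arithmetic check, not code from the post; `percent_reduction` is a hypothetical name. The feature-release percentages are derived here from the quoted day counts and are not stated in the source.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# Tenant onboarding: 52 days -> 7 days.
onboarding = percent_reduction(52, 7)

# Feature release: 2-3 days -> 1 day, so the reduction is a range.
release_low = percent_reduction(2, 1)
release_high = percent_reduction(3, 1)

print(f"onboarding: {onboarding:.1f}%")
print(f"release: {release_low:.1f}% to {release_high:.1f}%")
```

The onboarding reduction works out to roughly 86.5%, consistent with the −86% figure quoted above.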
When the shape fits¶
- Stateful services with per-tenant in-memory state that cannot safely share a heap (ad serving, session stores, per-tenant recommendation models).
- Tenant counts from tens to low thousands — below 10 tenants, account-per-tenant may still pencil out; above a few thousand, cluster-per-tenant hits account-level AWS quotas and needs re-partitioning into more cells.
- Moderate to high SLA heterogeneity across tenants — tier promotion is how different-SLA tenants get different-shaped infrastructure without per-tenant custom design.
- Engineering-onboarding-time is a business constraint — if new tenants are onboarded frequently and slow onboarding caps revenue, the architecture pays for itself quickly.
When the shape doesn't fit¶
- Compliance / regulatory regimes requiring strict per-tenant isolation — regulated financial services, healthcare, and government workloads often need account-level boundaries for audit and blast-radius purposes.
- Services with no per-tenant isolation requirement at the compute layer — stateless APIs backed by a shared database can use row-level isolation at much lower infrastructure overhead.
- Workloads where a single tenant can saturate a whole AWS account's quotas — the shape assumes tenants are small enough that ~50–100 share a cell.
- Teams without platform engineering capacity to run ~100 ECS clusters per cell — cluster-per-tenant is not free operationally; cluster-level operations (deploys, rollbacks, monitoring) must scale with tenant count.
Relationship to neighbouring concepts¶
- concepts/cell-based-architecture — the AWS Well-Architected cell is the top-level scaling unit. A cell in this architecture is an AWS account; cells compose into tiers via Route 53 weighted routing. The hybrid multi-tenant shape is cell-based at the account level + per-tenant-cluster within each cell's infra groups.
- concepts/tenant-isolation — hybrid multi-tenant is one of several shapes on the wiki's isolation-shape spectrum, positioned between ProGlove's account-per-tenant and Convera's in-account multi-layer enforcement.
- concepts/account-per-tenant-isolation — the shape this architecture migrates away from. The post's before-state is the ProGlove-style account-per-tenant cellular architecture ("Supporting only 18 clients across four AWS Regions requires 181 separate targets"); the after-state is the hybrid shape.
- concepts/noisy-neighbor — the failure mode the shape contains, specifically at the in-memory-heap layer. Sixth wiki response-axis on noisy-neighbor: cluster-level isolation in shared accounts, alongside EBS fabric-isolate / S3 smooth+spread / MongoDB eliminate-shared-plane / Netflix eBPF attribution / Airbnb shuffle-sharding.
Caveats¶
- "Hybrid" is an overloaded term. Most industry usage of "hybrid" means on-prem + cloud. In this canonical wiki instance it means hybrid isolation grain (some layers per-tenant, others shared). Page names and tags disambiguate.
- The shape depends on stateful services being loaded from a shared cache, not from per-tenant databases. Shift the data layer to per-tenant DBs and the "shared at the data layer" property breaks down.
- Tier promotion of a tenant involves DNS re-weighting + cache re-warm, not a zero-downtime copy. The post gives the mechanism but not the end-to-end SLO.
- Single-ALB-per-infra-group is a blast-radius concern. One ALB outage affects up to 50 tenants. Production tier design must consider ALB availability engineering.
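The DNS re-weighting mentioned in the tier-promotion caveat can be reasoned about with simple arithmetic: under Route 53 weighted routing, each record receives a share of queries equal to its weight divided by the sum of all weights in the group. The sketch below assumes a stepwise shift from an old tier endpoint to a new one; the record names `old` and `new`, the step count, and the `shift_schedule` helper are hypothetical, and the post does not publish its actual schedule.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Expected share of DNS answers per record: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        # The all-zero case has special handling in Route 53 and is
        # deliberately not modelled in this sketch.
        raise ValueError("at least one weight must be non-zero")
    return {name: w / total for name, w in weights.items()}


def shift_schedule(steps: int, total_weight: int = 100) -> list[dict[str, int]]:
    """Hypothetical linear schedule moving all weight from 'old' to 'new'."""
    schedule = []
    for i in range(steps + 1):
        new = (total_weight * i) // steps
        schedule.append({"old": total_weight - new, "new": new})
    return schedule


schedule = shift_schedule(steps=4)
for weights in schedule:
    print(weights, traffic_share(weights))
```

A schedule like this only describes how DNS answers are split. It says nothing about how long resolvers cache earlier answers or how long the cache re-warm takes, which is why the end-to-end SLO cannot be derived from the mechanism alone.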
Seen in¶
- sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki home for the concept. AWS Architecture Blog post describing the migration of a stateful ad-serving platform (millions of requests per second, billions of dollars per year in revenue) from an account-per-tenant cellular architecture to cluster-level isolation within shared accounts. The defining datapoints are the 52-day → 7-day onboarding improvement, the 80% reduction in infrastructure setup steps, and the capacity of 100 tenants per account. The post discloses the full architecture diagram (Route 53 weighted routing → tier endpoint → ALBs per infra group → per-tenant ECS clusters → shared PrivateLink to downstream services).
Related¶
- concepts/tenant-isolation
- concepts/cluster-level-tenant-isolation
- concepts/in-memory-tenant-state
- concepts/account-per-tenant-isolation
- concepts/cell-based-architecture
- concepts/noisy-neighbor
- concepts/pre-integration-at-tier-creation
- concepts/tenant-onboarding-time
- patterns/hybrid-multi-tenant-architecture
- patterns/tier-cell-infra-group-hierarchy
- patterns/dedicated-ecs-cluster-per-tenant
- patterns/shared-privatelink-at-tier-level
- systems/amazon-ecs
- systems/aws-alb
- systems/amazon-route53
- systems/aws-privatelink