CONCEPT Cited by 1 source
Cluster-level tenant isolation¶
Definition¶
Cluster-level tenant isolation names the isolation grain where the compute cluster itself is the tenant boundary: each tenant gets a dedicated scheduler-managed cluster (an ECS cluster, a dedicated Kubernetes cluster, etc.) that hosts only that tenant's workload. Resources inside the cluster (CPU, memory, process space, in-memory heap, network interfaces) are never shared with another tenant.
This sits at a specific grain on the isolation-shape spectrum (see concepts/tenant-isolation):
- Coarser than task-per-tenant — a multi-tenant cluster scheduling one task per tenant still shares the cluster's node pool, autoscaler, kubelet, kernel cgroup accounting.
- Finer than account-per-tenant — each tenant's cluster runs inside a shared AWS account with shared VPC, IAM, and PrivateLink endpoints.
Canonical wiki instance: the ad-serving platform described in the AWS Architecture Blog's 2026-05-12 post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services).
Why this grain exists as a distinct choice¶
The forcing function is in-memory tenant state. When the service loads tenant-specific data into RAM at startup and serves requests from memory:
- Task-per-tenant doesn't work: co-scheduled tasks on the same EC2 instance share kernel page cache and compete for RAM. One tenant's large dataset can push another tenant's pages out of cache or trigger OOM at the instance level.
- Shared-cluster doesn't work: a shared JVM or process serving multiple tenants has a shared heap. One tenant's allocation pressure triggers GC that affects every other tenant's tail latency.
- Account-per-tenant is overkill: running a separate AWS account for each tenant provides isolation at the compute, network, and billing layers, but the onboarding tax is 30–60 days (see concepts/tenant-onboarding-time). Most of that time is spent provisioning infrastructure that could safely be shared.
Cluster-level isolation is the smallest grain that structurally contains the in-memory-state blast radius. It does not reach as far as account-level isolation (network, IAM, billing remain shared), but it reaches exactly as far as the stateful-heap property requires.
The three enforcement layers (from AWS post)¶
Verbatim from the canonical source:
"This architecture enforces tenant isolation at three layers through customer configuration."
- Routing layer — ALB listener rules match tenant identifier (path or HTTP header) and route to the correct target group.
- Compute layer — each tenant has a dedicated ECS cluster; resource limits apply per cluster; cluster-level isolation minimises cross-tenant resource-consumption impact.
- In-memory state layer — "because each ECS cluster is single-tenant, in-memory data loaded at startup belongs exclusively to that tenant with no shared heap between tenants."
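The routing layer can be pictured with a small model. This is only a sketch of header-based matching, not the ALB API: `ListenerRule`, `route`, the `x-tenant-id` header name, and the target-group names are all hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListenerRule:
    """Hypothetical stand-in for one ALB listener rule."""
    tenant_id: str      # value expected in the tenant header
    target_group: str   # target group fronting that tenant's ECS cluster


def route(rules: list[ListenerRule], headers: dict[str, str],
          header_name: str = "x-tenant-id") -> Optional[str]:
    """Return the target group for the request's tenant, or None if no rule matches."""
    tenant = headers.get(header_name)
    for rule in rules:
        if rule.tenant_id == tenant:
            return rule.target_group
    # No match: refuse the request instead of falling through to another tenant.
    return None


rules = [
    ListenerRule("tenant-a", "tg-tenant-a"),
    ListenerRule("tenant-b", "tg-tenant-b"),
]
```

The point of returning `None` on a miss is that an unidentified request never lands on some other tenant's cluster by default.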
What cluster-level isolation buys vs account-per-tenant¶
Both grains eliminate heap-sharing and compute-resource competition. The difference is what else is shared:
| Layer | Account-per-tenant | Cluster-per-tenant in shared account |
|---|---|---|
| In-memory heap | Isolated | Isolated |
| Compute resources (CPU, RAM) | Isolated | Isolated |
| ECS cluster | Isolated | Isolated |
| VPC | Isolated | Shared (per infra group) |
| ALB | Isolated | Shared (per infra group) |
| IAM roles | Isolated | Shared (tier-level) |
| PrivateLink endpoints | Isolated | Shared (tier-level) |
| Billing / cost attribution | Isolated (native Cost Explorer) | Shared; requires tag-based attribution |
| Service control policies | Isolated | Not applicable inside one account |
| Onboarding time | 30–60 days | ~7 days |
| Blast radius of shared-ALB outage | None | All tenants on that ALB (up to 50) |
The trade-off is explicit: cluster-per-tenant gives up account-level boundaries (blast radius, billing clarity, policy independence) in exchange for order-of-magnitude faster onboarding via pre-wired shared dependencies.
What cluster-level isolation buys vs task-level / shared-cluster¶
Both grains share the AWS account. The difference is whether any tenant can interfere with another's runtime behaviour:
| Property | Shared cluster (task-per-tenant) | Dedicated cluster per tenant |
|---|---|---|
| In-memory heap | Shared (JVM-level) or co-located (node-level) | Isolated |
| OOM blast radius | Potentially cluster-wide | One tenant |
| GC pause cross-impact | High | None |
| CPU scheduler competition | Kubelet / cgroup-mediated | Dedicated nodes or isolated ASG |
| Per-tenant auto-scaling | Complex (must respect aggregate) | Simple (per-cluster autoscaling) |
| Per-tenant deployment risk | Rolling update risks other tenants | Isolated per tenant |
| Operational overhead | Lower | Higher (~50–100 clusters per infra group) |
The trade-off is explicit the other way: cluster-per-tenant pays higher operational overhead for zero cross-tenant runtime interference, which is load-bearing for stateful-in-memory services.
Capacity math in the canonical instance¶
From the 2026-05-12 post:
- 100 target groups per ALB (ALB quota)
- 5 target groups per listener rule (ALB quota)
- 20 listener rules per ALB × 5 target groups = 100 capacity
- Tenants per infra group: ~50 (with 2 target groups per tenant average)
- ECS clusters per tenant: up to 5 (varies by tenant size)
- ECS clusters per infra group: up to 100
- ECS tasks per service: 5,000 — applies exclusively per tenant because cluster is single-tenant
The last number is the payoff: an ECS task-per-service limit that would be shared across tenants in a shared-cluster design becomes a per-tenant ceiling under cluster-level isolation.
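The arithmetic behind these figures can be checked in a few lines. The constants are the numbers quoted above; the function names are illustrative, and the even split in `task_ceiling_per_tenant` is an assumption made only to contrast the shared-cluster case, not something the post states.

```python
# Figures quoted from the post.
TARGET_GROUPS_PER_ALB = 100       # ALB quota
TARGET_GROUPS_PER_RULE = 5        # ALB quota
LISTENER_RULES_PER_ALB = 20
AVG_TARGET_GROUPS_PER_TENANT = 2
ECS_TASKS_PER_SERVICE = 5_000


def tenants_per_infra_group(avg_target_groups_per_tenant: int) -> int:
    """Tenants one shared ALB can front before target groups run out."""
    rule_capacity = LISTENER_RULES_PER_ALB * TARGET_GROUPS_PER_RULE
    usable_target_groups = min(rule_capacity, TARGET_GROUPS_PER_ALB)
    return usable_target_groups // avg_target_groups_per_tenant


def task_ceiling_per_tenant(tenants_sharing_service: int) -> int:
    """Assumed even split of the per-service task limit across tenants."""
    return ECS_TASKS_PER_SERVICE // tenants_sharing_service


print(tenants_per_infra_group(AVG_TARGET_GROUPS_PER_TENANT))  # 50
print(task_ceiling_per_tenant(1))                             # 5000 (single-tenant cluster)
```

With one tenant per service the full 5,000-task limit belongs to that tenant; any design that puts several tenants behind one service divides it.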
Observability implication: per-tenant dimensions¶
Cluster-level isolation makes per-tenant metrics cheap:
- Memory usage per ECS service — the most direct signal for in-memory-state growth (warn at 70%, critical at 85%).
- TargetResponseTime per ALB target group — per-tenant latency baseline (stateful services: 100–200 ms typical; alert on 2× baseline for 5 min).
- HTTPCode_Target_5XX_Count per target group — per-tenant error rate, no cross-tenant aggregation to disentangle.
Contrast with shared-cluster: per-tenant dimensions require application-level tagging (tenant_id in every metric emission) and can't natively leverage infrastructure-level metrics.
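The thresholds listed above translate into simple per-tenant checks. The sketch below is not a CloudWatch alarm definition: the function names are hypothetical, the thresholds are treated as inclusive, and "for 5 min" is interpreted as every datapoint in a five-sample window breaching, both of which are assumptions.

```python
def memory_severity(utilization_pct: float) -> str:
    """Classify memory utilisation for one tenant's ECS service."""
    if utilization_pct >= 85:
        return "critical"
    if utilization_pct >= 70:
        return "warn"
    return "ok"


def latency_alert(samples_ms: list[float], baseline_ms: float,
                  factor: float = 2.0) -> bool:
    """True when every sample in the window exceeds factor x baseline.

    samples_ms is assumed to hold one TargetResponseTime datapoint per
    minute for the last five minutes of a single tenant's target group.
    """
    return bool(samples_ms) and all(s > factor * baseline_ms for s in samples_ms)
```

Because each target group and ECS service belongs to exactly one tenant, neither check needs a tenant label on the metric itself.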
Anti-patterns¶
- Cluster-level isolation for stateless services — overkill; row-level or JWT-claim tenant isolation is sufficient.
- Sharing a database across cluster-isolated tenants with no tenant-context enforcement — defeats the isolation property at the data layer. The canonical instance uses a shared remote cache, but scopes access by TENANT_ID env var at task startup.
- Treating a cluster-per-tenant deployment as requiring a cell-per-tenant — conflates two distinct grains. Up to ~50 tenants share an infra group; the cell (AWS account) is a coarser unit.
- Allowing cross-cluster communication within an infra group without audit — re-introduces shared-fate through the "shared" VPC. The canonical design keeps tenants strictly isolated from each other at the application layer.
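For the shared-cache case, one way to enforce tenant context is to bind a client to the tenant at task startup. The source says only that access is scoped by the TENANT_ID env var; the mechanism below (namespacing every key by tenant) is an assumption, and `TenantScopedCache`, `cache_from_env`, and the dict backend are hypothetical stand-ins for a real remote cache client.

```python
import os
from typing import Mapping, Optional


class TenantScopedCache:
    """Wraps a shared backend so every key is namespaced by one tenant."""

    def __init__(self, backend: dict, tenant_id: str) -> None:
        self._backend = backend      # stand-in for the shared remote cache
        self._tenant_id = tenant_id

    def set(self, key: str, value: str) -> None:
        # Tuple keys avoid collisions that string prefixes could allow.
        self._backend[(self._tenant_id, key)] = value

    def get(self, key: str) -> Optional[str]:
        return self._backend.get((self._tenant_id, key))


def cache_from_env(backend: dict,
                   environ: Mapping[str, str] = os.environ) -> TenantScopedCache:
    """Build the client once at startup; fail fast if TENANT_ID is missing."""
    tenant_id = environ.get("TENANT_ID")
    if not tenant_id:
        raise RuntimeError("TENANT_ID must be set before the task starts")
    return TenantScopedCache(backend, tenant_id)
```

Failing at startup when the variable is absent keeps an unscoped client from ever being created, so application code has no path to another tenant's keys.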
Seen in¶
- sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki instance. The AWS ad-serving platform's migration from cellular-per-account to cluster-per-tenant in shared accounts. Explicit three-layer enforcement quote; explicit rationale for why shared-cluster and shared-task approaches were rejected; naming of the ECS-task-per-service 5,000 ceiling as a per-tenant ceiling rather than a shared one.
Related¶
- concepts/tenant-isolation — parent framing
- concepts/hybrid-multi-tenant-architecture — the enclosing architectural shape
- concepts/in-memory-tenant-state — the forcing function
- concepts/noisy-neighbor — the failure mode contained
- concepts/account-per-tenant-isolation — the coarser grain
- concepts/blast-radius
- patterns/dedicated-ecs-cluster-per-tenant — the canonical pattern
- patterns/hybrid-multi-tenant-architecture
- systems/amazon-ecs