CONCEPT Cited by 1 source

Cluster-level tenant isolation¶

Definition¶

Cluster-level tenant isolation is the isolation grain at which the compute cluster itself is the tenant boundary: each tenant gets a dedicated scheduler-managed cluster (an ECS cluster, a dedicated Kubernetes cluster, etc.) that hosts only that tenant's workload. Resources inside the cluster (CPU, memory, process space, in-memory heap, network interfaces) are never shared with another tenant.

This sits at a specific grain on the isolation-shape spectrum (see concepts/tenant-isolation):

Coarser than task-per-tenant: a multi-tenant cluster that schedules one task per tenant still shares the cluster's node pool, autoscaler, kubelet, and kernel cgroup accounting.
Finer than account-per-tenant: each tenant's cluster runs inside a shared AWS account with a shared VPC, shared IAM roles, and shared PrivateLink endpoints.

Canonical wiki instance: AWS Architecture Blog's 2026-05-12 ad-serving platform (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services).

Why this grain exists as a distinct choice¶

The forcing function is in-memory tenant state. When the service loads tenant-specific data into RAM at startup and serves requests from memory:

Task-per-tenant doesn't work: co-scheduled tasks on the same EC2 instance share kernel page cache and compete for RAM. One tenant's large dataset can push another tenant's pages out of cache or trigger OOM at the instance level.
Shared-cluster doesn't work: a shared JVM or process serving multiple tenants has a shared heap. One tenant's allocation pressure triggers GC that affects every other tenant's tail latency.
Account-per-tenant is overkill: running a separate AWS account for each tenant provides isolation at the compute, network, and billing layers, but the onboarding tax is 30–60 days (see concepts/tenant-onboarding-time). Most of that time is spent on infrastructure that could safely be shared.

Cluster-level isolation is the smallest grain that structurally contains the in-memory-state blast radius. It does not reach as far as account-level isolation (network, IAM, billing remain shared), but it reaches exactly as far as the stateful-heap property requires.

The three enforcement layers (from the AWS post)¶

Verbatim from the canonical source:

"This architecture enforces tenant isolation at three layers through customer configuration."

Routing layer — ALB listener rules match tenant identifier (path or HTTP header) and route to the correct target group.
Compute layer — each tenant has a dedicated ECS cluster; resource limits apply per cluster; cluster-level isolation minimises cross-tenant resource-consumption impact.
In-memory state layer — "because each ECS cluster is single-tenant, in-memory data loaded at startup belongs exclusively to that tenant with no shared heap between tenants."
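
The routing layer can be pictured as a lookup from tenant identifier to target group. The sketch below is a plain-Python model of that matching logic, not ALB configuration. The header name `X-Tenant-Id`, the `/t/<tenant>/` path prefix, and the target-group names are hypothetical; the source says only that rules match on path or HTTP header.

```python
from typing import Mapping, Optional

# Hypothetical conventions: the source only says "path or HTTP header".
TENANT_HEADER = "x-tenant-id"
PATH_PREFIX = "/t/"


def extract_tenant(path: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the tenant identifier from the header, falling back to the path."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get(TENANT_HEADER):
        return lowered[TENANT_HEADER]
    if path.startswith(PATH_PREFIX):
        # "/t/acme/ads" -> "acme"
        segment = path[len(PATH_PREFIX):].split("/", 1)[0]
        return segment or None
    return None


def route(path: str, headers: Mapping[str, str],
          rules: Mapping[str, str]) -> Optional[str]:
    """Map a request to its tenant's target group, or None if nothing matches."""
    tenant = extract_tenant(path, headers)
    if tenant is None:
        return None
    return rules.get(tenant)
```

With `rules = {"acme": "tg-acme"}`, a request for `/t/acme/ads` resolves to `tg-acme`, while a request carrying no tenant identifier resolves to nothing, which mirrors a listener with no matching rule.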

What cluster-level isolation buys vs account-per-tenant¶

Both grains eliminate heap-sharing and compute-resource competition. The difference is what else is shared:

Layer	Account-per-tenant	Cluster-per-tenant in shared account
In-memory heap	Isolated	Isolated
Compute resources (CPU, RAM)	Isolated	Isolated
ECS cluster	Isolated	Isolated
VPC	Isolated	Shared (per infra group)
ALB	Isolated	Shared (per infra group)
IAM roles	Isolated	Shared (tier-level)
PrivateLink endpoints	Isolated	Shared (tier-level)
Billing / cost attribution	Isolated (native Cost Explorer)	Shared; requires tag-based attribution
Service control policies	Isolated	Not applicable inside one account
Onboarding time	30–60 days	~7 days
Blast radius of shared-ALB outage	None	All tenants on that ALB (up to 50)

The trade-off is explicit: cluster-per-tenant gives up account-level boundaries (blast radius, billing clarity, policy independence) in exchange for roughly 4–9× faster onboarding (30–60 days down to ~7 days) via pre-wired shared dependencies.

What cluster-level isolation buys vs task-level / shared-cluster¶

Both grains share the AWS account. The difference is whether any tenant can interfere with another's runtime behaviour:

Property	Shared cluster (task-per-tenant)	Dedicated cluster per tenant
In-memory heap	Shared (JVM-level) or co-located (node-level)	Isolated
OOM blast radius	Potentially cluster-wide	One tenant
GC pause cross-impact	High	None
CPU scheduler competition	Kubelet / cgroup-mediated	Dedicated nodes or isolated ASG
Per-tenant auto-scaling	Complex (must respect aggregate)	Simple (per-cluster autoscaling)
Per-tenant deployment risk	Rolling update risks other tenants	Isolated per tenant
Operational overhead	Lower	Higher (~50–100 clusters per infra group)

Here the trade-off runs the other way: cluster-per-tenant pays higher operational overhead in exchange for zero cross-tenant runtime interference, which stateful in-memory services depend on.

Capacity math in the canonical instance¶

From the 2026-05-12 post:

100 target groups per ALB (ALB quota)
5 target groups per listener rule (ALB quota)
20 listener rules per ALB × 5 target groups per rule = 100 target groups of capacity
Tenants per infra group: ~50 (at an average of 2 target groups per tenant)
ECS clusters per tenant: up to 5 (varies by tenant size)
ECS clusters per infra group: up to 100
ECS tasks per service: 5,000 (applies exclusively per tenant because the cluster is single-tenant)

The last number is the payoff: an ECS task-per-service limit that would be shared across tenants in a shared-cluster design becomes a per-tenant ceiling under cluster-level isolation.
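
The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the figures quoted from the post as defaults; the class and method names are illustrative and not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraGroupCapacity:
    """Capacity inputs for one infra group (defaults are the figures quoted above)."""
    listener_rules_per_alb: int = 20
    target_groups_per_rule: int = 5
    target_groups_per_alb_quota: int = 100
    avg_target_groups_per_tenant: int = 2

    def target_group_capacity(self) -> int:
        # Rules x target groups per rule, capped by the per-ALB quota.
        usable = self.listener_rules_per_alb * self.target_groups_per_rule
        return min(usable, self.target_groups_per_alb_quota)

    def tenants_per_infra_group(self) -> int:
        # Each tenant consumes some number of target groups on the shared ALB.
        return self.target_group_capacity() // self.avg_target_groups_per_tenant
```

`InfraGroupCapacity().tenants_per_infra_group()` gives 50, matching the ~50 tenants per infra group above. Raising the per-tenant average to 4 target groups would halve that to 25.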

Observability implication: per-tenant dimensions¶

Cluster-level isolation makes per-tenant metrics cheap:

Memory usage per ECS service: the primary signal for in-memory-state growth (warn at 70%, critical at 85%).
TargetResponseTime per ALB target group: the per-tenant latency baseline (stateful services: 100–200 ms typical; alert when latency stays above 2× baseline for 5 min).
HTTPCode_Target_5XX_Count per target group: the per-tenant error rate, with no cross-tenant aggregation to disentangle.

Contrast with shared-cluster: per-tenant dimensions require application-level tagging (tenant_id in every metric emission) and can't natively leverage infrastructure-level metrics.
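
The thresholds above can be expressed as two small checks. This is a sketch under stated assumptions, not the alarm configuration from the source: it treats the thresholds as inclusive, assumes one latency sample per minute, and uses hypothetical function names.

```python
from typing import Sequence


def memory_severity(utilization_pct: float) -> str:
    """Classify ECS service memory utilization (70% warn, 85% critical)."""
    if utilization_pct >= 85:
        return "critical"
    if utilization_pct >= 70:
        return "warn"
    return "ok"


def latency_breach(samples_ms: Sequence[float], baseline_ms: float,
                   window: int = 5, factor: float = 2.0) -> bool:
    """True when the last `window` samples all exceed factor x baseline.

    Assumes one sample per minute, so window=5 models "for 5 min".
    """
    if len(samples_ms) < window:
        return False
    recent = samples_ms[-window:]
    return all(sample > factor * baseline_ms for sample in recent)
```

Because each cluster and target group serves a single tenant, the inputs to these checks can come straight from infrastructure-level metrics with no `tenant_id` filtering step.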

Anti-patterns¶

Cluster-level isolation for stateless services — overkill; row-level or JWT-claim tenant isolation is sufficient.
Sharing a database across cluster-isolated tenants with no tenant-context enforcement — defeats the isolation property at the data layer. The canonical instance uses a shared remote cache, but scopes access by TENANT_ID env var at task startup.
Treating a cluster-per-tenant deployment as requiring a cell-per-tenant — conflates two distinct grains. Up to ~50 tenants share an infra group; the cell (AWS account) is a coarser unit.
Allowing cross-cluster communication within an infra group without audit — re-introduces shared-fate through the "shared" VPC. The canonical design keeps tenants strictly isolated from each other at the application layer.
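
One way to picture tenant-context enforcement on a shared cache is a thin wrapper that binds every key to the tenant read from `TENANT_ID` at startup. The source states only that access is scoped by that environment variable; the key-prefix scheme, the class name, and the dict-like backend below are assumptions made for illustration.

```python
import os
from typing import Mapping, MutableMapping, Optional


class TenantScopedCache:
    """Wraps a shared key-value backend so all keys carry one tenant's prefix."""

    def __init__(self, backend: MutableMapping[str, bytes], tenant_id: str) -> None:
        if not tenant_id:
            # Fail fast at startup rather than serve with an unscoped cache.
            raise ValueError("TENANT_ID must be set")
        self._backend = backend
        self._prefix = f"tenant:{tenant_id}:"  # hypothetical key scheme

    @classmethod
    def from_env(cls, backend: MutableMapping[str, bytes],
                 environ: Mapping[str, str] = os.environ) -> "TenantScopedCache":
        """Read the tenant once, at task startup, from the environment."""
        return cls(backend, environ.get("TENANT_ID", ""))

    def get(self, key: str) -> Optional[bytes]:
        return self._backend.get(self._prefix + key)

    def set(self, key: str, value: bytes) -> None:
        self._backend[self._prefix + key] = value
```

Two tasks started with different `TENANT_ID` values and pointed at the same backend cannot read each other's entries through this wrapper. A key prefix is a convention rather than a hard boundary, so a real deployment would likely pair it with access controls on the cache itself.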

Seen in¶

sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki instance. The AWS ad-serving platform's migration from cellular-per-account to cluster-per-tenant in shared accounts. Explicit three-layer enforcement quote; explicit rationale for why shared-cluster and shared-task approaches were rejected; naming of the ECS-task-per-service 5,000 ceiling as a per-tenant ceiling rather than a shared one.

concepts/tenant-isolation — parent framing
concepts/hybrid-multi-tenant-architecture — the enclosing architectural shape
concepts/in-memory-tenant-state — the forcing function
concepts/noisy-neighbor — the failure mode contained
concepts/account-per-tenant-isolation — the coarser grain
concepts/blast-radius
patterns/dedicated-ecs-cluster-per-tenant — the canonical pattern
patterns/hybrid-multi-tenant-architecture
systems/amazon-ecs