
PATTERN Cited by 1 source

Dedicated ECS cluster per tenant

Pattern

Provision one dedicated ECS cluster per tenant inside a shared-account infra group. Each cluster runs only that tenant's workload, loads only that tenant's in-memory state, and is sized + autoscaled for that tenant's traffic. Tenants never share a cluster, a JVM, a Java heap, or an EC2 instance.

Canonicalised on the wiki by the 2026-05-12 AWS Architecture Blog post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services). Verbatim:

"Because each ECS cluster is single-tenant, in-memory data loaded at startup belongs exclusively to that tenant with no shared heap between tenants. ... One tenant's resource consumption can't affect another tenant's cluster."

Problem it solves

Stateful services with in-memory tenant state cannot share a heap across tenants without creating a noisy-neighbor risk — one tenant's large dataset can trigger GC pauses or OOM that affect every other tenant. Task-per-tenant on shared nodes doesn't solve the problem because co-located tasks share kernel page cache and OS-level memory accounting. Account-per-tenant solves it but adds 30–60 day onboarding latency that is often commercially unacceptable.

Dedicated ECS cluster per tenant is the smallest compute-isolation grain that structurally contains in-memory-state blast radius without paying the account-per-tenant onboarding tax.

Shape

  • ECS cluster naming convention encodes tier, cell, infra group, and tenant identity. Canonical: tier-1-cell-1-ig-1-tenant-a. Makes ownership obvious in operations + incident response.
  • Cluster VPC is the shared infra-group VPC. The cluster itself is dedicated to one tenant; the VPC is shared across tenants in the infra group.
  • ECS task definition passes TENANT_ID as an environment variable. The application reads this at startup to scope its data access (loads only that tenant's config + state from the shared remote cache).
  • Task definition env example (from the canonical source):
{
  "containerDefinitions": [{
    "name": "app",
    "image": "your-ecr-image:latest",
    "environment": [
      { "name": "TENANT_ID", "value": "tenant-a" },
      { "name": "CACHE_ENDPOINT", "value": "cache.tier-1.internal" }
    ]
  }]
}
  • EC2 Linux + Networking cluster mode (not Fargate in the canonical instance — the stateful-in-memory workload benefits from predictable EC2 sizing and Auto Scaling Group control).
  • ECS service in the cluster is registered as a target in an ALB target group, per patterns/alb-path-routing-per-tenant.
  • Autoscaling is scoped to the individual ECS service: CPU and memory utilisation metrics drive scale-up / scale-down for that tenant only. No cross-tenant scaling coupling.
  • The ECS tasks-per-service limit (5,000) applies per tenant because each cluster is single-tenant.
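As a rough illustration of the TENANT_ID bullet above, the startup scoping could look something like the following. This is a minimal sketch, not the canonical source's code: `load_tenant_scope`, `MissingTenantError`, and the `tenant/<id>/` key-prefix layout are hypothetical names chosen for illustration; only the `TENANT_ID` and `CACHE_ENDPOINT` variable names come from the task-definition example.

```python
import os


class MissingTenantError(RuntimeError):
    """Raised when the task starts without a tenant identity."""


def load_tenant_scope(env=None):
    """Return the tenant scope this task should load, read from the environment.

    TENANT_ID and CACHE_ENDPOINT mirror the task-definition example;
    the key-prefix layout is an assumed convention, not from the source.
    """
    env = os.environ if env is None else env
    tenant_id = env.get("TENANT_ID", "").strip()
    if not tenant_id:
        # Fail fast: a task with no tenant must never fall back to loading everything.
        raise MissingTenantError("TENANT_ID is not set")
    return {
        "tenant_id": tenant_id,
        "cache_endpoint": env.get("CACHE_ENDPOINT", ""),
        # Hypothetical key layout used to restrict reads from the shared remote cache.
        "key_prefix": f"tenant/{tenant_id}/",
    }
```

Failing at startup when the variable is absent keeps a misconfigured task from loading another tenant's state into its heap.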

Cluster creation (canonical CLI example)

From the post:

aws ecs create-cluster \
  --cluster-name tier-1-cell-1-ig-1-tenant-a \
  --region us-east-1

Cluster creation is a configuration operation, not an infrastructure operation — the cluster itself inherits the infra-group VPC, security groups, and IAM roles. Onboarding a tenant is 1–5 minutes of cluster provisioning + ECS service registration + target-group attachment, not days of VPC engineering.
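The onboarding sequence described above can be sketched as a small planner that only builds the AWS CLI invocations, without running them. This is an assumption-laden sketch: `onboarding_commands`, the service name, the container port 8080, and the desired count of 1 are hypothetical, and the target group is assumed to exist already (its ARN is passed in). Registering EC2 capacity with the cluster is out of scope here.

```python
def onboarding_commands(cluster_name, service_name, task_definition,
                        target_group_arn, container_name="app",
                        container_port=8080, region="us-east-1"):
    """Build the AWS CLI invocations for onboarding one tenant, in order.

    Returns argv lists only; nothing is executed. Port, desired count,
    and service naming are illustrative assumptions.
    """
    create_cluster = [
        "aws", "ecs", "create-cluster",
        "--cluster-name", cluster_name,
        "--region", region,
    ]
    # Attaching the service to the tenant's target group happens at
    # service creation via --load-balancers (CLI shorthand syntax).
    load_balancer = (
        f"targetGroupArn={target_group_arn},"
        f"containerName={container_name},"
        f"containerPort={container_port}"
    )
    create_service = [
        "aws", "ecs", "create-service",
        "--cluster", cluster_name,
        "--service-name", service_name,
        "--task-definition", task_definition,
        "--desired-count", "1",
        "--load-balancers", load_balancer,
        "--region", region,
    ]
    return [create_cluster, create_service]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in order; keeping planning separate from execution makes the per-tenant steps easy to review or dry-run.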

Resource-accounting properties

  • CPU utilisation — per-cluster CloudWatch metric; doesn't aggregate across tenants.
  • Memory utilisation — per-cluster; the primary signal for in-memory-state growth. Alarms at 70% (warn) / 85% (critical).
  • Task-level OOM — contained to one tenant's cluster; doesn't affect neighbors.
  • GC pauses — contained to one tenant's JVM; don't affect neighbors.
  • Scale events — per-cluster; autoscaler operates on one tenant's capacity.
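The memory alarm thresholds above can be expressed as a tiny classifier. This is a sketch under stated assumptions: the function name is hypothetical, and treating each threshold as inclusive (exactly 70% already counts as a warning) is a choice made here, not something the source specifies.

```python
WARN_PCT = 70.0      # warning threshold from the list above
CRITICAL_PCT = 85.0  # critical threshold from the list above


def memory_alarm_level(utilisation_pct):
    """Classify one cluster's memory utilisation as 'ok', 'warn', or 'critical'.

    Thresholds are treated as inclusive, which is an assumption.
    """
    if utilisation_pct < 0.0:
        raise ValueError("utilisation cannot be negative")
    if utilisation_pct >= CRITICAL_PCT:
        return "critical"
    if utilisation_pct >= WARN_PCT:
        return "warn"
    return "ok"
```

Because each cluster is single-tenant, the returned level describes exactly one tenant's in-memory state, with no need to attribute usage across tenants.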

When to use

  • Stateful services with in-memory tenant state. The canonical driver.
  • Noisy-neighbor-sensitive workloads with unpredictable per-tenant resource consumption (e.g., per-tenant ML inference, per-tenant custom indexes).
  • Per-tenant autoscaling requirements. Each tenant can scale independently.
  • Tenant counts in the tens to ~50 per infra group. Beyond that, add a new infra group (see patterns/tier-cell-infra-group-hierarchy).
  • Long-lived tenants where cluster-bring-up cost is amortised over the tenant's lifetime.

When not to use

  • Stateless services — task-per-tenant in a shared cluster is cheaper and sufficient.
  • Very ephemeral tenants (minutes or hours) — cluster provisioning overhead dominates.
  • Small tenants where a cluster is overkill — e.g., a tenant using 5% of one ECS task's capacity. Consider task-per-tenant with per-task resource limits.
  • Workloads requiring cross-tenant coordination — the isolation prevents the coordination.

Anti-patterns

  • Sharing an ECS cluster across multiple tenants. Reintroduces heap-sharing; defeats the isolation property.
  • Task-per-tenant on shared EC2 instances. Tenants can evict each other's pages from kernel cache; shared OS memory accounting can still OOM across tenants.
  • A nominally per-tenant cluster whose single ECS service serves multiple tenants. The service's tasks share a runtime, so heap-sharing returns despite the dedicated cluster.
  • Per-tenant IAM roles created at cluster creation. IAM roles should be tier-level (shared) for the pre-integration pattern to work; per-tenant roles re-add onboarding tax.
  • Per-cluster custom VPC. Forfeits the shared-VPC property of the infra group; inflates networking setup.
  • Sharing an ALB target group across tenant clusters. Defeats per-tenant routing and observability; each tenant should have its own target group.

Observability integration

Per-cluster metrics carry the tenant dimension naturally:

  • Memory usage per ECS service — primary signal for in-memory-state growth per tenant.
  • CPU utilisation per cluster — per-tenant autoscaling trigger.
  • Task-count per service — per-tenant capacity signal.

CloudWatch dimensions include the cluster name, which embeds tier, cell, infra-group, tenant by convention. No application-layer tagging required for infrastructure-level metrics.
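Recovering the tenant dimension from a cluster name might look like the following. This is a minimal sketch that assumes numeric tier, cell, and infra-group indices with the tenant identifier as the remainder of the name, which matches the one canonical example (tier-1-cell-1-ig-1-tenant-a) but is not otherwise confirmed by the source; `ClusterIdentity` and `parse_cluster_name` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumed grammar: tier-<n>-cell-<n>-ig-<n>-<tenant>, tenant being the remainder.
_CLUSTER_NAME = re.compile(
    r"tier-(?P<tier>\d+)-cell-(?P<cell>\d+)-ig-(?P<infra_group>\d+)"
    r"-(?P<tenant>[A-Za-z0-9][A-Za-z0-9-]*)"
)


@dataclass(frozen=True)
class ClusterIdentity:
    tier: int
    cell: int
    infra_group: int
    tenant: str


def parse_cluster_name(name):
    """Split a cluster name into its tier, cell, infra-group, and tenant parts."""
    match = _CLUSTER_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"cluster name does not follow the convention: {name!r}")
    return ClusterIdentity(
        tier=int(match["tier"]),
        cell=int(match["cell"]),
        infra_group=int(match["infra_group"]),
        tenant=match["tenant"],
    )
```

For example, `parse_cluster_name("tier-1-cell-1-ig-1-tenant-a")` yields tier 1, cell 1, infra group 1, and tenant `tenant-a`, which a dashboard or alert router could use to group metrics without application-layer tags.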

Capacity math

From the canonical post:

  • Up to 5 ECS clusters per tenant (design choice for multi-region or multi-environment tenants).
  • Up to 100 ECS clusters per infra group (50 tenants × 2 clusters avg).
  • 5,000 tasks per ECS service (AWS quota). Because each cluster is single-tenant, the quota applies per tenant: a single tenant can scale to 5,000 tasks without sharing the ceiling.
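The limits above can be turned into a simple admission check before provisioning another cluster. This is an illustrative sketch: the function name and the dictionary-of-counts input are hypothetical, and the constants simply restate the figures listed above.

```python
MAX_CLUSTERS_PER_TENANT = 5         # design choice quoted above
MAX_CLUSTERS_PER_INFRA_GROUP = 100  # 50 tenants x 2 clusters on average
MAX_TASKS_PER_SERVICE = 5000        # AWS quota, effectively per tenant


def infra_group_has_room(clusters_by_tenant, tenant, extra_clusters=1):
    """Return True if `tenant` can add `extra_clusters` to this infra group.

    `clusters_by_tenant` maps tenant id -> number of clusters it already has.
    """
    tenant_total = clusters_by_tenant.get(tenant, 0) + extra_clusters
    group_total = sum(clusters_by_tenant.values()) + extra_clusters
    return (tenant_total <= MAX_CLUSTERS_PER_TENANT
            and group_total <= MAX_CLUSTERS_PER_INFRA_GROUP)
```

With 50 tenants at 2 clusters each, the group already holds 100 clusters, so the check fails for any further cluster and the tenant belongs in a new infra group.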

Caveats

  • Cluster-per-tenant has non-trivial idle cost: at least one EC2 instance plus ASG overhead, multiplied by the tenant count. Not quantified in the canonical post; plausibly single-digit dollars per tenant-month for a modestly sized baseline cluster.
  • Cluster operations scale with tenant count. Deploys, rollbacks, observability configuration all need to scale per cluster. Platform-engineering investment is load-bearing.
  • Cross-cluster operations (batch jobs across tenants) need to be explicitly cross-cluster. No shared compute to ambient-run them on.
  • Cluster failure modes (node failures, ASG problems) are per-tenant, but the infra-group VPC + ALB can still fate-share across tenant clusters.
  • Warm-up time — loading tenant state into RAM at cluster startup is a cold-start cost, not an onboarding cost; measured per cluster restart, not per tenant onboarding.

Seen in

  • sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki anchor. AWS ad-serving platform's dedicated-ECS-cluster-per-tenant structure. Naming convention disclosed (tier-1-cell-1-ig-1-tenant-a); task-definition TENANT_ID env var shape disclosed; in-memory-state isolation property explicitly named; 5 ECS clusters per tenant and 5,000-task-per-service-per-tenant ceilings disclosed.