PATTERN Cited by 1 source

Tier / cell / infra-group hierarchy

Pattern

Structure a multi-tenant AWS deployment as three nested levels of scaling hierarchy, each addressing a different AWS quota boundary:

  1. Tier — top-level logical grouping of tenants by traffic / SLA / isolation profile (e.g., High TPS, Standard TPS, Low TPS). A tier owns a stable DNS endpoint, a shared IAM role set, and pre-wired PrivateLink endpoints to downstream services.
  2. Cell — an AWS account inside the tier. A cell is the unit of horizontal scale-out at the account level, bounded by AWS account-level quotas (ENIs, VPC endpoints).
  3. Infra group — a VPC + ALB + ECS-cluster set inside a cell. An infra group is the unit of horizontal scale-out within an account, bounded by ALB-level quotas (target groups per load balancer, listener rules per ALB).
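The nesting above can be sketched as a small data model. This is an illustrative sketch only: the class and field names (`Tier`, `Cell`, `InfraGroup`, `locate`, and so on) are hypothetical and do not come from the canonical source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InfraGroup:
    """One VPC + ALB + ECS-cluster set inside a cell (hypothetical model)."""
    infra_group_id: str
    tenant_ids: list[str] = field(default_factory=list)


@dataclass
class Cell:
    """One AWS account inside a tier."""
    account_id: str
    infra_groups: list[InfraGroup] = field(default_factory=list)


@dataclass
class Tier:
    """Top-level grouping that owns one stable DNS endpoint."""
    name: str
    dns_endpoint: str
    cells: list[Cell] = field(default_factory=list)

    def tenant_count(self) -> int:
        # Tenants live only at the infra-group level; the tier just aggregates.
        return sum(len(g.tenant_ids) for c in self.cells for g in c.infra_groups)

    def locate(self, tenant_id: str) -> tuple[str, str] | None:
        """Return (account_id, infra_group_id) hosting the tenant, if any."""
        for cell in self.cells:
            for group in cell.infra_groups:
                if tenant_id in group.tenant_ids:
                    return cell.account_id, group.infra_group_id
        return None
```

Keeping tenants only on `InfraGroup` mirrors the rule that tenant-specific state does not live at the tier or cell level.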

Canonicalised on the wiki by the 2026-05-12 AWS Architecture Blog post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services). Verbatim rationale:

"As you scale from 10 to 100 to 1,000 tenants, you will reach different AWS limits at different scales. Application Load Balancer target group limits constrain how many tenants fit in a single load balancer. AWS account limits on Elastic Network Interfaces (ENIs) and VPC endpoints constrain how many load balancers fit in a single account. This three-level hierarchy gives you two independent scaling levers to address each constraint — add infra groups to scale within an account and add cells to scale across accounts."

Why three levels, specifically

Each level corresponds to a distinct AWS quota altitude:

| Level | Bounded by | Scaling action |
| --- | --- | --- |
| Tenant count on one ALB | ALB target groups (100) / rules × TGs per rule (20 × 5 = 100) | |
| Infra group | ALB quotas | Add infra group (new VPC + ALB + ECS clusters in same cell) |
| Cell | AWS account quotas (ENIs, VPC endpoints) | Add cell (new AWS account) |

Two-level structures (tenant in cell or tenant in infra group) can scale only one axis at a time and hit the opposite quota prematurely. Three-level structures give two independent scaling levers that can be exercised in arbitrary combinations.

Scaling rules

From the canonical source:

  1. Vertical first for a single tenant — when one tenant's traffic grows and its infra group is still below the 50-tenants-per-infra-group limit, scale that tenant's ECS task resources (CPU, memory) or instance type. "It's faster (minutes vs. hours) and doesn't require Route 53 changes."
  2. Horizontal via infra group — when approaching the 50-tenant-per-infra-group ceiling or when multiple tenants need capacity simultaneously. Add a new infra group (new VPC, ALB, ECS-cluster set) within the same cell. Add a Route 53 weighted record.
  3. Horizontal via cell — when approaching AWS account-level limits (ENIs, VPC endpoints). "Typically after 3–4 infra groups per cell." Provision an identical tier stack in a new AWS account; register its ALBs in Route 53 with weighted records alongside existing cells.

| Trigger | Action | Unit added | Where cost appears |
| --- | --- | --- | --- |
| ALB TG limit (~50 tenants per infra group) | Add infra group | VPC + ALB + ECS clusters | Same AWS account |
| AWS account ENI / VPC-endpoint limits | Add cell | AWS account | New billing account |
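The three rules can be read as a small decision procedure. The sketch below is one possible encoding, not the source's implementation: the function name, the enum, and the "near ceiling" threshold of 45 tenants are assumptions, and the upper end of the "3–4 infra groups per cell" heuristic stands in for a real account-quota check. It also leaves out the "multiple tenants need capacity simultaneously" trigger for brevity.

```python
from enum import Enum


class ScalingAction(Enum):
    NONE = "none"
    SCALE_TENANT_VERTICALLY = "scale_tenant_vertically"
    ADD_INFRA_GROUP = "add_infra_group"
    ADD_CELL = "add_cell"


# Assumption: "approaching" the 50-tenant ceiling means 45 or more tenants.
NEAR_CEILING_TENANTS = 45
# Upper end of the source's "3-4 infra groups per cell" heuristic,
# used here as a stand-in for measuring ENI / VPC-endpoint usage.
MAX_INFRA_GROUPS_PER_CELL = 4


def choose_scaling_action(
    tenants_in_group: int,
    infra_groups_in_cell: int,
    tenant_needs_capacity: bool,
) -> ScalingAction:
    """Pick the cheapest lever that addresses the current constraint."""
    if tenants_in_group < NEAR_CEILING_TENANTS:
        # Rule 1: room left on the ALB, so scale the tenant's tasks in place.
        if tenant_needs_capacity:
            return ScalingAction.SCALE_TENANT_VERTICALLY
        return ScalingAction.NONE
    if infra_groups_in_cell >= MAX_INFRA_GROUPS_PER_CELL:
        # Rule 3: the account is the bottleneck, so another infra group won't help.
        return ScalingAction.ADD_CELL
    # Rule 2: the ALB is the bottleneck and the account still has headroom.
    return ScalingAction.ADD_INFRA_GROUP
```

In practice the cell-level branch would be driven by measured account quota utilisation rather than an infra-group count.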

Capacity math (from the AWS canonical instance)

The specific numbers derived from AWS ALB quotas:

  • 100 target groups per ALB (AWS quota)
  • 5 target groups per listener rule (AWS quota)
  • 20 listener rules per ALB (practical cap in the canonical design)
  • 20 × 5 = 100 target-group capacity per ALB
  • At 2 target groups per tenant on average, ~50 tenants per infra group
  • Up to 5 ECS clusters per tenant, up to 100 ECS clusters per infra group
  • Route 53: 10,000 weighted records per hosted zone (so a tier can grow to thousands of AWS accounts without architectural changes)
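The arithmetic behind the ~50-tenant figure can be written out as a short function. This is a sketch: the function and parameter names are hypothetical, and the defaults are the numbers listed above.

```python
import math


def tenants_per_infra_group(
    listener_rules: int = 20,      # practical cap in the canonical design
    tgs_per_rule: int = 5,         # target groups per listener rule
    alb_tg_quota: int = 100,       # target groups per ALB
    tgs_per_tenant: float = 2.0,   # average target groups per tenant
) -> int:
    """Estimate how many tenants fit behind one ALB (one infra group)."""
    # Usable target groups are capped by both the rule layout and the ALB quota.
    tg_capacity = min(listener_rules * tgs_per_rule, alb_tg_quota)
    return math.floor(tg_capacity / tgs_per_tenant)


print(tenants_per_infra_group())                  # default inputs from the list above
print(tenants_per_infra_group(tgs_per_tenant=4))  # heavier tenants fit fewer per ALB
```

Varying `tgs_per_tenant` shows how tenant heterogeneity moves the real ceiling away from 50.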

What the tier endpoint stays stable through

The Route 53 tier endpoint (e.g., tier-1.us-east-1.example.com) is the single immutable DNS name that tenants' clients resolve. Every infra-group-add and cell-add adds a new weighted record to this endpoint, not a new endpoint. Benefits:

  • Tenants don't update DNS as the tier scales.
  • Tier promotion (moving tenants between tiers) is DNS re-weighting, not re-pointing.
  • Health-check-driven failover works at the tier level ("Evaluate target health" on each weighted record).
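As an illustration of "a new weighted record, not a new endpoint", the helper below builds a Route 53 change batch that adds one ALB to the tier endpoint. The helper name and all argument values are hypothetical placeholders. The dictionary follows the `ChangeBatch` shape accepted by Route 53's `ChangeResourceRecordSets` API, so it could be passed to boto3's `change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`; no AWS call is made here.

```python
def weighted_alias_change(
    tier_endpoint: str,
    set_identifier: str,
    alb_dns_name: str,
    alb_hosted_zone_id: str,
    weight: int,
) -> dict:
    """Build a ChangeBatch that adds one weighted alias record to the tier endpoint."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be in the range 0..255")
    return {
        "Comment": f"Add {set_identifier} to {tier_endpoint}",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": tier_endpoint,             # same name for every record
                    "Type": "A",
                    "SetIdentifier": set_identifier,   # unique per infra group / cell
                    "Weight": weight,
                    "AliasTarget": {
                        "HostedZoneId": alb_hosted_zone_id,  # the ALB's zone, not the tier's
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ],
    }


# Placeholder values for illustration only.
batch = weighted_alias_change(
    tier_endpoint="tier-1.us-east-1.example.com",
    set_identifier="cell-2-ig-1",
    alb_dns_name="ig-1-alb.example.invalid",
    alb_hosted_zone_id="ZEXAMPLEALBZONE",
    weight=10,
)
```

Starting a new record at a low weight and raising it gradually is one way to warm up a new infra group; the source does not prescribe specific weights.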

Three-level hierarchy vs two-level alternatives

| Alternative | Weakness |
| --- | --- |
| Tenant in cell (account-per-tenant) | Hits account-provisioning time as the bottleneck (52 days/tenant); no amortisation |
| Tenant in infra group (flat, no cells) | Saturates AWS account ENI / VPC-endpoint limits at ~3–4 infra groups |
| Tenant in shared cluster (flat, no infra groups) | Cluster-level noisy-neighbor re-enters for stateful-in-memory services |
| Tier + infra group (no cell) | Single-account blast radius, no horizontal AWS-account scaling |
| Tier + cell (no infra group) | Expensive: requires new AWS account to absorb ALB-target-group saturation |

The three-level hierarchy is minimal for handling both ALB-level and account-level AWS quotas independently.

Tier design decisions

  • Tier identity is SLA / traffic profile, not tenant count. High-TPS and Low-TPS tiers may have very different infra-group sizing.
  • Tier-level shared dependencies are the payoff. VPCs, IAM roles, PrivateLink endpoints, observability endpoints are established once per tier, not per tenant. See concepts/pre-integration-at-tier-creation.
  • Tier promotion is a supported workflow. When a tenant's traffic grows beyond the Standard TPS tier's target profile, it's moved to the High TPS tier via DNS re-weighting and cluster warm-up in the new tier.

What does NOT live at each level

  • Tier level: tenant-specific state, tenant-specific permissions, tenant-specific DNS names.
  • Cell level: downstream-service integration (lives at tier), tenant-specific configuration.
  • Infra-group level: cross-tenant shared ECS clusters, shared heap, multi-tenant tasks.

Each level has exactly one role; violations muddle scaling.

Observability implications

The hierarchy creates a natural metric dimension pyramid:

  • Per-tenant — memory, latency, error rate (dimensioned by tenant_id)
  • Per-infra-group — ALB ActiveConnectionCount, ProcessedBytes, ECS cluster reservations
  • Per-cell — account-level quota utilisation
  • Per-tier — Route 53 health checks, aggregate tier QPS

CloudWatch dimensions can carry tenant_id, infra_group_id, cell_id, tier_id in every metric for slicing.
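A minimal sketch of tagging a metric with all four hierarchy dimensions follows. The helper name, metric name, and ID values are assumptions; the dictionary matches the `MetricData` entry shape used by CloudWatch's `PutMetricData` API (for example via boto3's `put_metric_data(Namespace=..., MetricData=[...])`), and no AWS call is made here.

```python
def metric_datum(
    name: str,
    value: float,
    unit: str,
    *,
    tenant_id: str,
    infra_group_id: str,
    cell_id: str,
    tier_id: str,
) -> dict:
    """Build one MetricData entry carrying every level of the hierarchy."""
    dimensions = {
        "tenant_id": tenant_id,
        "infra_group_id": infra_group_id,
        "cell_id": cell_id,
        "tier_id": tier_id,
    }
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Value": value,
        "Unit": unit,
    }


# Hypothetical metric name and IDs, for illustration only.
datum = metric_datum(
    "RequestLatency",
    12.5,
    "Milliseconds",
    tenant_id="tenant-42",
    infra_group_id="ig-1",
    cell_id="cell-2",
    tier_id="tier-1",
)
```

CloudWatch treats each unique combination of dimension values as a separate metric, so a per-tenant dimension multiplies the metric count by the number of tenants.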

Anti-patterns

  • Adding more infra groups when the bottleneck is an account quota. Infra groups don't reduce account-level ENI / VPC-endpoint pressure; add a new cell instead.
  • Adding a new cell when the bottleneck is ALB target groups. A new cell is expensive (new AWS account, new tier stack); adding an infra group inside the existing cell is the right response.
  • Mixing SLAs within a tier. The tier is an SLA boundary; mixed-SLA tenants force infrastructure to be sized for the worst-case, wasting capacity on the best-case.
  • Per-tenant Route 53 records. Ties tenants to specific ALBs; scaling changes require tenant-side DNS edits.
  • Heterogeneous IAM policies across tiers. Forces per-tenant IAM roles and defeats tier-level amortisation.
  • Moving tenants between tiers in place. Tenant promotion should be a new ECS cluster in the target tier + DNS re-weighting + cache warm-up; in-place reconfiguration breaks tier-level shared dependencies.

Caveats

  • Tier sizing is empirical. The 50-tenants-per-infra-group number is derived from ALB quotas × average target-groups-per-tenant. Tenant heterogeneity shifts the real number up or down.
  • A cell is described as saturating at "3–4 infra groups per cell." This is a heuristic in the canonical post, not a hard rule; account-level quotas differ per AWS customer and change over time.
  • The hierarchy assumes stateful tenants with similar downstream dependency sets. Tenants needing different downstream services require separate tiers (new tier-creation cost).
  • Cell-level blast radius is the AWS account. An account compromise / quota lock / region issue affects the whole cell. Cells are not free as blast-radius units; ALBs within a cell can also share fault domains.

Seen in

  • sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki anchor. The AWS ad-serving platform's three-level hierarchy. Each level's AWS-quota rationale explicit; the "two independent scaling levers" framing verbatim; capacity math (50 tenants / 100 ECS clusters / 10,000 Route 53 records) disclosed; vertical-first + horizontal-via-infra-group + horizontal-via-cell scaling rules named.