PATTERN Cited by 1 source

Hybrid multi-tenant architecture

Pattern

Provide tenant isolation at the compute cluster level inside shared AWS accounts — each tenant gets a dedicated ECS cluster loading only its tenant's in-memory state, while all tenants in a tier share account-level resources (VPC, ALB, IAM roles, PrivateLink endpoints to downstream services). Dependencies are pre-wired at tier creation, not at tenant onboarding, which reduces new-tenant onboarding from months to days while preserving strong per-tenant runtime isolation for stateful services.

Canonicalised on the wiki by the AWS Architecture Blog's 2026-05-12 ad-serving-platform post (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services):

  • Before: per-tenant AWS account with dedicated ALB + ECS. 18 clients × 4 regions = 72 targets; 52-day onboarding; 3% CPU / 19% memory / 98% wait; noisy-neighbor still present when tenants accidentally shared infrastructure.
  • After: tier → cell → infra group three-level hierarchy. Route 53 weighted routing to the tier endpoint; per-tenant ECS clusters within shared-account infra groups; PrivateLink endpoints shared at tier level. 52-day onboarding → 7 days (−86%); 80% fewer infrastructure setup steps per tenant; up to 100 tenants per AWS account.

Problem the pattern solves

Stateful services with in-memory tenant state have a structural isolation requirement:

  • Task-per-tenant doesn't work — co-located tasks share kernel page cache and can OOM each other.
  • Shared-cluster doesn't work — shared heaps mean one tenant's growth triggers GC pauses across tenants.
  • Account-per-tenant works but is slow — 30–60 day onboarding is a business constraint.

The hybrid pattern finds the smallest grain that structurally contains the in-memory-state blast radius (the ECS cluster) and pushes everything else (VPC, IAM, PrivateLink) to shared tier-level infrastructure where it can be amortised across tenants.

Minimum components

  1. Tier — top-level grouping of tenants by traffic / isolation / SLA profile (e.g., High TPS, Standard TPS, Low TPS). A tier owns a shared DNS endpoint, IAM role set, and PrivateLink endpoints to downstream services. See patterns/tier-cell-infra-group-hierarchy.
  2. Cell — an AWS account inside the tier. Cells are the horizontal-scaling unit at account level (add cells to scale beyond per-account AWS quotas).
  3. Infra group — a VPC + ALB + ECS-cluster set inside a cell. Infra groups are the horizontal-scaling unit within a cell (add infra groups to scale beyond per-ALB target-group quotas).
  4. Per-tenant ECS cluster — inside each infra group's VPC; dedicated to one tenant. See patterns/dedicated-ecs-cluster-per-tenant.
  5. ALB with per-tenant listener rules — path-based or header-based routing to tenant target groups. See patterns/alb-path-routing-per-tenant.
  6. Route 53 weighted DNS to tier endpoint — distributes traffic across ALBs (across infra groups, across cells). Tier endpoint stays stable as tier grows. See patterns/weighted-dns-traffic-shifting.
  7. Shared PrivateLink endpoints at tier level — pre-established at tier creation so tenants inherit connectivity automatically. See patterns/shared-privatelink-at-tier-level.
  8. Tier-level IAM roles — assigned to ECS task definitions so tenants inherit downstream-service permissions without per-tenant role creation.
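
The component list above reads as a containment hierarchy: a tier holds cells, a cell holds infra groups, and an infra group holds one dedicated ECS cluster per tenant. A minimal Python sketch of that shape follows. All class, field, and method names are hypothetical and do not come from the post; the 50-tenant default is taken from the capacity math on this page.

```python
from dataclasses import dataclass, field


@dataclass
class InfraGroup:
    """One VPC + ALB; hosts one dedicated ECS cluster per tenant."""
    name: str
    max_tenants: int = 50  # ~50 tenants per infra group (capacity math)
    tenant_clusters: dict[str, str] = field(default_factory=dict)  # tenant_id -> cluster name

    def has_room(self) -> bool:
        return len(self.tenant_clusters) < self.max_tenants


@dataclass
class Cell:
    """One AWS account inside a tier."""
    account_id: str
    infra_groups: list[InfraGroup] = field(default_factory=list)


@dataclass
class Tier:
    """Owns the shared endpoint, IAM roles, and PrivateLink endpoints."""
    name: str
    endpoint: str
    iam_roles: list[str] = field(default_factory=list)
    privatelink_endpoints: list[str] = field(default_factory=list)
    cells: list[Cell] = field(default_factory=list)

    def onboard(self, tenant_id: str) -> str:
        """Place a tenant in the first infra group with room; return its cluster name.

        Only the infra group changes. Tier-level roles and endpoints are
        already in place, which is the pre-wiring idea in miniature.
        """
        for cell in self.cells:
            for group in cell.infra_groups:
                if group.has_room():
                    cluster = f"{self.name}-{tenant_id}"
                    group.tenant_clusters[tenant_id] = cluster
                    return cluster
        raise RuntimeError("tier is full: add an infra group or a cell")
```

In this sketch, onboarding never touches `iam_roles` or `privatelink_endpoints`; a full tier surfaces as an error that maps onto the two horizontal-scaling levers.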

Canonical architecture (AWS 2026-05-12)

                    Internet / upstream callers
                             │  HTTPS
            ┌─────────────────────────────────────┐
            │ Route 53 weighted record            │
            │ tier-1.us-east-1.example.com       │
            └─────────────────────────────────────┘
                 │                        │
        (weight) │                        │ (weight)
                 ▼                        ▼
       ┌───────────────────┐   ┌───────────────────┐
       │  Cell 1           │   │  Cell 2           │
       │  (AWS account A)  │   │  (AWS account B)  │
       │                   │   │                   │
       │  ┌──────────────┐ │   │  ┌──────────────┐ │
       │  │ Infra group 1│ │   │  │ Infra group 1│ │
       │  │   VPC + ALB  │ │   │  │   VPC + ALB  │ │
       │  │   50 tenants │ │   │  │   50 tenants │ │
       │  └──────────────┘ │   │  └──────────────┘ │
       │  ┌──────────────┐ │   │  ┌──────────────┐ │
       │  │ Infra group 2│ │   │  │ Infra group 2│ │
       │  │   VPC + ALB  │ │   │  │   VPC + ALB  │ │
       │  │   50 tenants │ │   │  │   50 tenants │ │
       │  └──────────────┘ │   │  └──────────────┘ │
       │                   │   │                   │
       │  Tier PrivateLink │   │  Tier PrivateLink │
       │  endpoints → DS1  │   │  endpoints → DS1  │
       │            → DS2  │   │            → DS2  │
       └───────────────────┘   └───────────────────┘
             │                        │
             └────────┬───────────────┘
           ┌────────────────────────────┐
           │ Downstream service VPCs    │
           │ (per-tier endpoint service)│
           └────────────────────────────┘

Capacity math

From the AWS canonical instance:

| Quota                            | Value                     | Source                        |
|----------------------------------|---------------------------|-------------------------------|
| ALB target groups                | 100 per LB                | AWS quota                     |
| Target groups per listener rule  | 5                         | AWS quota                     |
| Listener rules (capacity-usable) | 20 per ALB                | derived                       |
| Tenants per infra group          | ~50                       | 20 × 5 / 2 avg TGs per tenant |
| ECS clusters per tenant          | up to 5                   | design choice                 |
| ECS clusters per infra group     | up to 100                 | derived                       |
| Infra groups per cell            | 3–4 before account limits | heuristic                     |
| Route 53 weighted records        | 10,000 per zone           | AWS quota                     |
| ECS tasks per service            | 5,000 (per tenant!)       | AWS quota                     |
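
The ~50-tenant figure follows from the listener-rule and target-group rows. A small Python sketch of that arithmetic is shown here; the constant and function names are illustrative, and the values are copied from the table rather than queried from AWS.

```python
# Values copied from the capacity table; names are illustrative.
LISTENER_RULES_PER_ALB = 20        # capacity-usable rules per ALB
TARGET_GROUPS_PER_RULE = 5         # AWS quota
AVG_TARGET_GROUPS_PER_TENANT = 2   # average assumed by the table


def tenants_per_infra_group(
    rules: int = LISTENER_RULES_PER_ALB,
    tgs_per_rule: int = TARGET_GROUPS_PER_RULE,
    avg_tgs_per_tenant: int = AVG_TARGET_GROUPS_PER_TENANT,
) -> int:
    """Usable target groups on one ALB divided by target groups per tenant."""
    return (rules * tgs_per_rule) // avg_tgs_per_tenant


def infra_groups_needed(tenant_count: int) -> int:
    """Ceiling division of the tenant count by per-group capacity."""
    per_group = tenants_per_infra_group()
    return -(-tenant_count // per_group)


print(tenants_per_infra_group())  # 50
print(infra_groups_needed(120))   # 3
```

If tenants average more target groups than two, the per-group ceiling drops proportionally, so the average is the parameter worth measuring before sizing a tier.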

Scaling rules (from the post)

  1. Vertical first: when a single tenant's traffic grows and its infra group is still below the 50-tenant limit, increase ECS task CPU / memory or switch to larger EC2 instance types. "Minutes vs. hours and doesn't require Route 53 changes."
  2. Add infra group: when approaching the 50-tenant ceiling or when multiple tenants need capacity simultaneously. The new infra group lives in the same cell (AWS account) and gets a new Route 53 weighted record.
  3. Add cell: when approaching AWS account-level limits (ENIs, VPC endpoints). "Typically after 3–4 infra groups per cell." Create a new AWS account with an identical tier stack and register its ALBs in Route 53 with weighted records alongside existing cells.
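
| Trigger                                              | Action          | Unit added               |
|------------------------------------------------------|-----------------|--------------------------|
| ALB target-group limit (~50 tenants per infra group) | Add infra group | VPC + ALB + ECS clusters |
| AWS account-level limits (ENIs, VPC endpoints)       | Add cell        | AWS account              |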
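
The scaling rules can be read as a small decision function. The Python sketch below is one possible encoding, not the post's implementation. The `headroom` fraction used to decide what "approaching" the ceiling means is an assumption, as are all names.

```python
from enum import Enum


class Action(Enum):
    SCALE_VERTICALLY = "increase ECS task CPU / memory or instance size"
    ADD_INFRA_GROUP = "add infra group (VPC + ALB + ECS clusters)"
    ADD_CELL = "add cell (AWS account)"


def next_action(
    tenants_in_group: int,
    tenants_needing_capacity: int,
    near_account_limits: bool,
    group_limit: int = 50,
    headroom: float = 0.9,  # assumption: "approaching" means 90% of the ceiling
) -> Action:
    """Pick the scaling lever suggested by the rules on this page."""
    if near_account_limits:
        # ENIs, VPC endpoints, or similar account-level quotas are close.
        return Action.ADD_CELL
    if tenants_in_group >= group_limit * headroom or tenants_needing_capacity > 1:
        return Action.ADD_INFRA_GROUP
    # One tenant growing inside a group with room: cheapest lever first.
    return Action.SCALE_VERTICALLY
```

For example, `next_action(20, 1, False)` selects vertical scaling, while `next_action(48, 1, False)` selects a new infra group because 48 is past 90% of the 50-tenant ceiling.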

The tier endpoint (tier-1.us-east-1.example.com) remains stable across every growth event. Tenants don't update DNS as the tier scales.
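
One way to picture the stable endpoint is as a set of weighted alias records that all share the tier's name and differ only in their set identifier and target ALB. The Python sketch below only builds the change payload, in the shape expected by the Route 53 `ChangeResourceRecordSets` API (for instance as the `ChangeBatch` argument in boto3); it does not call AWS. The ALB DNS names, hosted-zone ID, and set identifiers are hypothetical placeholders.

```python
def weighted_alias_change(
    tier_endpoint: str,
    set_identifier: str,
    alb_dns_name: str,
    alb_zone_id: str,
    weight: int,
) -> dict:
    """Build one UPSERT for a weighted alias A record pointing at an ALB."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": tier_endpoint,             # same name for every ALB in the tier
            "Type": "A",
            "SetIdentifier": set_identifier,   # distinguishes the weighted records
            "Weight": weight,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,
                "DNSName": alb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def tier_change_batch(tier_endpoint: str, albs: list[tuple[str, str, str, int]]) -> dict:
    """albs: (set_identifier, alb_dns_name, alb_zone_id, weight) per ALB."""
    return {
        "Comment": f"weighted records for {tier_endpoint}",
        "Changes": [weighted_alias_change(tier_endpoint, *alb) for alb in albs],
    }


batch = tier_change_batch(
    "tier-1.us-east-1.example.com",
    [
        # Hypothetical ALB names and zone ID, for illustration only.
        ("cell1-ig1", "cell1-ig1.alb.example.internal", "ZHYPOTHETICAL", 50),
        ("cell2-ig1", "cell2-ig1.alb.example.internal", "ZHYPOTHETICAL", 50),
    ],
)
```

Adding an infra group or a cell then means appending one more tuple; the record name that tenants resolve never changes.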

Observability shape

Metrics sit at two altitudes, each with its own dimensions:

  • Tenant-level — memory usage per ECS service (70% warn / 85% critical), TargetResponseTime per ALB target group (100–200 ms baseline for stateful; alert on 2× for 5 min), HTTPCode_Target_5XX_Count per target group.
  • Tier-level — ALB ActiveConnectionCount, ProcessedBytes, Route 53 health-check status, ECS cluster CPU / memory reservation.
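
The tenant-level thresholds above can be expressed as two small checks. This Python sketch assumes one latency datapoint per minute and reads "2× for 5 min" as five consecutive datapoints at or above twice the baseline; both are interpretations, and the function names are illustrative.

```python
def memory_severity(utilization_pct: float) -> str:
    """Classify ECS service memory usage: 70% warn, 85% critical."""
    if utilization_pct >= 85:
        return "critical"
    if utilization_pct >= 70:
        return "warn"
    return "ok"


def latency_alert(samples_ms: list[float], baseline_ms: float, window: int = 5) -> bool:
    """True when the last `window` samples are all at least 2x the baseline.

    Assumes one TargetResponseTime sample per minute, so window=5
    approximates "2x for 5 min".
    """
    recent = samples_ms[-window:]
    return len(recent) == window and all(s >= 2 * baseline_ms for s in recent)


print(memory_severity(72.0))               # warn
print(latency_alert([310.0] * 5, 150.0))   # True
```

In practice these conditions would more likely live in CloudWatch alarm definitions per target group; the functions only make the thresholds concrete.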

Single CloudWatch log group per tier with structured log fields (tenant_id, tier_id, region) in every entry. Per-tenant log streams via prefix; tenant-aware CloudWatch Logs Insights queries for cross-tenant error-rate analysis.
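
A sketch of what such a log entry and a tenant-aware query might look like follows. The `tenant_id`, `tier_id`, and `region` fields come from the text above; the `level` and `message` fields, the stream-name format, and the function names are assumptions. The query uses standard CloudWatch Logs Insights commands (`fields`, `filter`, `stats`, `sort`) and returns error counts per tenant as a starting point for error-rate analysis.

```python
import json
from datetime import datetime, timezone


def log_entry(tenant_id: str, tier_id: str, region: str, level: str, message: str) -> str:
    """Serialise one structured log line carrying the tenant dimensions."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "tier_id": tier_id,
        "region": region,
        "level": level,      # assumed field, not named in the post
        "message": message,  # assumed field, not named in the post
    })


def log_stream_name(tenant_id: str, task_id: str) -> str:
    """Hypothetical per-tenant stream naming: the tenant ID is the prefix."""
    return f"{tenant_id}/{task_id}"


def errors_by_tenant_query(tier_id: str) -> str:
    """Logs Insights query text: error count per tenant within one tier."""
    return "\n".join([
        "fields @timestamp, tenant_id, message",
        f'| filter tier_id = "{tier_id}" and level = "ERROR"',
        "| stats count(*) as errors by tenant_id",
        "| sort errors desc",
    ])


print(log_entry("acme", "tier-1", "us-east-1", "ERROR", "cache warm-up failed"))
print(errors_by_tenant_query("tier-1"))
```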

Measured payoff (AWS canonical instance)

  • Tenant onboarding time: 52 days → 7 days (−86%)
  • Infrastructure setup steps per tenant: −80%
  • Engineering effort per onboarding: −80%
  • Feature release time: 2–3 days → 1 day
  • Tenant capacity: up to 100 per AWS account with cluster-level isolation
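
The onboarding percentage can be checked from the two day counts. A short Python sketch of the calculation, with an illustrative function name:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Onboarding: 52 days down to 7 days.
print(f"{pct_change(52, 7):.1f}%")  # -86.5%

# Feature release: 2-3 days down to 1 day gives a range.
print(f"{pct_change(2, 1):.1f}% to {pct_change(3, 1):.1f}%")  # -50.0% to -66.7%
```

The onboarding reduction works out to about 86.5%, which this page reports as −86%.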

Load-bearing design decisions

  1. Cluster-level isolation, not task-level. The in-memory-state property makes smaller grains unsafe.
  2. Pre-wire dependencies at tier level. The primary source of the 80% setup-step reduction. See concepts/pre-integration-at-tier-creation.
  3. Weighted DNS at the tier endpoint. Absorbs both horizontal-scaling levers (infra-group and cell) without client changes.
  4. Shared IAM roles at tier level, not per-tenant. New tenants receive tier permissions automatically; eliminates per-tenant IAM setup.
  5. Tier-based SLA segmentation. High / Standard / Low TPS tiers absorb SLA heterogeneity without per-tenant custom infrastructure.

When to use

  • Stateful services with in-memory tenant state at throughput tiers where fetch-on-request is infeasible (millions of requests per second).
  • Tens to low-thousands of tenants where account-per-tenant onboarding tax is unacceptable but task-per-tenant isolation is too weak.
  • Moderate SLA heterogeneity across tenants — tier structure absorbs this without per-tenant customisation.
  • Onboarding time is a business constraint (concurrent customer events, revenue-per-onboarded-customer, contract SLAs).

When not to use

  • Compliance / regulatory isolation requirements — account-per-tenant (or cross-AWS-organization) may be mandatory.
  • Stateless services — overkill; JWT-claim or row-level tenant isolation is sufficient.
  • Very large tenants (one tenant saturates a whole AWS account's quotas) — needs account-per-tenant.
  • Small tenant counts (<10) — account-per-tenant onboarding overhead amortises acceptably.
  • Teams without platform-engineering capacity — running 100 ECS clusters per infra group demands discipline in deploy automation, monitoring, and rollback.

Anti-patterns

  • Per-tenant PrivateLink / IAM role setup at onboarding — defeats the pre-integration payoff.
  • Shared ECS cluster across tenants in a single infra group — re-introduces heap-sharing and noisy-neighbor exposure.
  • Per-tenant Route 53 records — ties tenants to specific ALBs, preventing transparent horizontal scaling.
  • Heterogeneous IAM policies across tenants in the same tier — forces per-tenant IAM setup, eliminating the tier-level amortisation.
  • Tier-promotion by in-place reconfiguration — tenants should be moved to a different tier via DNS re-weighting and cache re-warm, not by rewriting tier-level shared dependencies.

Caveats

  • ALB-level blast radius. One ALB outage affects up to 50 tenants in the infra group. Production tiers must invest in ALB availability engineering (e.g., split infra groups across AZs; Route 53 health-check-driven failover between cells).
  • Pre-integration forfeits per-tenant dependency flexibility. Tenants needing a different downstream service must be promoted to a different tier rather than customised in place.
  • Shared remote cache is the data-layer assumption. If the stateful service uses per-tenant databases, the data-layer sharing property breaks down and the pattern's cost model changes.
  • Route 53 TTL-bounded rebalancing. Traffic shifts across cells propagate at DNS-TTL speed; rapid rebalancing during outages needs supplementary mechanisms.
  • Capacity math assumes near-uniform tenants. Highly unequal tenants need explicit tenant-to-cell assignment rather than pure weighted-round-robin.
  • The post doesn't disclose cost per ECS cluster vs one shared cluster. Cluster-per-tenant has non-trivial EC2 / Auto Scaling Group idle cost the post doesn't quantify.
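
For the unequal-tenant caveat, explicit assignment could be as simple as a greedy placement: sort tenants by expected load and put each one in the least-loaded infra group that still has a free slot. The Python sketch below illustrates that idea. It is not described in the post, and the load figures and names are hypothetical.

```python
def assign_tenants(
    tenant_load: dict[str, float],
    groups: list[str],
    max_per_group: int = 50,
) -> dict[str, list[str]]:
    """Greedy largest-first placement of tenants onto infra groups."""
    placement: dict[str, list[str]] = {g: [] for g in groups}
    load: dict[str, float] = {g: 0.0 for g in groups}

    # Heaviest tenants first, so they spread out before small ones fill the gaps.
    for tenant, tps in sorted(tenant_load.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [g for g in groups if len(placement[g]) < max_per_group]
        if not candidates:
            raise RuntimeError("no free tenant slots: add an infra group or a cell")
        target = min(candidates, key=lambda g: load[g])
        placement[target].append(tenant)
        load[target] += tps
    return placement


# Hypothetical per-tenant load figures.
print(assign_tenants({"a": 100, "b": 60, "c": 50, "d": 10}, ["ig-1", "ig-2"]))
# {'ig-1': ['a', 'd'], 'ig-2': ['b', 'c']}
```

With the placement fixed, Route 53 weights can then be set in proportion to each group's total load instead of relying on equal weights.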
