AWS 2026-05-12 Tier 1

AWS — Building hybrid multi-tenant architecture for stateful services on AWS

One-paragraph summary

AWS Architecture Blog post (2026-05-12) by a team running a stateful ad-serving platform that handles millions of requests per second and generates billions of dollars in annual advertising revenue. The team migrated off a cellular architecture with one AWS account per tenant, which took 52 days per onboarding and ran at 3% CPU and 19% memory utilization with 98% wait, to a hybrid multi-tenant architecture that provides cluster-level tenant isolation within shared AWS accounts. The key structural move is a three-level hierarchy (tier → cell → infra group) in which dependencies (VPC, PrivateLink endpoints, IAM roles, downstream-service connections) are pre-wired at tier creation, not at tenant onboarding. With dependencies pre-integrated, onboarding collapses to a configuration change. Measured results: onboarding 52 days → 7 days (86% reduction), infrastructure setup steps −80%, engineering effort per onboarding −80%, feature release 2–3 days → 1 day, and up to 100 tenants per AWS account with strong cluster-level isolation. Route 53 weighted routing distributes traffic across the ALBs of every infra group in every account; each ALB routes per tenant via path-based listener rules to that tenant's dedicated ECS clusters, which load only that tenant's in-memory state, so one tenant's out-of-memory condition cannot affect its neighbors.

Key takeaways

  1. The stateful-service isolation constraint drove cluster-level, not task-level, grain. The ad-serving platform loads tenant-specific data into memory at startup rather than fetching from a database per request. Verbatim: "When two tenants share a cluster, their in-memory data competes for the same heap. A tenant with a large dataset can trigger out-of-memory conditions that affect its neighbors." This made shared-task and shared-cluster approaches unworkable — the team needed cluster-level isolation as the minimum boundary. See concepts/in-memory-tenant-state and concepts/cluster-level-tenant-isolation.
  2. Per-AWS-account per-tenant cellular architecture had five named failure modes at 18 clients. Scale problem ("181 separate targets" at 18 clients × 4 regions), efficiency problem (3% CPU / 19% memory / 98% wait), onboarding problem (52 days broken down verbatim: 2 weeks AWS account provisioning + 3 weeks VPC/networking + 1 week IAM + 2 weeks downstream integration + testing), scalability problem (no concurrent tier-1 live events), noisy neighbor problem. See concepts/account-per-tenant-isolation — the starting-state shape the team migrated away from.
  3. The three-level hierarchy (tier → cell → infra group) gives two independent scaling levers. "Application Load Balancer target group limits constrain how many tenants fit in a single load balancer. AWS account limits on Elastic Network Interfaces (ENIs) and VPC endpoints constrain how many load balancers fit in a single account. This three-level hierarchy gives you two independent scaling levers to address each constraint — add infra groups to scale within an account and add cells to scale across accounts." See patterns/tier-cell-infra-group-hierarchy.
  4. Pre-wiring dependencies at tier creation (not tenant onboarding) is explicitly named as the primary source of the 80% reduction in infrastructure setup steps. Verbatim: "The key design principle is that we pre-wire downstream service dependencies at tier creation, not at tenant onboarding. AWS PrivateLink connections from the tier VPC to each downstream service VPC are established after the tier is provisioned. After onboarding tenants to that tier, they automatically inherit full downstream connectivity. This single architectural decision is the primary reason for the 80 percent reduction in infrastructure setup steps." See concepts/pre-integration-at-tier-creation and patterns/shared-privatelink-at-tier-level.
  5. Capacity math: 50 tenants per infra group, 100 ECS clusters per infra group, up to 100 tenants per AWS account. The ceiling derives from ALB quotas (100 target groups per load balancer, 5 target groups per listener rule): 20 listener rules × 5 target groups = 100 target groups. At an average of 2 ECS clusters per tenant, each registered as its own target group, that gives 50 tenants and up to 100 ECS clusters per infra group; an individual tenant can run up to 5 clusters. These are first-class design numbers the post surfaces as capacity-planning units.
  6. Weighted DNS routing absorbs both horizontal scale-out levers transparently. Adding an infra group within a cell → new weighted record; adding a new cell (new AWS account) → new weighted record. The tier endpoint (tier-1.us-east-1.example.com) remains stable for tenants throughout growth. Route 53 supports up to 10,000 weighted records per hosted zone — practically unbounded for this use case. See patterns/weighted-dns-traffic-shifting and systems/amazon-route53.
  7. Vertical-then-horizontal scaling rule for a single tenant. When a tenant's traffic grows but the 50-tenant-per-infra-group limit hasn't been reached, "use vertical scaling — it's faster (minutes vs. hours) and doesn't require Route 53 changes." Adding infra groups is reserved for when approaching the tenant-per-LB ceiling or when multiple tenants need capacity simultaneously. Adding cells is reserved for when an AWS account approaches ENI / VPC-endpoint limits, "typically after 3–4 infra groups per cell."
  8. Tenant isolation is enforced at three layers inside the shared account. Routing — ALB listener rules route per tenant identifier; compute — each tenant has a dedicated ECS cluster so resource limits apply per cluster; in-memory state — each ECS cluster is single-tenant, so in-memory data belongs exclusively to that tenant with no shared heap between tenants. See concepts/cluster-level-tenant-isolation.
  9. Observability is structured at two levels with tenant_id as a CloudWatch dimension. Tenant-level metrics (memory usage, target response time, 5XX errors per target group) with alarms at 70% memory (warn) / 85% (critical); TargetResponseTime baseline 100–200 ms for stateful services, alert when 2× baseline for >5 min. Tier-level metrics (ALB ActiveConnectionCount, ProcessedBytes, Route 53 health checks, ECS cluster CPU/memory reservation). Structured CloudWatch Logs with tenant_id, tier_id, region fields in every entry; single log group per tier with tenant-prefixed log streams. See systems/aws-cloudwatch.
  10. The 86% onboarding reduction comes from structural decoupling, not automation. The post is explicit: "onboarding dropped to seven days — primarily testing and validation, because infrastructure is pre-provisioned." Onboarding is a configuration change, not an infrastructure-provisioning exercise. The pattern does not eliminate onboarding work; it pre-runs the expensive parts once per tier rather than once per tenant. See patterns/configuration-driven-tenant-onboarding and concepts/tenant-onboarding-time.
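
The vertical-then-horizontal rule in takeaway 7 can be sketched as a small decision function. This is an illustrative sketch, not code from the post: the `CellState` fields and the function name are hypothetical, and the infra-group count is used as a stand-in for the ENI / VPC-endpoint account limits, which the post names but does not quantify.

```python
from dataclasses import dataclass

TENANTS_PER_INFRA_GROUP = 50   # ceiling stated in the post
INFRA_GROUPS_PER_CELL = 4      # assumption: upper end of the post's "3-4 infra groups per cell"


@dataclass
class CellState:
    tenants_in_group: int          # tenants on the infra group under consideration
    infra_groups_in_cell: int      # infra groups already in this AWS account
    tenants_needing_capacity: int  # tenants asking for more capacity right now


def next_scaling_action(state: CellState) -> str:
    """Return 'vertical', 'add_infra_group', or 'add_cell'."""
    group_full = state.tenants_in_group >= TENANTS_PER_INFRA_GROUP
    several_tenants = state.tenants_needing_capacity > 1
    # One tenant growing on a group with headroom: resize its ECS tasks in place.
    if not group_full and not several_tenants:
        return "vertical"
    # Horizontal scale-out inside the account while the account has headroom.
    if state.infra_groups_in_cell < INFRA_GROUPS_PER_CELL:
        return "add_infra_group"
    # Account is near its limits: open a new cell (new AWS account).
    return "add_cell"
```

For example, `next_scaling_action(CellState(10, 1, 1))` returns `"vertical"`, while a full infra group in an account that already holds four infra groups returns `"add_cell"`.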

Extracted systems

  • systems/amazon-ecs — dedicated ECS cluster per tenant; tenant_id passed as env var; ECS service registered as ALB target group; ECS limit of 5,000 tasks per service applies exclusively to one tenant when cluster is single-tenant.
  • systems/aws-alb — one ALB per infra group; tenant-specific listener rules (path-based /tenant-a/* or HTTP-header-based); hard quotas shape the 50-tenant-per-infra-group capacity ceiling.
  • systems/amazon-route53 — weighted routing distributes traffic across ALBs in multiple accounts; tier endpoint is a single stable DNS name that remains unchanged as the tier grows horizontally.
  • systems/aws-privatelink — VPC interface endpoints established at tier creation for shared downstream-service connectivity; ~$7.30/mo per endpoint + $0.01/GB data transfer; 50 tenants share one endpoint.
  • systems/aws-iam — tier-level IAM roles with permissions to access downstream services; assigned to ECS task definitions at the tier level so new tenants inherit permissions without per-tenant role creation.
  • systems/aws-cloudwatch — per-tenant metric dimensions (memory usage, TargetResponseTime, HTTPCode_Target_5XX_Count per target group); single log group per tier with structured log fields (tenant_id, tier_id, region).
  • Amazon VPC — per-infra-group VPC, the isolation unit for ALB + ECS + VPC endpoints. No dedicated wiki page yet; referenced in this source as a structural component of the infra group.
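
As a hedged illustration of how Route 53 weighted routing keeps one stable tier endpoint in front of many ALBs, the sketch below builds a `ChangeBatch` payload in the shape accepted by the Route 53 `ChangeResourceRecordSets` API. The post does not say whether the team uses alias or CNAME records; weighted CNAME records, the set identifiers, the ALB DNS names, the equal weights, and the 60-second TTL are all assumptions made for the example.

```python
from typing import Dict


def weighted_record(tier_endpoint: str, set_id: str, alb_dns_name: str,
                    weight: int, ttl: int = 60) -> Dict:
    """One UPSERT change for a weighted CNAME pointing the tier name at an ALB."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": tier_endpoint,
            "Type": "CNAME",
            "SetIdentifier": set_id,   # must be unique per weighted record
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": alb_dns_name}],
        },
    }


def build_change_batch(tier_endpoint: str, albs: Dict[str, str],
                       weight: int = 100) -> Dict:
    """albs maps a hypothetical infra-group id to that group's ALB DNS name."""
    changes = [weighted_record(tier_endpoint, set_id, dns, weight)
               for set_id, dns in sorted(albs.items())]
    return {"Comment": "tier weighted routing (illustrative)", "Changes": changes}


batch = build_change_batch(
    "tier-1.us-east-1.example.com",
    {"cell-1-ig-1": "alb-1.example.com", "cell-2-ig-1": "alb-2.example.com"},
)
```

Adding an infra group or a cell then amounts to adding one more entry to `albs`; the tier endpoint name never changes. With boto3, a payload like `batch` could be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with the hosted zone ID.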

Extracted concepts

Extracted patterns

Operational numbers

| Metric | Before (cellular per-account) | After (hybrid tier+cell+infra-group) | Change |
|---|---|---|---|
| Tenant onboarding time | 52 days | 7 days | −86% |
| Infrastructure setup steps per tenant | baseline | −80% | structural |
| Engineering effort per onboarding | baseline | −80% | structural |
| Feature release time | 2–3 days | 1 day | −50% to −67% |
| Tenant capacity per AWS account | 1 per account | up to 100 | 100× |
| CPU utilization (before) | 3% avg | | driver of redesign |
| Memory utilization (before) | 19% avg | | driver of redesign |
| Wait time (before) | 98% | | driver of redesign |
| Targets for 18 clients × 4 regions | 181 | | per-account multiplier |
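
The percentage changes follow from the before/after values by plain arithmetic. A short check (the helper name is ours, not the post's):

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from before to after, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


onboarding = pct_reduction(52, 7)    # 52 days -> 7 days
release_low = pct_reduction(2, 1)    # 2 days -> 1 day
release_high = pct_reduction(3, 1)   # 3 days -> 1 day
print(onboarding, release_low, release_high)  # 86.5 50.0 66.7
```

The onboarding figure matches the post's 86%; the feature-release change spans 50% to roughly 67% depending on whether the starting point was 2 or 3 days.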

Capacity ceilings (from ALB + ECS quotas)

  • Target groups per ALB: 100
  • Target groups per listener rule: 5
  • Tenants per infra group (derived): 20 rules × 5 TGs = 100 target groups; at an average of 2 target groups (ECS clusters) per tenant, 50 tenants
  • ECS clusters per tenant: up to 5
  • ECS clusters per infra group (derived): 50 tenants × 2 avg = up to 100
  • Infra groups per cell before account-limit saturation: typically 3–4
  • Route 53 weighted records per hosted zone: 10,000
  • ECS tasks per service: 5,000 (applies per-tenant because cluster is single-tenant)
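
The derived ceilings can be reproduced from the quotas with a few lines of arithmetic. This is a sketch under one stated assumption: each ECS cluster is registered as its own ALB target group, so the target-group quota is what bounds clusters per infra group. The function names are illustrative.

```python
TARGET_GROUPS_PER_ALB = 100    # ALB quota cited in the post
TARGET_GROUPS_PER_RULE = 5     # ALB quota cited in the post
AVG_CLUSTERS_PER_TENANT = 2    # average used in the post's derivation


def listener_rules_needed() -> int:
    """Rules required to reference every target group on one ALB."""
    return TARGET_GROUPS_PER_ALB // TARGET_GROUPS_PER_RULE


def tenants_per_infra_group(avg_clusters: int = AVG_CLUSTERS_PER_TENANT) -> int:
    """Assumes one target group per ECS cluster (see lead-in)."""
    return TARGET_GROUPS_PER_ALB // avg_clusters


def clusters_per_infra_group(avg_clusters: int = AVG_CLUSTERS_PER_TENANT) -> int:
    return tenants_per_infra_group(avg_clusters) * avg_clusters


print(listener_rules_needed(), tenants_per_infra_group(), clusters_per_infra_group())
# 20 50 100
```

If every tenant ran the maximum of 5 clusters, the same quota would fit only `tenants_per_infra_group(5)`, that is 20, tenants per infra group.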

Cost markers

  • VPC interface endpoint: ~$7.30/month + $0.01/GB data transfer
  • At 50 tenants sharing one endpoint, per-tenant endpoint cost ≈ $0.15/month — negligible vs the engineering-time savings.
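
The per-tenant figure is the fixed endpoint charge divided across the tenants sharing it, plus each tenant's own data transfer. A minimal sketch using the two prices quoted above (the function name and the per-tenant traffic parameter are illustrative):

```python
ENDPOINT_MONTHLY_USD = 7.30   # approximate fixed cost per interface endpoint
DATA_USD_PER_GB = 0.01        # data transfer charge


def per_tenant_endpoint_cost(tenants: int, gb_per_tenant: float = 0.0) -> float:
    """Monthly endpoint cost attributable to one tenant, in USD."""
    fixed_share = ENDPOINT_MONTHLY_USD / tenants
    return fixed_share + gb_per_tenant * DATA_USD_PER_GB


print(round(per_tenant_endpoint_cost(50), 2))  # 0.15
```

The data-transfer term scales with traffic rather than with tenant count, so sharing the endpoint only amortizes the fixed part.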

SLO baselines surfaced

  • Memory-usage alarms: 70% warn, 85% critical
  • TargetResponseTime baseline for stateful services: 100–200 ms
  • Latency alarm: 2× baseline for >5 min
  • 5XX error: per-target-group tracking with tenant attribution
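
These thresholds translate into simple evaluation rules. The sketch below is a plain-Python approximation, not a CloudWatch alarm definition: it assumes one TargetResponseTime datapoint per minute and treats "2× baseline for >5 min" as five consecutive datapoints above the threshold. Function names are hypothetical.

```python
from typing import Sequence


def memory_severity(used_pct: float) -> str:
    """Map a tenant's memory usage to the post's two alarm levels."""
    if used_pct >= 85:
        return "critical"
    if used_pct >= 70:
        return "warn"
    return "ok"


def latency_alarm(samples_ms: Sequence[float], baseline_ms: float,
                  window: int = 5) -> bool:
    """samples_ms: one datapoint per minute, oldest first (an assumption)."""
    threshold = 2 * baseline_ms
    recent = samples_ms[-window:]
    # Fire only when the whole window exists and every point breaches.
    return len(recent) == window and all(s > threshold for s in recent)
```

With a 150 ms baseline, five consecutive minutes above 300 ms trigger the alarm; a single dip back under 300 ms inside the window resets it.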

Caveats

  • The ad-serving platform team is anonymous. The post says "Our infrastructure handles millions of requests per second and generates billions of dollars in annual advertising revenue" and refers to "Our ad-serving platform", but it is published under the AWS Architecture Blog banner without attribution to a specific Amazon team (Amazon Ads? Amazon DSP? Twitch Ads?). Treat the 18-clients and 181-targets numbers as team-internal, not fleet-wide.
  • The 86% and 80% numbers are self-reported, not externally benchmarked. The "before" state includes manual processes that could have been automated separately; the gain is not solely attributable to the architecture change.
  • "Hybrid" in the post title means hybrid-isolation-grain (cluster-level in shared accounts), not hybrid-cloud (on-prem + cloud). The terminology is distinct from the industry usage of "hybrid cloud" or "hybrid storage."
  • AWS account-level hard limits on ENIs and VPC endpoints are named but not quantified in the post. The post defers to AWS service-quotas documentation.
  • The post doesn't disclose the number of tiers, cells, or infra groups currently in production, only capacity ceilings. The 18-clients starting point is the cellular-architecture state, not the current post-migration state.
  • Not all stateful-service workloads map cleanly to this pattern. The architecture assumes the stateful portion is per-tenant in-memory data that can be loaded from a shared remote cache; workloads with shared state across tenants (e.g., consensus protocols, replicated logs) don't fit this shape.
  • The cost of running ~100 ECS clusters per infra group vs one shared cluster is not quantified. Cluster-level isolation is not free at the EC2 / Auto Scaling Group level; the post doesn't disclose per-cluster idle cost.
  • Route 53 weighted routing is TTL-bound. Traffic-shift propagation is not instantaneous; for rapid rebalancing during tier-promotion events the pattern requires additional mechanisms (sticky sessions, client retries) not covered in the post.

Architecture diagram (reproduced from post)

The post's Figure 1 shows:

  • Top: Route 53 weighted DNS routing to tier endpoint (tier-1.us-east-1.example.com).
  • Middle (per cell): One ALB per infra group, with multiple infra groups inside a single AWS account; each ALB has tenant-specific listener rules.
  • Below ALBs: Dedicated ECS clusters per tenant inside the infra-group VPC.
  • Shared at tier level: AWS PrivateLink endpoints connecting the tier VPC to each downstream-service VPC. All tenants in the tier inherit this connectivity.
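
The hierarchy in the figure, and the claim that onboarding reduces to a configuration change, can be modeled in a few dataclasses. This is a sketch of the idea only: every class, field, and function name is hypothetical, and the first-fit placement policy is an assumption, not something the post describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InfraGroup:
    alb_dns_name: str
    tenants: List[str] = field(default_factory=list)
    max_tenants: int = 50          # ceiling stated in the post


@dataclass
class Cell:
    aws_account_id: str
    infra_groups: List[InfraGroup] = field(default_factory=list)


@dataclass
class Tier:
    endpoint: str                  # stable DNS name seen by tenants
    downstream_endpoints: List[str]  # PrivateLink endpoints, wired at tier creation
    cells: List[Cell] = field(default_factory=list)


def onboard_tenant(tier: Tier, tenant_id: str) -> InfraGroup:
    """Place a tenant on the first infra group with room (first-fit, assumed)."""
    for cell in tier.cells:
        for group in cell.infra_groups:
            if len(group.tenants) < group.max_tenants:
                # No networking or IAM work here: the tenant inherits the
                # tier's pre-wired downstream connectivity.
                group.tenants.append(tenant_id)
                return group
    raise RuntimeError("tier is full: add an infra group or a cell first")
```

Nothing in `onboard_tenant` touches `downstream_endpoints`, which is the point of pre-wiring: connectivity is a property of the tier, and a new tenant is only a new entry in an existing infra group.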

Source
