CONCEPT Cited by 7 sources

Noisy neighbor

Definition

Noisy neighbor names the multi-tenant failure mode where one tenant's workload perturbs another tenant's latency/throughput — through shared queues, shared media, shared CPU, or shared network. At scale, noisy-neighbor is the central quality problem: it is what turns a service with a good average into a service with unpredictable tails.

Marc Olson's framing (EBS, 2024)

Early on, we knew that we needed to spread customers across many disks to achieve reasonable performance. This had a benefit, it dropped the peak outlier latency for the hottest workloads, but unfortunately it spread the inconsistent behavior out so that it impacted many customers. When one workload impacts another, we call this a "noisy neighbor." Noisy neighbors turned out to be a critical problem for the business.

Two counterintuitive lessons:

  1. Spreading a hot workload can increase total noisy-neighbor exposure — one tenant's peak outlier shrinks, but its variance is now a shared tax on every tenant on every spindle it touches. This is the math argument for concepts/performance-isolation instead of mere load balancing.
  2. Noisy neighbors don't live in one layer. In EBS they lived in disk-level variance (HDDs), hypervisor queues (Xen ring defaults capping the host at 64 outstanding IOs — see systems/xen), network queues (TCP/general-purpose tuning), and eventually SSD controller behavior. The EBS story is the iterative removal of each layer's noisy-neighbor source.
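
The first lesson can be made concrete with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not EBS's actual placement logic: the names `spread_exposure`, `stripe_width`, and `neighbors_per_disk` are hypothetical, the hot tenant's load is assumed to split evenly across the disks it touches, and each disk is assumed to host a distinct set of neighbors (so the exposure figure is an upper bound).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    disks_touched: int        # number of disks the hot tenant is striped across
    hot_load_per_disk: float  # hot tenant's load landing on each of those disks
    neighbors_exposed: int    # tenants sharing at least one disk with the hot tenant


def spread_exposure(hot_load: float, stripe_width: int, neighbors_per_disk: int) -> Exposure:
    """Toy model: striping divides the per-disk load but multiplies the exposed neighbors.

    Assumes an even split of load and disjoint neighbor sets per disk (upper bound).
    """
    if stripe_width < 1:
        raise ValueError("stripe_width must be at least 1")
    return Exposure(
        disks_touched=stripe_width,
        hot_load_per_disk=hot_load / stripe_width,
        neighbors_exposed=stripe_width * neighbors_per_disk,
    )


if __name__ == "__main__":
    for width in (1, 10):
        print(spread_exposure(hot_load=1000.0, stripe_width=width, neighbors_per_disk=20))
```

Under these assumptions, widening the stripe from 1 to 10 disks cuts the hot tenant's per-disk load tenfold while exposing ten times as many neighbors to its variance, which is the trade the lesson describes.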

Countermeasures (as deployed in EBS)

  • Measure everywhere. Instrument every IO at every layer; run canary workloads continuously. See patterns/full-stack-instrumentation.
  • Isolate queues. patterns/loopback-isolation — replace each layer with a near-zero-latency stub to surface where interference originates.
  • Offload to hardware. systems/nitro cards remove hypervisor queues and stop stealing customer CPU for IO/encryption.
  • Pick transports that avoid ordering queues. systems/srd over TCP — storage IOs don't need strict in-order delivery, so don't pay for it.
  • Change the storage media + custom silicon. Ultimately EBS built systems/aws-nitro-ssd to own variance at the media level too.
  • Hot-swap in place. Rather than a big-bang replacement, retrofit the fleet without customer-visible disruption (see patterns/hot-swap-retrofit).
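
The loopback-isolation idea from the list above can be sketched offline on recorded per-layer latencies. This is an illustrative sketch, not the EBS tooling: the layer names, the `SAMPLES` data, and the helpers `percentile`, `end_to_end`, and `loopback_report` are all hypothetical. Each layer is "stubbed" by treating its latency as zero, and the layer whose removal shrinks the tail the most is the likeliest source of interference.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def end_to_end(layer_samples, stubbed=None):
    """Sum per-request latencies across layers, treating `stubbed` as zero-latency."""
    active = [vals for name, vals in layer_samples.items() if name != stubbed]
    return [sum(per_request) for per_request in zip(*active)]


def loopback_report(layer_samples, pct=99):
    """For each layer, report how much the tail latency drops when it is stubbed out."""
    baseline = percentile(end_to_end(layer_samples), pct)
    return {
        name: baseline - percentile(end_to_end(layer_samples, stubbed=name), pct)
        for name in layer_samples
    }


# Hypothetical per-request latencies in microseconds for 100 requests.
SAMPLES = {
    "hypervisor_queue": [50] * 98 + [5000] * 2,  # two requests stuck behind a neighbor
    "network": [100] * 100,
    "media": [200] * 100,
}

if __name__ == "__main__":
    print(loopback_report(SAMPLES))
```

With this made-up data, stubbing the hypothetical `hypervisor_queue` layer removes almost all of the p99, while stubbing the other layers barely moves it. Averages would hide this: the mean contribution of the queue layer is small even though it owns the tail.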

Seen in

  • sources/2026-05-14-instacart-scaling-personalized-marketing-for-multi-tenant-commerce-platforms — seventh response axis on the wiki's noisy-neighbor catalogue: per-tenant rate-limit budgets in a third-party SaaS API. Instacart's multi-tenant marketing platform depends on a third-party send API whose "requests are rate-limited per retailer", so the noisy-neighbor failure mode that would otherwise emerge at the rate-limit layer (one retailer's burst exhausting a global quota) is structurally prevented by the vendor's per-tenant budget. A new sub-shape emerges instead: within-tenant contention (one retailer's two concurrent campaigns competing for that retailer's own budget). Distinct from the within-host EBS / S3 / Netflix shapes: the contended resource is vendor API quota rather than disk / CPU / network. Rebatching per-user events into groups of 50 (matching the vendor's batch-API max) is the upstream design pattern that amortizes the per-tenant quota across more useful work; see patterns/stream-rebatch-for-downstream-batch-api + concepts/per-tenant-rate-limit. Full response-axis catalog now: (1) EBS fabric-isolate + spread (2) S3 smooth+spread aggregate demand (3) MongoDB Atlas eliminate shared plane (4) Netflix Titus eBPF scheduler-layer attribution (5) Netflix EC2 %steal guest-visibility (6) AWS hybrid-multi-tenant cluster-level-isolation-in-shared-accounts (7) Instacart per-tenant-rate-limit + per-tenant-workspace at the vendor API grain (this).
  • sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — sixth response axis on the wiki's noisy-neighbor catalogue: cluster-level isolation inside shared AWS accounts for stateful services with in-memory tenant state. AWS's ad-serving platform explicitly names the failure mode ("When two tenants share a cluster, their in-memory data competes for the same heap. A tenant with a large dataset can trigger out-of-memory conditions that affect its neighbors.") and responds with a dedicated ECS cluster per tenant inside a shared-account infra group. The architecture is neither purely shared nor purely isolated; the isolation boundary is drawn only where the in-memory-state property demands it (cluster level), leaving VPCs, ALBs, IAM, and PrivateLink endpoints shared. Distinct from the MongoDB-Atlas "eliminate the shared plane" response because the AWS account itself is kept shared; the cluster is the isolation boundary. See concepts/cluster-level-tenant-isolation, concepts/in-memory-tenant-state, and patterns/dedicated-ecs-cluster-per-tenant. Full response-axis catalog now: (1) EBS fabric-isolate + spread (2) S3 smooth+spread aggregate demand (3) MongoDB Atlas eliminate shared plane (4) Netflix Titus eBPF scheduler-layer attribution (5) Netflix EC2 %steal guest-visibility (6) AWS hybrid-multi-tenant cluster-level-isolation-in-shared-accounts (this).
  • sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — canonical case study; the concept names the central quality problem EBS has been solving for 15+ years.
  • sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — the S3 variant of the problem. At EBS scale, spreading a hot tenant widens the blast radius. At S3 scale, concepts/aggregate-demand-smoothing means any one tenant's burst is a negligible fraction of any one drive's load, so patterns/data-placement-spreading + patterns/redundancy-for-heat jointly answer noisy-neighbor. Both systems arrive at the same problem framing from different substrates and solve it at different fleet sizes.
  • sources/2025-09-25-mongodb-carrying-complexity-delivering-agility — MongoDB's structural answer rather than a queue / algorithmic one: dedicated-cluster architectural isolation eliminates co-tenants entirely so there is no noisy-neighbor surface to share. "No 'noisy neighbors' because you have no neighbors. The attack surface shrinks dramatically, and resource contention disappears." Third point on the wiki's noisy-neighbor response axis: EBS → isolate at the fabric / media level; S3 → smooth aggregate demand + spread; MongoDB Atlas → eliminate the shared plane.
  • sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — Netflix's scheduler-layer observability response. Rather than isolating / spreading / eliminating co-tenancy, Netflix instruments the CFS run queue directly with eBPF-based sched_wakeup + sched_switch tracepoints (patterns/scheduler-tracepoint-based-monitoring) and emits a per-container runq.latency percentile timer paired with a preempt-cause-tagged sched.switch.out counter, a dual-metric design that separates cross-cgroup noisy-neighbor from self-CFS-quota throttling (see concepts/cpu-throttling-vs-noisy-neighbor). Fourth response axis on the wiki: EBS → isolate at the fabric / media; S3 → smooth demand + spread; MongoDB Atlas → eliminate the shared plane; Netflix Titus → detect + attribute at the scheduler layer via eBPF, so ops can move the offending cgroup or raise the quota. The Netflix post also supplies the first healthy baseline number on the wiki for the OS-scheduler queueing variant: p99 ≈ 83.4 µs.
  • Customer-side operational view of EBS noisy-neighbor variance at fleet scale. The gp3 SLO formalises the variance floor ("at least 90% of provisioned IOPS 99% of the time", i.e. about 14 min/day of potential degraded operation); customer mitigation is patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation; the structural fix is to abandon the shared fabric via patterns/shared-nothing-storage-topology + systems/planetscale-metal. Sixth response axis on the noisy-neighbor wiki (EBS fabric-isolate / S3 smooth+spread / MongoDB-Atlas eliminate-shared-plane / Netflix-Titus eBPF-attribution / Netflix-EC2 %steal / PlanetScale reparent-and-skip-the-fabric).
  • sources/2025-07-29-netflix-linux-performance-analysis-in-60-seconds — Netflix Performance Engineering's 60-second checklist surfaces the in-guest signature of hypervisor co-tenancy: a non-zero %steal column (st) in vmstat / mpstat / top. When the EC2 hypervisor (or Xen dom0) schedules another workload on your vCPU, cycles are accounted as %steal rather than %idle — the guest sees it is not running even when the CPU looks busy from inside. Fifth response axis on the wiki: EBS → isolate at fabric/media; S3 → smooth + spread; MongoDB Atlas → eliminate shared plane; Netflix Titus → scheduler-layer eBPF attribution; cloud-guest → %steal surface visibility via Linux CPU accounting (no mitigation on its own, but a named signal pivoting investigation to cloud-side capacity or instance family).
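
The %steal signal mentioned in the last entry can also be read without vmstat or mpstat. The following is a minimal sketch, assuming the standard Linux /proc/stat layout in which the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq, and steal in that order; the helper names `parse_cpu_line` and `steal_percent` are hypothetical, and the snapshot strings in the example are made up.

```python
def parse_cpu_line(line):
    """Return the first eight counters (user..steal) from a /proc/stat 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in fields[1:9]]
    if len(values) < 8:
        raise ValueError("kernel does not report a steal column")
    return values


def steal_percent(before, after):
    """Percentage of CPU time accounted as steal between two snapshots.

    guest and guest_nice (fields 9 and 10) are skipped because the kernel
    already counts them inside user and nice.
    """
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical snapshots taken some interval apart.
    t0 = "cpu  100 0 50 800 10 0 5 35 0 0"
    t1 = "cpu  180 0 90 1500 20 0 10 200 0 0"
    print(f"%steal over the interval: {steal_percent(t0, t1):.1f}")
```

On a live Linux guest the two snapshots would come from reading the first line of /proc/stat twice with a sleep in between. As the entry above notes, a sustained non-zero result is a signal rather than a mitigation.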