CONCEPT Cited by 3 sources

Noisy neighbor

Definition

A noisy neighbor is the multi-tenant failure mode in which one tenant's workload degrades another tenant's latency or throughput through a shared resource: queues, storage media, CPU, or network. At scale, it is the central quality problem, because it turns a service with a good average into a service with unpredictable tails.

Marc Olson's framing (EBS, 2024)

Early on, we knew that we needed to spread customers across many disks to achieve reasonable performance. This had a benefit, it dropped the peak outlier latency for the hottest workloads, but unfortunately it spread the inconsistent behavior out so that it impacted many customers. When one workload impacts another, we call this a "noisy neighbor." Noisy neighbors turned out to be a critical problem for the business.

Two counterintuitive lessons:

  1. Spreading a hot workload can increase total noisy-neighbor exposure — one tenant's peak outlier shrinks, but its variance is now a shared tax on every tenant on every spindle it touches. This is the math argument for concepts/performance-isolation instead of mere load balancing.
  2. Noisy neighbors don't live in one layer. In EBS they lived in disk-level variance (HDDs), hypervisor queues (Xen ring defaults capping the host at 64 outstanding IOs — see systems/xen), network queues (TCP/general-purpose tuning), and eventually SSD controller behavior. The EBS story is the iterative removal of each layer's noisy-neighbor source.
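
The first lesson can be made concrete with a toy placement model. This is a minimal sketch under invented assumptions, not a description of EBS placement: the fleet layout, the tenant names, the helpers `build_fleet` and `exposure`, and the 8,000 IOPS figure are all hypothetical. It compares a hot workload pinned to one disk with the same workload striped across every disk.

```python
def build_fleet(num_disks, quiet_per_disk):
    """Map each quiet tenant to the single disk it lives on (hypothetical layout)."""
    return {
        f"quiet-{d}-{i}": {d}
        for d in range(num_disks)
        for i in range(quiet_per_disk)
    }

def exposure(fleet, hot_disks, hot_iops):
    """Return the tenants sharing a disk with the hot workload,
    and the share of the hot workload's IOPS that lands on each of its disks."""
    exposed = sorted(t for t, disks in fleet.items() if disks & hot_disks)
    per_disk = hot_iops / len(hot_disks)
    return exposed, per_disk

fleet = build_fleet(num_disks=8, quiet_per_disk=4)

# Hot workload pinned to disk 0 versus striped across all 8 disks.
pinned_exposed, pinned_load = exposure(fleet, {0}, hot_iops=8000)
striped_exposed, striped_load = exposure(fleet, set(range(8)), hot_iops=8000)

print(len(pinned_exposed), pinned_load)    # 4 8000.0
print(len(striped_exposed), striped_load)  # 32 1000.0
```

In this model, striping cuts the per-disk load from the hot workload by 8x, but the number of tenants that share a disk with it grows by the same factor. Whether the smaller per-disk load still causes visible interference depends on how bursty the hot workload is, which the sketch does not model.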

Countermeasures (as deployed in EBS)

  • Measure everywhere. Instrument every IO at every layer; run canary workloads continuously. See patterns/full-stack-instrumentation.
  • Isolate queues. patterns/loopback-isolation — replace each layer with a near-zero-latency stub to surface where interference originates.
  • Offload to hardware. systems/nitro cards remove hypervisor queues and stop stealing customer CPU for IO/encryption.
  • Pick transports that avoid ordering queues. systems/srd over TCP — storage IOs don't need strict in-order delivery, so don't pay for it.
  • Change the storage media and build custom silicon. Ultimately EBS built systems/aws-nitro-ssd to own variance at the media level too.
  • Hot-swap in place. Rather than a big-bang replacement, retrofit the fleet without customer-visible disruption (see patterns/hot-swap-retrofit).
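
The loopback-isolation idea from the list above can be sketched as a small experiment. Everything here is an assumption made for illustration: the three layer names, their latency models, and the helpers `p99` and `stack_p99` are hypothetical, and the numbers are not EBS measurements. The sketch stubs out one layer at a time with a zero-latency replacement and checks which stub removes the most tail latency.

```python
# Hypothetical per-layer latency models, in microseconds, keyed by IO sequence number.
# The numbers are invented for illustration; they are not EBS measurements.
LAYERS = {
    "hypervisor": lambda i: 20,
    "network": lambda i: 500 if i % 200 == 0 else 50,
    "disk": lambda i: 10_000 if i % 20 == 0 else 100,
}

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]

def stack_p99(stubbed=None, num_ios=1000):
    """p99 of end-to-end latency, with one layer optionally replaced by a zero-latency stub."""
    return p99(
        sum(0 if name == stubbed else model(i) for name, model in LAYERS.items())
        for i in range(num_ios)
    )

baseline = stack_p99()
# How much p99 improves when each layer is stubbed out on its own.
improvement = {name: baseline - stack_p99(stubbed=name) for name in LAYERS}
culprit = max(improvement, key=improvement.get)

print(baseline)     # 10070
print(improvement)  # {'hypervisor': 20, 'network': 50, 'disk': 10000}
print(culprit)      # disk
```

In a real system the stub would be a loopback implementation of the layer's interface rather than a lambda, and the comparison would run on measured latencies. The structure of the experiment is the same: hold the workload fixed, remove one layer, and see how far the tail moves.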

Seen in
