CONCEPT Cited by 3 sources
Noisy neighbor
Definition
Noisy neighbor names the multi-tenant failure mode where one tenant's workload perturbs another tenant's latency or throughput — through shared queues, shared media, shared CPU, or shared network. At scale, the noisy neighbor is the central quality problem: it is what turns a service with a good average into a service with unpredictable tails.
Marc Olson's framing (EBS, 2024)
> Early on, we knew that we needed to spread customers across many disks to achieve reasonable performance. This had a benefit, it dropped the peak outlier latency for the hottest workloads, but unfortunately it spread the inconsistent behavior out so that it impacted many customers. When one workload impacts another, we call this a "noisy neighbor." Noisy neighbors turned out to be a critical problem for the business.
Two counterintuitive lessons:
- Spreading a hot workload can increase total noisy-neighbor exposure — one tenant's peak outlier shrinks, but its variance is now a shared tax on every tenant on every spindle it touches. This is the math argument for concepts/performance-isolation instead of mere load balancing.
- Noisy neighbors don't live in one layer. In EBS they lived in disk-level variance (HDDs), hypervisor queues (Xen ring defaults capping the host at 64 outstanding IOs — see systems/xen), network queues (TCP/general-purpose tuning), and eventually SSD controller behavior. The EBS story is the iterative removal of each layer's noisy-neighbor source.
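The first lesson can be made concrete with a toy simulation (all load numbers invented, not EBS data). Confine one heavy-tailed tenant to a single disk and only that disk's co-tenants eat the burst; stripe it across every disk and the hot tenant's own outlier shrinks, but every tenant now inherits a fatter tail than the quiet baseline:

```python
import random

random.seed(0)
DISKS, SAMPLES = 8, 10_000

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

def simulate(spread):
    """Latency proxy for each disk = its total load per sample.

    The hot tenant's heavy-tailed load lands on one disk (spread=1)
    or is striped evenly across all disks (spread=DISKS).
    Returns (p99 on disks carrying the hot tenant, p99 on the rest).
    """
    affected, unaffected = [], []
    for _ in range(SAMPLES):
        burst = random.expovariate(1 / 50)                    # hot tenant: heavy tail
        quiet = [random.gauss(10, 2) for _ in range(DISKS)]   # steady background tenants
        for d in range(DISKS):
            load = quiet[d] + (burst / spread if d < spread else 0)
            (affected if d < spread else unaffected).append(load)
    return p99(affected), p99(unaffected) if unaffected else None

print("confined to one disk:", simulate(1))
print("striped across fleet:", simulate(DISKS))
```

The striped p99 sits well below the confined hot disk's p99 — the observation that motivated spreading — but well above the quiet baseline, which is the "shared tax" that makes the case for performance isolation rather than mere load balancing.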
Countermeasures (as deployed in EBS)¶
- Measure everywhere. Instrument every IO at every layer; run canary workloads continuously. See patterns/full-stack-instrumentation.
- Isolate queues. patterns/loopback-isolation — replace each layer with a near-zero-latency stub to surface where interference originates.
- Offload to hardware. systems/nitro cards remove hypervisor queues and stop stealing customer CPU for IO/encryption.
- Pick transports that avoid ordering queues. systems/srd over TCP — storage IOs don't need strict in-order delivery, so don't pay for it.
- Change the storage media + custom silicon. Ultimately EBS built systems/aws-nitro-ssd to own variance at the media level too.
- Hot-swap in place. Rather than a big-bang replacement, retrofit the fleet without customer-visible disruption (see patterns/hot-swap-retrofit).
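The transport point is easiest to see as head-of-line blocking. With in-order delivery, one delayed response holds up every independent IO queued behind it; with out-of-order completion, the damage is confined to the delayed IO itself. A toy model of that difference (illustrative only, not SRD's actual mechanics):

```python
import random

random.seed(1)
N = 1000
sends = [i * 0.1 for i in range(N)]                        # IO i issued at t = i/10
# ~1% of responses hit a slow path (retransmit, congested queue): +5 time units
delays = [5.0 if random.random() < 0.01 else 0.1 for _ in range(N)]
delays[500] = 5.0            # guarantee at least one slow response for the demo
arrivals = [s + d for s, d in zip(sends, delays)]

def latencies(in_order):
    lat, high_water = [], 0.0
    for s, a in zip(sends, arrivals):
        if in_order:
            high_water = max(high_water, a)   # head-of-line blocking: IO i waits
            lat.append(high_water - s)        # for every earlier arrival
        else:
            lat.append(a - s)                 # each IO completes on its own bytes
    return lat

slow = lambda lat: sum(l > 1.0 for l in lat)
print("IOs slower than 1.0: in-order =", slow(latencies(True)),
      " out-of-order =", slow(latencies(False)))
```

One slow packet under in-order delivery smears into dozens of slow completions; under out-of-order completion it stays one slow IO — which is why a transport that does not impose ordering storage never asked for removes a whole class of noisy-neighbor interference.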
Seen in¶
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — canonical case study; the concept names the central quality problem EBS has been solving for 15+ years.
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — the S3 variant of the problem. At EBS scale, spreading a hot tenant widens the blast radius. At S3 scale, concepts/aggregate-demand-smoothing means any one tenant's burst is a negligible fraction of any one drive's load, so patterns/data-placement-spreading + patterns/redundancy-for-heat jointly answer noisy-neighbor. Both systems arrive at the same problem framing from different substrates and solve it at different fleet sizes.
- sources/2025-09-25-mongodb-carrying-complexity-delivering-agility — MongoDB's structural answer rather than a queue / algorithmic one: dedicated-cluster architectural isolation eliminates co-tenants entirely so there is no noisy-neighbor surface to share. "No 'noisy neighbors' because you have no neighbors. The attack surface shrinks dramatically, and resource contention disappears." Third point on the wiki's noisy-neighbor response axis: EBS → isolate at the fabric / media level; S3 → smooth aggregate demand + spread; MongoDB Atlas → eliminate the shared plane.
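The EBS-versus-S3 contrast on spreading is ultimately arithmetic. With invented figures (none of these are AWS's actual fleet numbers), the same tenant burst saturates a narrow stripe but is a rounding error on a wide one:

```python
# Illustrative back-of-envelope only: drive budgets, stripe widths, and
# burst sizes are made up to show the shape of the argument.
drive_iops = 200             # random-IO budget of one spinning drive
tenant_burst = 50_000        # one tenant's sudden demand spike, in IOPS

# EBS-era regime: a volume striped over a handful of disks —
# the burst is concentrated, and each drive is overwhelmed.
ebs_stripe = 20
print(f"narrow stripe: {tenant_burst / ebs_stripe:.0f} IOPS/drive "
      f"({tenant_burst / ebs_stripe / drive_iops:.1f}x a drive's budget)")

# S3-era regime: data sharded across a huge fleet — the same burst,
# divided by the stripe width, is negligible per drive.
s3_stripe = 100_000
print(f"wide stripe:   {tenant_burst / s3_stripe:.2f} IOPS/drive "
      f"({tenant_burst / s3_stripe / drive_iops:.3%} of a drive's budget)")
```

This is why spreading that widens the blast radius at EBS scale becomes aggregate demand smoothing at S3 scale: the cure and the disease are the same mechanism at different denominators.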