Netflix — Noisy Neighbor Detection with eBPF¶
Summary¶
Netflix describes a per-container run-queue latency monitor built on eBPF and attached to two Linux scheduler tracepoints (`sched_wakeup` and `sched_switch`). The in-kernel program timestamps a task when it becomes runnable, subtracts that timestamp when the task is dispatched onto a CPU, and emits `(cgroup_id, runq_lat, prev_cgroup_id)` tuples through an eBPF ring-buffer map to a Go userspace agent, which emits two Atlas metrics per container: a percentile timer `runq.latency` and a preemption counter `sched.switch.out`. Accessing the task's cgroup ID requires entering an RCU read-side critical section from BPF via the `bpf_rcu_read_lock` / `bpf_rcu_read_unlock` kfuncs, a capability unique to eBPF; the team notes that "while implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." To keep overhead bounded on hosts that schedule millions of events per second, events are dropped in-kernel by a per-cgroup-per-CPU rate limiter (`ratelimit` map) before the ring-buffer reserve call. The headline operational insight is that `runq.latency` alone is ambiguous — a container throttled by its own cgroup CPU quota also shows elevated run-queue latency — so Netflix pairs it with `sched.switch.out` tagged by the preempting cgroup (dual-metric disambiguation); simultaneous spikes in both, attributed to a different container or system service, are the load-bearing noisy-neighbor signal. A baseline on an underloaded host is given (p99 ≈ 83.4 µs, with occasional spikes to 400 µs).
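The measure-at-wakeup, subtract-at-dispatch pipeline can be sketched as a userspace Go simulation (the real implementation is an eBPF C program attached to the tracepoints; all type and method names here are illustrative, not Netflix's):

```go
package main

import "fmt"

// Event mirrors the tuple the eBPF program submits to the ring buffer
// (field names are illustrative, not Netflix's actual struct).
type Event struct {
	PrevCgroupID uint64 // cgroup of the task being switched out
	CgroupID     uint64 // cgroup of the task being dispatched
	RunqLatNs    uint64 // now minus the enqueue timestamp
	TsNs         uint64
}

// RunqMonitor simulates the two-tracepoint pattern: sched_wakeup stores an
// enqueue timestamp keyed by PID; sched_switch looks it up, subtracts, emits.
type RunqMonitor struct {
	enqueued map[uint32]uint64 // pid -> enqueue ts (BPF_MAP_TYPE_HASH analogue)
}

func NewRunqMonitor() *RunqMonitor {
	return &RunqMonitor{enqueued: make(map[uint32]uint64)}
}

// OnWakeup models tp_btf/sched_wakeup(_new): record when the task became runnable.
func (m *RunqMonitor) OnWakeup(pid uint32, nowNs uint64) {
	m.enqueued[pid] = nowNs
}

// OnSwitch models tp_btf/sched_switch: compute run-queue latency for the task
// being dispatched. Returns false when no wakeup was recorded for that PID.
func (m *RunqMonitor) OnSwitch(nextPid uint32, prevCgroup, nextCgroup, nowNs uint64) (Event, bool) {
	ts, ok := m.enqueued[nextPid]
	if !ok {
		return Event{}, false
	}
	delete(m.enqueued, nextPid) // the entry is consumed on dispatch
	return Event{
		PrevCgroupID: prevCgroup,
		CgroupID:     nextCgroup,
		RunqLatNs:    nowNs - ts,
		TsNs:         nowNs,
	}, true
}

func main() {
	m := NewRunqMonitor()
	m.OnWakeup(42, 1_000)
	ev, ok := m.OnSwitch(42, 7, 9, 84_400) // dispatched 83.4 µs after wakeup
	fmt.Println(ok, ev.RunqLatNs)          // true 83400
}
```

The in-kernel version keys the hash map the same way (PID) and does the same subtraction, but runs inside the scheduler path with verifier-bounded cost.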
Key takeaways¶
- Run-queue latency is the right observability primitive for CFS-scheduled noisy neighbors. "Runqueue latency, a key indicator of CPU contention, is the time tasks spend in the scheduler's queue waiting for CPU time before being dispatched for execution. Prolonged runqueue latency signifies that processes waiting for CPU time are experiencing delays, which can significantly impact the performance of services running on the host." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf.) The measurement lives at the scheduler-queue layer, not at the syscall/app layer — exactly where co-tenant contention is.
- Two scheduler tracepoints are enough. `sched_wakeup` + `sched_wakeup_new` record the enqueue time keyed by PID into a `BPF_MAP_TYPE_HASH`; `sched_switch` looks the PID up on dispatch, subtracts, and emits. This is the canonical patterns/scheduler-tracepoint-based-monitoring shape — two hooks that bracket a kernel state transition, a hash map keyed by the task identifier, a subtraction at dispatch.
- `tp_btf` tracepoints give pointers to the full `task_struct`. "One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads." Unlike raw tracepoints, BTF-typed tracepoints let the BPF program navigate `struct task_struct` directly with compile-time type information — no tedious per-kernel offset plumbing.
- Reading `task->cgroups->dfl_cgrp->kn->id` requires BPF RCU kfuncs. "The cgroup information in the process struct is safeguarded by an RCU (Read Copy Update) lock. To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs." Netflix wraps the deref in `bpf_rcu_read_lock()` / `bpf_rcu_read_unlock()` — a capability added specifically so BPF programs can safely walk RCU-protected kernel structures. See concepts/cgroup-id + concepts/ebpf-verifier.
- Ring buffer is the right user/kernel transport — but raw event rates blow it up. "We chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data." The ring buffer is not the bottleneck at steady state — the userspace consumer is, because it emits metrics per event.
- Rate-limit in-kernel, per-cgroup, per-CPU — before `ringbuf_reserve`. The Netflix design stores `last_ts` per `(cgroup_id, cpu)` in a `BPF_MAP_TYPE_PERCPU_HASH`; a new event for a given cgroup is dropped if `now - last_ts < RATE_LIMIT_NS`. Critically, the check runs before `bpf_ringbuf_reserve`, so a hot cgroup doesn't even allocate ring-buffer bytes it will not use. This is the canonical shape of patterns/per-cgroup-rate-limiting-in-ebpf — drop at the earliest point where information is sufficient to decide.
- `runq.latency` alone is the wrong signal. "If a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota." This is the concepts/cpu-throttling-vs-noisy-neighbor ambiguity: the scheduler's two distinct failure modes produce the same surface symptom.
- Pair with `sched.switch.out` tagged by the preempting cgroup to break the ambiguity. "Access to the `prev_cgroup_id` of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service." Self-preemption + high `runq.latency` → likely CPU-quota throttling. Cross-cgroup preemption + high `runq.latency` → noisy neighbor. Canonical patterns/dual-metric-disambiguation instance.
- System-service vs container attribution via cgroup-ID lookup. "Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. We categorize it as a system service if no such association is found." The userspace agent owns the cgroup-ID → container-ID map; unknown cgroup IDs are assumed to be systemd/host services — useful since kernel threads + host daemons are frequent, real noisy-neighbor sources.
- Safety-vs-feasibility rationale for eBPF over a kernel module. "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." Netflix runs this fleet-wide on its container platform (systems/netflix-titus); a buggy kernel module would threaten every host's availability, whereas a bad eBPF program fails the verifier at load time. This is the canonical safety-plus-fleet-deployment argument for eBPF over custom kernel code.
- Baseline number: p99 ≈ 83.4 µs, spikes to 400 µs. "Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters." Useful anchor for the magnitude of the problem — sub-millisecond scheduler-queue times are the healthy state; anything that moves a tenant a couple of orders of magnitude above this is the quantitative noisy-neighbor signal.
- Atlas emission shape: percentile timer + tagged counter. Per-container dimensions: `runq.latency` is a percentile timer (Atlas's histogram-style metric with sub-millisecond resolution); `sched.switch.out` is a counter tagged with the preemption-cause category (same-cgroup / different-container / system-service). This shape is the template for any scheduler/queue observability metric pair — latency histogram + cause-tagged counter — see systems/netflix-atlas.
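Two of the mechanisms above — the per-cgroup-per-CPU rate limit checked before any ring-buffer reservation, and the preemption-cause tagging that disambiguates throttling from noisy neighbors — can be sketched in Go. This is a userspace simulation under assumed semantics (the `rateLimitNs` value and all function names are illustrative; Netflix does not disclose `RATE_LIMIT_NS`):

```go
package main

import "fmt"

const rateLimitNs = 1_000_000 // illustrative window; the real RATE_LIMIT_NS is undisclosed

// RateLimiter simulates the per-(cgroup, CPU) last-timestamp check that runs
// before bpf_ringbuf_reserve (a BPF_MAP_TYPE_PERCPU_HASH analogue).
type RateLimiter struct {
	lastTs map[[2]uint64]uint64 // (cgroupID, cpu) -> last emitted ts
}

func NewRateLimiter() *RateLimiter {
	return &RateLimiter{lastTs: make(map[[2]uint64]uint64)}
}

// Allow reports whether an event for (cgroupID, cpu) at nowNs should be
// emitted; a hot cgroup is dropped before any ring-buffer bytes are reserved.
func (r *RateLimiter) Allow(cgroupID, cpu, nowNs uint64) bool {
	key := [2]uint64{cgroupID, cpu}
	if last, ok := r.lastTs[key]; ok && nowNs-last < rateLimitNs {
		return false
	}
	r.lastTs[key] = nowNs
	return true
}

// preemptCause tags a sched.switch.out increment for the task being switched
// out, based on the cgroup of the task taking the CPU (dual-metric
// disambiguation: same-cgroup + high runq.latency suggests quota throttling).
func preemptCause(outCgroup, inCgroup uint64, containers map[uint64]string) string {
	switch {
	case outCgroup == inCgroup:
		return "same-cgroup"
	case containers[inCgroup] != "":
		return "different-container" // the noisy-neighbor case
	default:
		return "system-service" // no container association found
	}
}

func main() {
	rl := NewRateLimiter()
	fmt.Println(rl.Allow(7, 0, 1_000))   // true: first event for this (cgroup, cpu)
	fmt.Println(rl.Allow(7, 0, 500_000)) // false: inside the rate window
	fmt.Println(rl.Allow(7, 1, 500_000)) // true: per-CPU state is independent

	containers := map[uint64]string{7: "svc-a", 9: "svc-b"}
	fmt.Println(preemptCause(7, 9, containers)) // different-container
}
```

Keeping the `Allow` check ahead of the reserve call is the load-bearing design point: the drop decision needs only `(cgroup_id, cpu, now)`, all of which are available before any allocation.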
Architecture at a glance¶
Kernel (eBPF) Userspace (Go)
───────────── ──────────────
┌──────────────────┐ ┌──────────────────┐
│ tp_btf/ │ │ tp_btf/ │
│ sched_wakeup(_new)│ │ sched_switch │
└────────┬─────────┘ └────────┬─────────┘
│ store ts │ load + subtract
▼ ▼
┌─────────────────────────────────┐
│ runq_enqueued: │
│ BPF_MAP_TYPE_HASH │
│ key=u32 pid val=u64 ts_ns │
└─────────────────────────────────┘
│ get_task_cgroup_id(prev) + get_task_cgroup_id(next)
│ inside bpf_rcu_read_lock() / _unlock() kfuncs
▼
┌─────────────────────────────────┐
│ cgroup_id_to_last_event_ts: │ ← per-cgroup per-CPU rate
│ BPF_MAP_TYPE_PERCPU_HASH │ limiter — skip ringbuf
│ key=u64 cgroup_id val=u64 ts │ allocation if
│ │ now - last_ts < RATE_LIMIT_NS
└─────────────────────────────────┘
│ if not rate-limited:
│ bpf_ringbuf_reserve + submit {prev_cgroup_id, cgroup_id,
│ runq_lat, ts}
▼
┌─────────────────────────────────┐ ─────────▶ ┌─────────────────┐
│ events: BPF_MAP_TYPE_RINGBUF │ │ Go agent │
└─────────────────────────────────┘ │ cgroup_id → │
│ container_id map│
│ │ │ │
│ ▼ ▼ │
│ Atlas metrics: │
│ runq.latency │
│ (percentile │
│ timer, per- │
│ container) │
│ sched.switch. │
│ out │
│ (counter, │
│ tagged by │
│ preempt- │
│ cause class) │
└─────────────────┘
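The right-hand side of the diagram — the Go agent resolving cgroup IDs to containers and folding latency samples into a percentile metric — can be made concrete with a small sketch. Assumptions are labeled: Atlas's percentile timer uses bucketed histograms, not the naive nearest-rank computation shown here, and all names are illustrative:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// containerFor resolves a cgroup ID the way the agent's attribution step
// does: anything without an association is categorized as a system service.
func containerFor(cgroupID uint64, byCgroup map[uint64]string) string {
	if id, ok := byCgroup[cgroupID]; ok {
		return id
	}
	return "system-service"
}

// p99 is a naive nearest-rank percentile over raw nanosecond samples, used
// only to make the baseline arithmetic concrete (Atlas buckets instead).
func p99(samplesNs []uint64) uint64 {
	s := append([]uint64(nil), samplesNs...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	idx := int(math.Ceil(0.99*float64(len(s)))) - 1
	return s[idx]
}

func main() {
	byCgroup := map[uint64]string{4281: "container-a"} // hypothetical mapping
	fmt.Println(containerFor(4281, byCgroup))          // container-a
	fmt.Println(containerFor(9999, byCgroup))          // system-service

	// 99 healthy samples near the 83.4 µs baseline plus one 400 µs spike:
	// the single spike sits above the p99 nearest-rank index, so p99 stays
	// in the healthy sub-100 µs range.
	var samples []uint64
	for i := 0; i < 99; i++ {
		samples = append(samples, 80_000+uint64(i)*100) // 80.0 µs .. 89.8 µs
	}
	samples = append(samples, 400_000)
	fmt.Println(p99(samples)) // 89800
}
```

This mirrors the article's observation that occasional 400 µs spikes do not move the p99 baseline out of acceptable range on a healthy host.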
Operational numbers¶
| Item | Named data point | Notes |
|---|---|---|
| Baseline p99 runq latency (underloaded host, 1 container) | 83.4 µs | Healthy reference; well below 1 ms |
| Occasional spikes on same host | up to 400 µs | Still within tolerance |
| Event volume motivating rate limit | Not quantified — "sheer number of data points was causing the userspace program to use too much CPU" | In-kernel rate limiter required for sustainability |
| Rate-limit scope | per `(cgroup_id, CPU)` | `BPF_MAP_TYPE_PERCPU_HASH` |
| Map size bound | `MAX_TASK_ENTRIES` (not specified) | Applies to `runq_enqueued` + `cgroup_id_to_last_event_ts` |
Caveats¶
- Architecture-overview voice with partial code. Code snippets are representative (the post elides the full `sched_switch` handler — several `// ....` placeholders). The `RATE_LIMIT_NS`, `MAX_TASK_ENTRIES`, and `RINGBUF_SIZE_BYTES` constants are declared but their actual chosen values are not disclosed.
- No fleet-scale numbers. Titus fleet size, host count, daily event count, userspace-agent CPU consumption after rate limiting, and the tail-latency overhead of the eBPF hooks themselves are not given. The post is shape + code, not measured impact.
- No before/after. How many noisy-neighbor incidents were detected pre- vs post-deployment, how many false attributions were avoided by the `sched.switch.out` pairing, and what percentage of hosts exhibit cross-cgroup preemption are not quantified.
- Rate-limit trade-off not characterised. Per-cgroup-per-CPU sampling will undercount burst-dominated signal; there is no discussion of what window (`RATE_LIMIT_NS` value) was chosen or of how the team validated that the sampled distribution matches the full one at the percentiles that matter.
- Only two tracepoints, not the full picture. The post explicitly omits `sched_wakeup_new` handling details, I/O-wait contributions to apparent queue time, and interactions with CFS CPU throttling at the quota-refill boundary. `runq.latency` is motivated as the noisy-neighbor signal, but NUMA memory-bandwidth, LLC cache, and network-softirq contention all present distinct pathologies the scheduler-queue metric won't see.
- "Kernel module was feasible" framing is one-sided. eBPF's safety win is real, but the post doesn't cover the operational challenges: the verifier's complexity limit, the kernel-version matrix Netflix must support on its fleet, or the BTF/CO-RE portability investment. Treat the "we chose eBPF for safety + flexibility" statement as a summary, not a full comparative architecture-decision record.
- Article ends at the baseline example. The title's "A Noisy Neighbor Story" section introduces the p99 = 83.4 µs baseline, but the raw file terminates there — whatever follow-up graph or incident post-mortem existed on the original blog is not present in the ingested markdown, so the worked example of attribution-via-dual-metric isn't in the distilled content.
- Atlas emission details are thin. `runq.latency`'s bucketing config, histogram granularity, retention, and per-container cardinality cost on the Atlas backend are not given — Netflix documented Atlas separately, and the 2014 Atlas intro is linked but not ingested here.
Source¶
- Original: https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd
- HN discussion: news.ycombinator.com/item?id=41513860 (256 points)
- Raw markdown: raw/netflix/2024-09-11-noisy-neighbor-detection-with-ebpf-86a1c04b.md
- Referenced: Atlas — Netflix's primary telemetry platform (2014), Linux scheduler RCU-protected cgroup access, kfuncs docs, eBPF ring buffer (Andrii Nakryiko)
Related¶
- companies/netflix
- systems/ebpf · systems/netflix-runq-monitor · systems/netflix-atlas · systems/netflix-titus
- concepts/noisy-neighbor · concepts/run-queue-latency · concepts/cgroup-id · concepts/cpu-throttling-vs-noisy-neighbor · concepts/linux-cgroup · concepts/performance-isolation · concepts/ebpf-verifier
- patterns/scheduler-tracepoint-based-monitoring · patterns/dual-metric-disambiguation · patterns/per-cgroup-rate-limiting-in-ebpf