
NETFLIX 2024-09-11


Netflix — Noisy Neighbor Detection with eBPF

Summary

Netflix describes a per-container run-queue latency monitor built on eBPF and attached to three Linux scheduler tracepoints (sched_wakeup, sched_wakeup_new, sched_switch). The in-kernel program timestamps a task when it becomes runnable, subtracts that timestamp when the task is dispatched onto a CPU, and emits (cgroup_id, runq_lat, prev_cgroup_id) tuples through an eBPF ring-buffer map to a Go userspace agent, which emits two Atlas metrics per container: a percentile timer runq.latency and a preemption counter sched.switch.out. Accessing the task's cgroup ID requires entering an RCU read-side critical section from BPF via the bpf_rcu_read_lock / bpf_rcu_read_unlock kfuncs — the team notes that "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." To keep overhead bounded on hosts that schedule millions of events per second, events are dropped in-kernel by a per-cgroup, per-CPU rate limiter before the ring-buffer reserve call.

The headline operational insight is that runq.latency alone is ambiguous — a container throttled by its own cgroup CPU quota also shows elevated run-queue latency — so Netflix pairs it with sched.switch.out tagged by the preempting cgroup (dual-metric disambiguation); simultaneous spikes in both, attributed to a different container or system service, are the load-bearing noisy-neighbor signal. A baseline on an underloaded host is given (p99 ≈ 83.4 µs, with occasional spikes to 400 µs).

Key takeaways

  1. Run-queue latency is the right observability primitive for CFS-scheduled noisy neighbors. "Runqueue latency, a key indicator of CPU contention, is the time tasks spend in the scheduler's queue waiting for CPU time before being dispatched for execution. Prolonged runqueue latency signifies that processes waiting for CPU time are experiencing delays, which can significantly impact the performance of services running on the host." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf.) The measurement lives at the scheduler-queue layer, not at the syscall/app layer — exactly where co-tenant contention is.

  2. Three scheduler tracepoints, two roles, are enough. sched_wakeup + sched_wakeup_new record the enqueue time keyed by PID into a BPF_MAP_TYPE_HASH; sched_switch looks the PID up on dispatch, subtracts, and emits. This is the canonical patterns/scheduler-tracepoint-based-monitoring shape — hooks that bracket a kernel state transition, a hash map keyed by the task identifier, a subtraction at dispatch.

  3. tp_btf tracepoints give pointers to the full task_struct. "One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads." Unlike raw tracepoints, BTF-typed tracepoints (the tp_btf attach type) let the BPF program navigate struct task_struct directly with compile-time type information — no tedious per-kernel offset plumbing.

  4. Reading task->cgroups->dfl_cgrp->kn->id requires BPF RCU kfuncs. "The cgroup information in the process struct is safeguarded by an RCU (Read Copy Update) lock. To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs." Netflix wraps the deref in bpf_rcu_read_lock() / bpf_rcu_read_unlock() — a capability added specifically so BPF programs can safely walk RCU-protected kernel structures. See concepts/cgroup-id + concepts/ebpf-verifier.

  5. Ring buffer is the right user/kernel transport — but raw event rates blow it up. "We chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data." The ring buffer is not the bottleneck at steady state — the userspace consumer is, because it emits metrics per event.

  6. Rate-limit in-kernel, per-cgroup, per-CPU — before ringbuf_reserve. The Netflix design stores last_ts per (cgroup_id, cpu) in a BPF_MAP_TYPE_PERCPU_HASH; a new event for a given cgroup is dropped if now - last_ts < RATE_LIMIT_NS. Critically, the check runs before bpf_ringbuf_reserve, so a hot cgroup doesn't even allocate ring-buffer bytes it will not use. This is the canonical shape of patterns/per-cgroup-rate-limiting-in-ebpf — drop at the earliest point where information is sufficient to decide.

  7. runq.latency alone is the wrong signal. "If a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota." This is the concepts/cpu-throttling-vs-noisy-neighbor ambiguity: the scheduler's two distinct failure modes produce the same surface symptom.

  8. Pair with sched.switch.out tagged by preempting cgroup to break the ambiguity. "Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service." Self-preemption + high runq.latency → likely CPU-quota throttling. Cross-cgroup preemption + high runq.latency → noisy neighbor. Canonical patterns/dual-metric-disambiguation instance.

  9. System-service vs container attribution via cgroup-ID lookup. "Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. We categorize it as a system service if no such association is found." The userspace agent owns the cgroup-ID → container-ID map; unknown cgroup IDs are assumed to be systemd/host services — useful since kernel threads + host daemons are frequent, real noisy-neighbor sources.

  10. Safety-vs-feasibility rationale for eBPF over a kernel module. "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." Netflix runs this on its container platform (systems/netflix-titus) fleet-wide; a buggy kernel module would threaten every host's availability, whereas a bad eBPF program fails the verifier at load time. This is the canonical safety+fleet-deployment argument for eBPF over custom kernel code.

  11. Baseline number: p99 ≈ 83.4 µs, spikes to 400 µs. "Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters." Useful anchor for the magnitude of the problem — sub-millisecond scheduler-queue times are the healthy state; anything that moves a well-loaded tenant a couple of orders of magnitude above this is the quantitative noisy-neighbor signal.

  12. Atlas emission shape: percentile timer + tagged counter. Per container, runq.latency is a percentile timer (Atlas's histogram-style metric with sub-millisecond resolution) and sched.switch.out is a counter tagged with the preemption-cause category (same-cgroup / different-container / system-service). This shape is the template for any scheduler/queue observability metric pair — latency histogram + cause-tagged counter — see systems/netflix-atlas.

Architecture at a glance

 Kernel (eBPF)                                  Userspace (Go)
 ─────────────                                  ──────────────
┌───────────────────┐  ┌───────────────────┐
│ tp_btf/           │  │ tp_btf/           │
│ sched_wakeup(_new)│  │ sched_switch      │
└─────────┬─────────┘  └─────────┬─────────┘
          │ store ts             │ load + subtract
          ▼                      ▼
     ┌──────────────────────────────────┐
     │ runq_enqueued:                   │
     │   BPF_MAP_TYPE_HASH              │
     │   key=u32 pid  val=u64 ts_ns     │
     └──────────────────────────────────┘
          │  get_task_cgroup_id(prev) + get_task_cgroup_id(next)
          │    inside bpf_rcu_read_lock() / bpf_rcu_read_unlock()
          ▼
     ┌──────────────────────────────────┐
     │ cgroup_id_to_last_event_ts:      │  ← per-cgroup per-CPU rate
     │   BPF_MAP_TYPE_PERCPU_HASH       │    limiter — skip ringbuf
     │   key=u64 cgroup_id  val=u64 ts  │    allocation if
     └──────────────────────────────────┘    now - last_ts < RATE_LIMIT_NS
          │  if not rate-limited:
          │    bpf_ringbuf_reserve + submit
          │    {prev_cgroup_id, cgroup_id, runq_lat, ts}
          ▼
     ┌──────────────────────────────────┐      ┌────────────────────────────┐
     │ events: BPF_MAP_TYPE_RINGBUF     │ ───▶ │ Go agent                   │
     └──────────────────────────────────┘      │ cgroup_id → container_id   │
                                               │ map; emits Atlas metrics:  │
                                               │  runq.latency (percentile  │
                                               │   timer, per container)    │
                                               │  sched.switch.out (counter,│
                                               │   tagged by preemption-    │
                                               │   cause class)             │
                                               └────────────────────────────┘

Operational numbers

Item | Named data point | Notes
Baseline p99 runq latency (underloaded host, 1 container) | 83.4 µs | Healthy reference; well below 1 ms
Occasional spikes on same host | up to 400 µs | Still within tolerance
Event volume motivating rate limit | Not quantified | "Sheer number of data points was causing the userspace program to use too much CPU"; in-kernel rate limiter required for sustainability
Rate-limit scope | per (cgroup_id, CPU) | BPF_MAP_TYPE_PERCPU_HASH
Map size bound | MAX_TASK_ENTRIES (value not specified) | Applies to runq_enqueued and cgroup_id_to_last_event_ts

Caveats

  • Architecture-overview voice with partial code. Code snippets are representative (the post elides the full sched_switch handler — several // .... placeholders). RATE_LIMIT_NS, MAX_TASK_ENTRIES, and RINGBUF_SIZE_BYTES constants are declared but their actual chosen values are not disclosed.
  • No fleet-scale numbers. Titus fleet size, host count, daily event count, userspace-agent CPU consumption after rate limiting, or p-tail overhead of the eBPF hooks themselves are not given. The post is shape + code, not measured impact.
  • No before/after. How many noisy-neighbor incidents were detected pre- vs post-deployment, how many false attributions were avoided by the sched.switch.out pairing, or what percentage of hosts exhibit cross-cgroup preemption are not quantified.
  • Rate-limit trade-off not characterised. Per-cgroup-per-CPU sampling will undercount burst-dominated signal; no discussion of what window (RATE_LIMIT_NS value) was chosen or how the team validated that the sampled distribution matches the full one at the percentiles that matter.
  • Only the scheduler-queue lens, not the full picture. The post explicitly omits sched_wakeup_new handling details, I/O-wait contributions to apparent queue time, and interactions with CFS CPU throttling at the quota-refill boundary. runq.latency is motivated as the noisy-neighbor signal, but NUMA memory-bandwidth, LLC-cache, and network-softirq contention all present distinct pathologies the scheduler-queue metric won't see.
  • "Kernel module was feasible" framing is one-sided. eBPF's safety win is real, but the post doesn't cover the operational challenges: the verifier's complexity limit, the kernel-version matrix Netflix must support on its fleet, or BTF/CO-RE portability investment. Treat the "we chose eBPF for safety + flexibility" statement as a summary, not a full comparative architecture-decision record.
  • Article ends at the baseline example. The title's "A Noisy Neighbor Story" section introduces the p99 = 83.4 µs baseline but the raw file terminates there — whatever follow-up graph or incident post-mortem existed on the original blog is not present in the ingested markdown, so the worked example of attribution-via-dual-metric isn't in the distilled content.
  • Atlas emission details are thin. runq.latency's bucketing config, histogram granularity, retention, and per-container cardinality cost on the Atlas backend are not given — Netflix previously documented Atlas separately and the 2014 Atlas intro is linked but not ingested here.

Source

sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf