Netflix runq.latency monitor¶
Netflix's per-container run-queue-latency monitor is the eBPF-based observability stack described in sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf. It runs on the Titus container platform's hosts and emits per-container Atlas metrics that identify, and attribute the cause of, noisy-neighbor CPU contention at the Linux scheduler layer.
Netflix refers to the monitor only descriptively in the post; it has no proper marketing name. This wiki page refers to it as the runq.latency monitor after the headline Atlas metric it emits.
Architecture¶
Kernel side (eBPF program attached to tp_btf/sched_wakeup, tp_btf/sched_wakeup_new, tp_btf/sched_switch)
    sched_wakeup(_new): map_update(runq_enqueued, pid → now)
        │
        │ BPF_MAP_TYPE_HASH
        │ key=u32 pid val=u64 ts
        ▼
    sched_switch:
        ts = map_lookup(runq_enqueued, next.pid)
        if (!ts) return;   ← no recorded wakeup, nothing to measure
        lat = now - ts
        map_delete(runq_enqueued, next.pid)
        cgroup_id = RCU-locked deref of task->cgroups->dfl_cgrp->kn->id
        prev_cgroup_id = ditto for prev task
        if (now - last_ts[cgroup_id] < RATE_LIMIT_NS) return;   ← in-kernel rate limit (PERCPU_HASH, per-cgroup-per-CPU)
        ringbuf_submit {prev_cgroup_id, cgroup_id, lat, ts}
        last_ts[cgroup_id] = now
Userspace side (Go agent)
    consume ringbuf events
    cgroup_id → container_id (fleet control-plane map; unknown → "system")
    classify prev_cgroup_id: same_cgroup / different_container / system_service
    emit to Atlas:
        runq.latency (percentile-timer, per container)
        sched.switch.out (counter, per container, tagged by preempt cause)
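The kernel-side bookkeeping in the diagram can be modeled in ordinary Go to make the data flow concrete. This is a userspace sketch only, not the eBPF program: the `Tracker` type and its methods are hypothetical names, a Go map stands in for the `BPF_MAP_TYPE_HASH`, timestamps are supplied by the caller, and the rate-limit step is left out.

```go
package main

// Event mirrors the record the diagram submits to the ring buffer.
type Event struct {
	PrevCgroupID uint64
	CgroupID     uint64
	LatencyNs    uint64
	TsNs         uint64
}

// Tracker is a userspace stand-in (hypothetical) for the runq_enqueued map:
// key = pid, value = timestamp in nanoseconds at which the task became runnable.
type Tracker struct {
	enqueued map[uint32]uint64
}

func NewTracker() *Tracker {
	return &Tracker{enqueued: make(map[uint32]uint64)}
}

// OnWakeup models sched_wakeup / sched_wakeup_new: remember the enqueue time.
func (t *Tracker) OnWakeup(pid uint32, nowNs uint64) {
	t.enqueued[pid] = nowNs
}

// OnSwitch models sched_switch for the task being switched in. It returns
// ok=false when no wakeup was recorded for nextPid, so there is nothing to
// measure. The entry is deleted so a later switch cannot reuse a stale time.
func (t *Tracker) OnSwitch(nextPid uint32, prevCgroup, nextCgroup, nowNs uint64) (Event, bool) {
	ts, found := t.enqueued[nextPid]
	if !found {
		return Event{}, false
	}
	delete(t.enqueued, nextPid)
	return Event{
		PrevCgroupID: prevCgroup,
		CgroupID:     nextCgroup,
		LatencyNs:    nowNs - ts,
		TsNs:         nowNs,
	}, true
}
```

In this model a wakeup at 1000 ns followed by a switch-in at 84400 ns yields a latency of 83400 ns, and a second switch-in for the same pid without a new wakeup yields no event because the entry has already been deleted.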
Load-bearing mechanisms¶
- tp_btf/ tracepoints. BTF-typed attachments give the program typed task_struct * arguments, so field reads (task->pid, task->cgroups, …) compile directly instead of going through per-kernel-version offset math. See patterns/scheduler-tracepoint-based-monitoring.
- BPF RCU kfuncs. bpf_rcu_read_lock() / bpf_rcu_read_unlock() bracket the cgroup dereference chain so the program can safely walk an RCU-protected pointer. See concepts/cgroup-id.
- Per-cgroup-per-CPU rate limiter. PERCPU_HASH keyed by cgroup ID, checked before bpf_ringbuf_reserve, to keep userspace CPU load bounded under high event rates. See patterns/per-cgroup-rate-limiting-in-ebpf.
- Ring buffer (not perf event array). Variable-length records, single buffer (not per-CPU), no copy syscalls. The consumer is still the bottleneck; the rate limiter exists exactly to protect it.
- Two metrics, not one. The dual-metric output design (runq.latency plus cause-tagged sched.switch.out) is what lets the team distinguish cross-cgroup noisy neighbor from self CFS-quota throttling. See concepts/cpu-throttling-vs-noisy-neighbor.
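The per-cgroup-per-CPU rate limiter is easiest to see as a small model. The following is a userspace Go sketch of the idea, not the eBPF code: `RateLimiter` and its methods are hypothetical names, a slice of maps stands in for the `PERCPU_HASH`, and the interval is a caller-chosen placeholder because the real `RATE_LIMIT_NS` value is not disclosed.

```go
package main

// RateLimiter models a per-CPU hash keyed by cgroup ID: every CPU keeps its
// own last-emit timestamp per cgroup, so CPUs never contend on a shared slot.
type RateLimiter struct {
	intervalNs uint64
	lastTs     []map[uint64]uint64 // index = CPU, key = cgroup ID
}

func NewRateLimiter(numCPU int, intervalNs uint64) *RateLimiter {
	perCPU := make([]map[uint64]uint64, numCPU)
	for i := range perCPU {
		perCPU[i] = make(map[uint64]uint64)
	}
	return &RateLimiter{intervalNs: intervalNs, lastTs: perCPU}
}

// Allow reports whether an event for cgroupID seen on cpu at nowNs may be
// emitted. When it may, the emit time is recorded; when it may not, the
// event is dropped before any ring-buffer work would happen.
func (r *RateLimiter) Allow(cpu int, cgroupID, nowNs uint64) bool {
	last, seen := r.lastTs[cpu][cgroupID]
	if seen && nowNs-last < r.intervalNs {
		return false
	}
	r.lastTs[cpu][cgroupID] = nowNs
	return true
}
```

Because the state is per CPU and per cgroup, a busy container on one CPU does not suppress samples from the same container on another CPU, nor samples from other containers. The cost is that the worst-case event rate scales with the number of CPUs times the number of active cgroups.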
Output metrics¶
| Metric | Atlas type | Dimensions | Meaning |
|---|---|---|---|
| runq.latency | percentile timer | container_id | p50 / p99 / max run-queue wait time for tasks in the container |
| sched.switch.out | counter | container_id × preempt_cause ∈ {same_cgroup, different_container, system_service} | number of times the container's tasks were preempted, tagged by who preempted them |
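The container_id and preempt_cause dimensions could be derived from a ring-buffer event roughly as follows. This is a minimal sketch under stated assumptions: `Resolver`, `ContainerOf`, and `PreemptCause` are hypothetical names, a plain Go map stands in for the fleet control-plane mapping, and which container the counter is attributed to is left to the caller.

```go
package main

// Preempt-cause tag values, as listed in the metrics table.
const (
	CauseSameCgroup         = "same_cgroup"
	CauseDifferentContainer = "different_container"
	CauseSystemService      = "system_service"
)

// Resolver maps cgroup IDs to container IDs. In the described system this
// mapping comes from the control plane; here it is a plain map (assumption).
type Resolver struct {
	Containers map[uint64]string
}

// ContainerOf returns the container ID for a cgroup, or "system" when the
// cgroup does not belong to a known container.
func (r Resolver) ContainerOf(cgroupID uint64) string {
	if id, ok := r.Containers[cgroupID]; ok {
		return id
	}
	return "system"
}

// PreemptCause classifies prevCgroupID relative to cgroupID: identical IDs
// are the same cgroup, a known but different cgroup is another container,
// and anything unknown is treated as a system service.
func (r Resolver) PreemptCause(prevCgroupID, cgroupID uint64) string {
	if prevCgroupID == cgroupID {
		return CauseSameCgroup
	}
	if _, known := r.Containers[prevCgroupID]; known {
		return CauseDifferentContainer
	}
	return CauseSystemService
}
```

Comparing raw cgroup IDs keeps the same_cgroup check free of any lookup. A container that spans several cgroups would need the comparison done on resolved container IDs instead.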
Healthy baseline (from the post): runq.latency p99 ≈ 83.4 µs,
occasional spikes to 400 µs on an underloaded single-container
host. Anything at or above the tens-of-milliseconds range is a
strong noisy-neighbor / throttling signal.
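One way to turn the baseline and the two metrics into a first-pass triage rule is sketched below. Everything here is an assumption for illustration: the `Triage` function, the 10 ms threshold (a stand-in for "tens of milliseconds"), and the simple majority rule over preempt causes are hypothetical and not taken from the post.

```go
package main

import "time"

// SwitchOutCounts holds sched.switch.out counts for one container over some
// window, split by preempt cause.
type SwitchOutCounts struct {
	SameCgroup         uint64
	DifferentContainer uint64
	SystemService      uint64
}

// suspectP99 is a hypothetical alert threshold; the healthy baseline in the
// text is around 83.4 µs with spikes to 400 µs.
const suspectP99 = 10 * time.Millisecond

// Triage combines both metrics: low latency is healthy; high latency with
// preemptions mostly from outside the cgroup points at a noisy neighbor;
// high latency without that points at the container's own CFS quota.
func Triage(p99 time.Duration, c SwitchOutCounts) string {
	if p99 < suspectP99 {
		return "healthy"
	}
	external := c.DifferentContainer + c.SystemService
	if external > c.SameCgroup {
		return "noisy-neighbor-suspected"
	}
	return "self-throttling-suspected"
}
```

The point of the sketch is that runq.latency alone cannot separate the two failure modes; the cause-tagged counter supplies the second axis.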
Why eBPF, not a kernel module¶
"While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
The monitor is deployed fleet-wide on Titus hosts. A memory-safety bug in a kernel module can panic the host, whereas the eBPF verifier rejects a program with unsafe memory accesses or unbounded loops at load time, before it ever runs. The safety argument is decisive for anything that must run on every host.
Caveats / undisclosed¶
- Fleet size, per-host CPU cost, daily sample count, ring-buffer size (RINGBUF_SIZE_BYTES), and the RATE_LIMIT_NS and MAX_TASK_ENTRIES values are not given.
- Only the CFS scheduler path is addressed; SCHED_DEADLINE / SCHED_RT / kernel threads aren't discussed.
- No incident post-mortem / before-after numbers: the article ends mid-example ("A Noisy Neighbor Story") at the baseline.
- sched_wakeup_new handling is alluded to but its code isn't shown.
Seen in¶
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — architecture + code sketch.