
Netflix runq.latency monitor

Netflix's per-container run-queue-latency monitor is the eBPF-based observability stack described in sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf. It runs on the hosts of the Titus container platform and emits per-container Atlas metrics that identify noisy-neighbor CPU contention at the Linux scheduler layer and attribute its cause.

Netflix refers to the monitor only descriptively in the post; it has no proper marketing name. This wiki page refers to it as the runq.latency monitor after the headline Atlas metric it emits.

Architecture

Kernel side (eBPF program attached to tp_btf/sched_wakeup,
             tp_btf/sched_wakeup_new, tp_btf/sched_switch)

  sched_wakeup(_new): map_update(runq_enqueued, pid → now)
                                    │  BPF_MAP_TYPE_HASH
                                    │  key=u32 pid  val=u64 ts
  sched_switch:
      ts   = map_lookup(runq_enqueued, next.pid)
      lat  = now - ts
      map_delete(runq_enqueued, next.pid)

      cgroup_id       = RCU-locked deref of task->cgroups->dfl_cgrp->kn->id
      prev_cgroup_id  = ditto for prev task

      if (now - last_ts[cgroup_id] < RATE_LIMIT_NS) return;   ← in-kernel
                                                               rate limit
                                                               (PERCPU_HASH,
                                                                per-cgroup-
                                                                per-CPU)
      ringbuf_submit {prev_cgroup_id, cgroup_id, lat, ts}
      last_ts[cgroup_id] = now

Userspace side (Go agent)

  consume ringbuf events
  cgroup_id → container_id  (fleet control-plane map; unknown → "system")
  classify prev_cgroup_id:  same_cgroup / different_container / system_service
  emit to Atlas:
      runq.latency           (percentile-timer, per container)
      sched.switch.out       (counter, per container, tagged by preempt cause)

Load-bearing mechanisms

  • tp_btf/ tracepoints. BTF-typed attachments give the program typed task_struct * arguments, so field reads (task->pid, task->cgroups, …) compile directly instead of going through per-kernel-version offset math. See patterns/scheduler-tracepoint-based-monitoring.
  • BPF RCU kfuncs. bpf_rcu_read_lock() / bpf_rcu_read_unlock() bracket the cgroup dereference chain so the program can safely walk an RCU-protected pointer. See concepts/cgroup-id.
  • Per-cgroup-per-CPU rate limiter. PERCPU_HASH keyed by cgroup ID, checked before bpf_ringbuf_reserve, to keep userspace CPU load bounded under high event rates. See patterns/per-cgroup-rate-limiting-in-ebpf.
  • Ring buffer (not perf event array). Variable-length records, single buffer (not per-CPU), no copy syscalls. The consumer is still the bottleneck — the rate limiter exists exactly to protect it.
  • Two metrics, not one. The dual-metric output design — runq.latency + cause-tagged sched.switch.out — is what lets the team distinguish cross-cgroup noisy neighbor from self CFS-quota throttling. See concepts/cpu-throttling-vs-noisy-neighbor.

Output metrics

Metric            Atlas type        Dimensions                    Meaning
----------------  ----------------  ----------------------------  -------
runq.latency      percentile timer  container_id                  p50 / p99 / max run-queue wait time for tasks in the container
sched.switch.out  counter           container_id × preempt_cause  number of times the container's tasks were preempted, tagged by who preempted them
                                    (same_cgroup |
                                     different_container |
                                     system_service)

Healthy baseline (from the post): runq.latency p99 ≈ 83.4 µs, occasional spikes to 400 µs on an underloaded single-container host. Anything at or above the tens-of-milliseconds range is a strong noisy-neighbor / throttling signal.

Why eBPF, not a kernel module

"While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)

The monitor is deployed fleet-wide on Titus hosts. A bug in a kernel module can panic the host; a buggy eBPF program is rejected by the verifier at load time or contained by the runtime sandbox. The safety argument is decisive for anything that must run on every host.

Caveats / undisclosed

  • Fleet size, per-host CPU cost, daily sample count, and the values of RINGBUF_SIZE_BYTES, RATE_LIMIT_NS, and MAX_TASK_ENTRIES are not given.
  • Only the CFS scheduler path is addressed; SCHED_DEADLINE / SCHED_RT / kernel threads aren't discussed.
  • No incident post-mortem / before-after numbers — the article ends mid-example ("A Noisy Neighbor Story") at the baseline.
  • sched_wakeup_new handling is alluded to but its code isn't shown.
