CONCEPT Cited by 1 source
Run queue latency¶
Run queue latency is the time a Linux task spends in the
scheduler's run queue — in the TASK_RUNNING (runnable) state — after
the kernel has decided it is ready to execute but before the CFS
scheduler has dispatched it onto a CPU. It is the queueing delay that
co-tenants impose on each other at the OS-scheduler layer: the task is
not waiting on I/O, not waiting on a lock, not blocked — just waiting
its turn for a CPU core.
"Runqueue latency, a key indicator of CPU contention, is the time tasks spend in the scheduler's queue waiting for CPU time before being dispatched for execution. Prolonged runqueue latency signifies that processes waiting for CPU time are experiencing delays, which can significantly impact the performance of services running on the host." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
Why it's the right metric for noisy neighbors¶
A noisy neighbor (concepts/noisy-neighbor) is, by construction, a scheduler-queue phenomenon. When workload A starves workload B by consuming CPU cycles, B's tasks aren't blocked anywhere else in the kernel — they are runnable and sitting in the run queue. Application-level latency metrics will show the symptom (p99 regression, throughput drop) but can't distinguish scheduler queueing from lock contention, slow I/O, GC pauses, or network stalls. Run queue latency is the direct measurement at the layer where the problem lives.
How it is measured¶
Two Linux scheduler tracepoints bracket the transition:
- sched_wakeup / sched_wakeup_new — fired when a task moves from sleeping to runnable. Record a timestamp keyed by PID.
- sched_switch — fired when a CPU switches between tasks. Look up the timestamp for the incoming task, subtract, emit the delta.
This map-key / timestamp / subtract-at-dispatch shape is the canonical
instance of patterns/scheduler-tracepoint-based-monitoring. An
eBPF program attached to the two tracepoints (as
tp_btf/sched_wakeup + tp_btf/sched_switch) can do the whole
computation in-kernel, with BTF-typed access to struct task_struct.
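The in-kernel bookkeeping can be sketched in plain userspace Python (this is an illustration of the map-key / timestamp / subtract-at-dispatch shape, not Netflix's eBPF code; the dict stands in for the BPF hash map):

```python
# sched_wakeup records a per-PID timestamp; sched_switch looks it up,
# subtracts, and emits the runnable delta.

wakeup_ts = {}  # BPF hash map analogue: pid -> wakeup timestamp (ns)

def on_sched_wakeup(pid, now_ns):
    """Task became runnable: remember when it entered the run queue."""
    wakeup_ts[pid] = now_ns

def on_sched_switch(next_pid, now_ns):
    """CPU dispatches next_pid: emit how long it sat runnable."""
    start = wakeup_ts.pop(next_pid, None)
    if start is None:
        return None  # no recorded wakeup for this task
    return now_ns - start  # run queue latency in nanoseconds

on_sched_wakeup(pid=42, now_ns=1_000_000)
print(on_sched_switch(next_pid=42, now_ns=1_083_400))  # 83400 ns = 83.4 µs
```

The pop-on-dispatch keeps the map bounded: each wakeup entry is consumed by exactly one dispatch.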
Attribution — to which container?¶
A raw per-PID sample isn't useful for a multi-tenant fleet. Netflix
tags each sample with the task's cgroup ID
(derived from task_struct->cgroups->dfl_cgrp->kn->id, which requires
a BPF RCU read-side critical section), giving a per-container
histogram (runq.latency) instead of a per-PID one. The userspace
agent maps cgroup ID → container ID; unknown cgroups are attributed to
system services.
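The userspace side of that attribution reduces to a lookup with a fallback. A minimal sketch (the cgroup and container IDs here are hypothetical placeholders, not real values):

```python
# Map each sample's cgroup ID to a container ID; anything the agent
# doesn't recognize is charged to system services.

cgroup_to_container = {  # illustrative IDs only
    5021: "container-a",
    5077: "container-b",
}

def attribute(cgroup_id):
    """Return the container a runq.latency sample belongs to."""
    return cgroup_to_container.get(cgroup_id, "system-services")

print(attribute(5021))  # container-a
print(attribute(9999))  # system-services
```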
Healthy baseline¶
Netflix's baseline on an underloaded single-container host:
"The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters."
Sub-millisecond is the healthy range. A well-tuned CFS-scheduled host holds run-queue latency at least one order of magnitude below application latency budgets; if p99 runq latency approaches the application's own latency target, something is wrong at the scheduler layer.
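The order-of-magnitude rule above can be written down directly; the threshold factor here is an illustrative reading of that rule, not a Netflix alerting config:

```python
# Flag a host when p99 run queue latency comes within one order of
# magnitude of the application's own latency budget.

def runq_p99_healthy(p99_runq_us, app_budget_us):
    """True if scheduler queueing is >= 10x below the app budget."""
    return p99_runq_us * 10 <= app_budget_us

print(runq_p99_healthy(83.4, 10_000))   # True: 83.4 µs vs a 10 ms budget
print(runq_p99_healthy(2_000, 10_000))  # False: 2 ms is too close to 10 ms
```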
The ambiguity it cannot resolve alone¶
Run-queue latency rises under two distinct causes:
- Noisy neighbor — a different cgroup is consuming the CPU, preempting this cgroup's tasks.
- Self-throttling — this cgroup is over its CFS CPU quota; the scheduler is throttling it, and its tasks accumulate in the queue.
Both raise runq.latency identically. Distinguishing them requires a
second metric paired with it — see
concepts/cpu-throttling-vs-noisy-neighbor and
patterns/dual-metric-disambiguation.
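The disambiguation shape, assuming a second per-cgroup signal (CFS throttling) is collected alongside runq.latency, can be sketched as a two-bit decision (the labels and thresholds are illustrative):

```python
# Pair runq.latency with a throttling signal to tell the two causes apart.

def classify(runq_latency_high, throttled):
    """Attribute elevated run queue latency to a cause."""
    if not runq_latency_high:
        return "healthy"
    return "self-throttling" if throttled else "noisy-neighbor"

print(classify(True, False))  # noisy-neighbor: queued but not throttled
print(classify(True, True))   # self-throttling: over its own quota
```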
Seen in¶
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — Netflix's eBPF-based per-container run queue latency monitor; tracepoints + BPF hash map + ring buffer → Atlas percentile timer. Introduces run queue latency as the primitive noisy-neighbor signal on CFS.