CONCEPT
CPU throttling vs noisy neighbor¶
The two distinct scheduler pathologies that present identically in run-queue latency, and which therefore cannot be told apart by that metric alone.
The two causes, same surface¶
For a container on a CFS-scheduled Linux host, elevated runq.latency — tasks waiting in the run queue beyond the healthy baseline — can mean either:
- Noisy neighbor. A different cgroup on the same host is consuming CPU cycles. This cgroup's tasks are runnable, but the scheduler is giving time to someone else. The queueing delay is externally imposed.
- Self CPU-quota throttling. This cgroup is over its own cgroup CPU limit (cpu.max / CFS bandwidth). The scheduler throttles it; its tasks accumulate in the run queue until the next quota refill. The queueing delay is self-inflicted.
Both yield the same symptom: high runq.latency for the victim cgroup.
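One cheap tie-breaker for the self-throttling case is the cgroup's own cpu.stat counters: under cgroup v2, a nonzero nr_throttled means the cgroup hit its own quota, regardless of what neighbors are doing. A minimal sketch (the sample blob and the threshold are illustrative, not from the source):

```python
def parse_cpu_stat(text):
    """Parse a cgroup v2 cpu.stat blob into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

def is_self_throttled(stats, min_throttled=1):
    # cgroup v2 exposes nr_throttled / throttled_usec once cpu.max is set;
    # nonzero nr_throttled means this cgroup exhausted its own quota.
    return stats.get("nr_throttled", 0) >= min_throttled

# Illustrative contents of /sys/fs/cgroup/<container>/cpu.stat
sample = """usage_usec 4500000
user_usec 3000000
system_usec 1500000
nr_periods 120
nr_throttled 37
throttled_usec 910000"""
print(is_self_throttled(parse_cpu_stat(sample)))  # → True
```

This only confirms or rules out the self-inflicted half of the ambiguity; it says nothing about which neighbor is responsible when throttling is absent.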
"If a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
Why the distinction matters operationally¶
- Noisy neighbor → platform problem. The action is fleet-level: co-tenancy policy, CPU reservation, bin-packing, evict/migrate the offending cgroup.
- Self-throttling → tenant problem. The action is container-level: raise the tenant's CPU limit, optimise the tenant's code, remove a runaway loop. The platform team shouldn't be paged.
Mis-attributing one as the other produces the wrong remediation path, wastes on-call time, and erodes trust in the observability stack.
Breaking the ambiguity: pair with preemption-cause-tagged counter¶
The remedy is the dual-metric-disambiguation shape Netflix deployed: alongside runq.latency, emit a sched.switch.out counter tagged with the category of the preempting process:
| runq.latency | sched.switch.out tag | Inferred cause |
|---|---|---|
| Elevated | Mostly same cgroup | Self-throttling (own tasks preempt each other at quota boundary) |
| Elevated | Mostly different container | Noisy neighbor (external cgroup is consuming CPU) |
| Elevated | Mostly system service | Host-side noisy neighbor (kernel thread / systemd daemon) |
| Baseline | n/a | Healthy |
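The decision table above can be encoded directly. A hypothetical classifier sketch (the category keys, the "dominant tag" rule, and the boolean elevation flag are my simplifications, not Netflix's implementation):

```python
# Map the dominant preemptor category to the table's inferred cause.
CAUSE_BY_DOMINANT_TAG = {
    "same_cgroup": "self-throttling",
    "different_container": "noisy neighbor",
    "system_service": "host-side noisy neighbor",
}

def infer_cause(runq_elevated, switch_out_counts):
    """Combine the two metrics per the decision table.

    switch_out_counts: sched.switch.out counts keyed by preemptor category.
    """
    if not runq_elevated:
        return "healthy"
    dominant = max(switch_out_counts, key=switch_out_counts.get)
    return CAUSE_BY_DOMINANT_TAG[dominant]

counts = {"same_cgroup": 900, "different_container": 50, "system_service": 10}
print(infer_cause(True, counts))  # → self-throttling
```

In practice "mostly" would be a proportion threshold over a time window rather than a simple argmax, but the tie-breaking logic is the same.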
The tagging is possible because on sched_switch the eBPF program sees both tasks' task_struct pointers: prev, the task being switched out (the victim), and next, the incoming task that is preempting it. get_task_cgroup_id(next) therefore gives the preempting cgroup, and the userspace agent categorises it against the known container map.
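That userspace categorisation step can be sketched as a lookup against the container map. The function and variable names here are illustrative; the agent's actual code is not shown in the source:

```python
def categorise_preemptor(preempt_cgroup_id, victim_cgroup_id, container_cgroup_ids):
    """Turn a raw preempting-cgroup ID into a sched.switch.out tag.

    container_cgroup_ids: cgroup IDs of all known containers on the host,
    as maintained by the agent's container map.
    """
    if preempt_cgroup_id == victim_cgroup_id:
        return "same_cgroup"          # self-throttling signature
    if preempt_cgroup_id in container_cgroup_ids:
        return "different_container"  # classic noisy neighbor
    return "system_service"           # kernel thread / host daemon

containers = {101, 102, 103}
print(categorise_preemptor(101, 101, containers))  # → same_cgroup
print(categorise_preemptor(102, 101, containers))  # → different_container
print(categorise_preemptor(7, 101, containers))    # → system_service
```

Doing the classification in userspace keeps the eBPF program small: the kernel side emits only raw cgroup IDs, and the container map can change without reloading the probe.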
Lessons for observability design¶
- A single scheduler metric is insufficient. When two distinct pathologies produce the same top-line signal, you must emit a second one that breaks the tie.
- The second metric should encode cause, not just count. A plain preemption counter would double-count both causes. The preempt-cause tag (same cgroup / different container / system service) is what carries the disambiguating information.
- Cost attribution is upstream of throttle-vs-neighbor diagnosis. Before an on-call can act, they need to know which cgroup is the source. That's the cgroup-ID-tagged metric's job.
Seen in¶
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — Netflix explicitly calls out this ambiguity as the motivation for pairing runq.latency with sched.switch.out tagged by the preempting cgroup's category. Canonical framing of the failure mode on the wiki.