
CONCEPT

CPU throttling vs noisy neighbor

Two distinct scheduler pathologies that present identically in run-queue latency and therefore cannot be told apart by that metric alone.

The two causes, same surface

For a container on a CFS-scheduled Linux host, elevated runq.latency — tasks waiting in the run queue beyond the healthy baseline — can mean either:

  1. Noisy neighbor. A different cgroup on the same host is consuming CPU cycles. This cgroup's tasks are runnable but the scheduler is giving time to someone else. The queueing delay is externally imposed.

  2. Self CPU-quota throttling. This cgroup is over its own cgroup CPU limit (cpu.max / CFS bandwidth). The scheduler throttles it; its tasks accumulate in the run queue until the next quota refill. The queueing delay is self-inflicted.

Both yield the same symptom: high runq.latency for the victim cgroup.
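One cheap way to rule self-throttling in or out before suspecting a neighbor is the cgroup's own throttling counters. The sketch below parses the cgroup v2 cpu.stat format and compares two samples; the helper names and the sample values are illustrative, not from the source.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the key/value lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def is_self_throttled(before: dict, after: dict) -> bool:
    """If nr_throttled advanced between two samples, the cgroup hit its
    own CFS bandwidth quota during the interval: the run-queue delay is
    self-inflicted, not imposed by a neighbor."""
    return after.get("nr_throttled", 0) > before.get("nr_throttled", 0)


# Two hypothetical snapshots of /sys/fs/cgroup/<cg>/cpu.stat:
before = parse_cpu_stat(
    "usage_usec 100000\nnr_periods 10\nnr_throttled 2\nthrottled_usec 5000"
)
after = parse_cpu_stat(
    "usage_usec 180000\nnr_periods 14\nnr_throttled 5\nthrottled_usec 21000"
)
print(is_self_throttled(before, after))  # True: quota throttling occurred
```

If the counter did not move during the latency spike, the quota is not the cause and the neighbor hypothesis stays live.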

"If a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)

Why the distinction matters operationally

  • Noisy neighbor → platform problem. The action is fleet-level: co-tenancy policy, CPU reservation, bin-packing, evict/migrate the offending cgroup.
  • Self-throttling → tenant problem. The action is container-level: raise the tenant's CPU limit, optimise the tenant's code, remove a runaway loop. The platform team shouldn't be paged.

Mis-attributing one as the other produces the wrong remediation path, wastes on-call time, and erodes trust in the observability stack.

Breaking the ambiguity: pair with a preemption-cause-tagged counter

The remedy is the dual-metric disambiguation pattern Netflix deployed: alongside runq.latency, emit a sched.switch.out counter tagged with the category of the preempting process:

runq.latency | sched.switch.out tag       | Inferred cause
Elevated     | Mostly same cgroup         | Self-throttling (own tasks preempt each other at quota boundary)
Elevated     | Mostly different container | Noisy neighbor (external cgroup is consuming CPU)
Elevated     | Mostly system service      | Host-side noisy neighbor (kernel thread / systemd daemon)
Baseline     | n/a                        | Healthy

The tagging is possible because on sched_switch the eBPF program sees both tasks' task_struct: prev is the task being switched out (the victim) and next is the task taking the CPU, so get_task_cgroup_id(next) yields the preempting cgroup. The userspace agent then categorises that cgroup ID against its map of known containers.
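The decision table above reduces to a small classifier over a window of preemptor tags. This is a minimal sketch of that mapping; the tag names, threshold ("dominant tag wins"), and return strings are illustrative assumptions, not Netflix's implementation.

```python
from collections import Counter


def infer_cause(runq_latency_elevated: bool, preemptor_tags: Counter) -> str:
    """Map a window of sched.switch.out preemptor-category tags to the
    inferred cause, following the decision table."""
    if not runq_latency_elevated:
        return "healthy"
    if not preemptor_tags:
        return "unknown"
    # Attribute the latency to whichever preemptor category dominates.
    dominant, _ = preemptor_tags.most_common(1)[0]
    return {
        "same_cgroup": "self-throttling",          # own tasks trade the CPU at the quota boundary
        "different_container": "noisy neighbor",   # external cgroup is consuming CPU
        "system_service": "host-side noisy neighbor",  # kernel thread / system daemon
    }.get(dominant, "unknown")


tags = Counter({"different_container": 180, "same_cgroup": 12})
print(infer_cause(True, tags))  # noisy neighbor
```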

Lessons for observability design

  1. A single scheduler metric is insufficient. When two distinct pathologies produce the same top-line signal, you must emit a second one that breaks the tie.
  2. The second metric should encode cause, not just count. A plain preemption counter would count both causes indistinguishably. The preempt-cause tag (same cgroup / different container / system service) is what carries the disambiguating information.
  3. Cost attribution is upstream of throttle-vs-neighbor diagnosis. Before an on-call can act, they need to know which cgroup is the source. That's the cgroup-ID-tagged metric's job.
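Cost attribution (point 3) amounts to keeping a per-preemptor counter keyed by cgroup ID. A minimal sketch, with hypothetical event shapes standing in for whatever the agent actually emits:

```python
from collections import defaultdict


def attribute_preemptions(events):
    """events: iterable of (victim_cgroup_id, preemptor_cgroup_id) pairs,
    one per sched_switch. Returns, per victim, how many times each
    preemptor cgroup switched it out."""
    counts = defaultdict(lambda: defaultdict(int))
    for victim, preemptor in events:
        counts[victim][preemptor] += 1
    return counts


# Illustrative cgroup IDs: 202 preempts 101 twice, 303 once.
events = [(101, 202), (101, 202), (101, 303), (404, 101)]
by_victim = attribute_preemptions(events)
top = max(by_victim[101].items(), key=lambda kv: kv[1])
print(top)  # (202, 2): cgroup 202 is the dominant source for victim 101
```

On-call can then act on the top preemptor directly instead of guessing which tenant to evict or which limit to raise.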
