
CONCEPT

CPU throttling vs noisy neighbor

Two distinct scheduler pathologies that present identically in run-queue latency and therefore cannot be told apart by that metric alone.

The two causes, same surface

For a container on a CFS-scheduled Linux host, elevated runq.latency — tasks waiting in the run queue beyond the healthy baseline — can mean either:

  1. Noisy neighbor. A different cgroup on the same host is consuming CPU cycles. This cgroup's tasks are runnable but the scheduler is giving time to someone else. The queueing delay is externally imposed.

  2. Self CPU-quota throttling. This cgroup is over its own cgroup CPU limit (cpu.max / CFS bandwidth). The scheduler throttles it; its tasks accumulate in the run queue until the next quota refill. The queueing delay is self-inflicted.

Both yield the same symptom: high runq.latency for the victim cgroup.
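One cheap way to rule self-throttling in or out before suspecting a neighbor is the cgroup's own throttling counters. The sketch below parses the cgroup v2 cpu.stat format and compares two samples; the helper names and the sample values are illustrative, not from the source.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the key/value lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def is_self_throttled(before: dict, after: dict) -> bool:
    """If nr_throttled advanced between two samples, the cgroup hit its
    own CFS bandwidth quota during the interval: the run-queue delay is
    self-inflicted, not imposed by a neighbor."""
    return after.get("nr_throttled", 0) > before.get("nr_throttled", 0)


# Two hypothetical snapshots of /sys/fs/cgroup/<cg>/cpu.stat:
before = parse_cpu_stat(
    "usage_usec 100000\nnr_periods 10\nnr_throttled 2\nthrottled_usec 5000"
)
after = parse_cpu_stat(
    "usage_usec 180000\nnr_periods 14\nnr_throttled 5\nthrottled_usec 21000"
)
print(is_self_throttled(before, after))  # True: quota throttling occurred
```

If the counter did not move during the latency spike, the quota is not the cause and the neighbor hypothesis stays live.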

"If a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)

Why the distinction matters operationally

  • Noisy neighbor → platform problem. The action is fleet-level: co-tenancy policy, CPU reservation, bin-packing, evict/migrate the offending cgroup.
  • Self-throttling → tenant problem. The action is container-level: raise the tenant's CPU limit, optimise the tenant's code, remove a runaway loop. The platform team shouldn't be paged.

Mis-attributing one as the other produces the wrong remediation path, wastes on-call time, and erodes trust in the observability stack.

Breaking the ambiguity: pair with a preemption-cause-tagged counter

The remedy is the dual-metric disambiguation pattern Netflix deployed: alongside runq.latency, emit a sched.switch.out counter tagged with the category of the preempting process:

runq.latency | sched.switch.out tag       | Inferred cause
Elevated     | Mostly same cgroup         | Self-throttling (own tasks preempt each other at quota boundary)
Elevated     | Mostly different container | Noisy neighbor (external cgroup is consuming CPU)
Elevated     | Mostly system service      | Host-side noisy neighbor (kernel thread / systemd daemon)
Baseline     | n/a                        | Healthy

The tagging is possible because on sched_switch the eBPF program sees both tasks' task_struct: prev is the task being switched out (the victim) and next is the task taking the CPU, so get_task_cgroup_id(next) yields the preempting cgroup. The userspace agent then categorises that cgroup ID against its map of known containers.
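The decision table above reduces to a small classifier over a window of preemptor tags. This is a minimal sketch of that mapping; the tag names, threshold ("dominant tag wins"), and return strings are illustrative assumptions, not Netflix's implementation.

```python
from collections import Counter


def infer_cause(runq_latency_elevated: bool, preemptor_tags: Counter) -> str:
    """Map a window of sched.switch.out preemptor-category tags to the
    inferred cause, following the decision table."""
    if not runq_latency_elevated:
        return "healthy"
    if not preemptor_tags:
        return "unknown"
    # Attribute the latency to whichever preemptor category dominates.
    dominant, _ = preemptor_tags.most_common(1)[0]
    return {
        "same_cgroup": "self-throttling",          # own tasks trade the CPU at the quota boundary
        "different_container": "noisy neighbor",   # external cgroup is consuming CPU
        "system_service": "host-side noisy neighbor",  # kernel thread / system daemon
    }.get(dominant, "unknown")


tags = Counter({"different_container": 180, "same_cgroup": 12})
print(infer_cause(True, tags))  # noisy neighbor
```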

Lessons for observability design

  1. A single scheduler metric is insufficient. When two distinct pathologies produce the same top-line signal, you must emit a second one that breaks the tie.
  2. The second metric should encode cause, not just count. A plain preemption counter would count both causes indistinguishably. The preempt-cause tag (same cgroup / different container / system service) is what carries the disambiguating information.
  3. Cost attribution is upstream of throttle-vs-neighbor diagnosis. Before an on-call can act, they need to know which cgroup is the source. That's the cgroup-ID-tagged metric's job.
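Cost attribution (point 3) amounts to keeping a per-preemptor counter keyed by cgroup ID. A minimal sketch, with hypothetical event shapes standing in for whatever the agent actually emits:

```python
from collections import defaultdict


def attribute_preemptions(events):
    """events: iterable of (victim_cgroup_id, preemptor_cgroup_id) pairs,
    one per sched_switch. Returns, per victim, how many times each
    preemptor cgroup switched it out."""
    counts = defaultdict(lambda: defaultdict(int))
    for victim, preemptor in events:
        counts[victim][preemptor] += 1
    return counts


# Illustrative cgroup IDs: 202 preempts 101 twice, 303 once.
events = [(101, 202), (101, 202), (101, 303), (404, 101)]
by_victim = attribute_preemptions(events)
top = max(by_victim[101].items(), key=lambda kv: kv[1])
print(top)  # (202, 2): cgroup 202 is the dominant source for victim 101
```

On-call can then act on the top preemptor directly instead of guessing which tenant to evict or which limit to raise.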
