PATTERN Cited by 1 source
Dual-metric disambiguation¶
When a single top-line metric is elevated by two different underlying pathologies, emit a second cause-tagged metric whose values uniquely identify each pathology. Pair-wise interpretation of the two metrics distinguishes the root cause; neither alone can.
Shape¶
┌──────────────────────────────────────┐
│ Headline metric M₁                   │
│ rises under BOTH pathology A and B   │
│ cannot distinguish them alone        │
└─────────────────┬────────────────────┘
                  │ pair with
                  ▼
┌──────────────────────────────────────┐
│ Second metric M₂                     │
│ tagged by cause dimension:           │
│   tag=a → only elevated under A      │
│   tag=b → only elevated under B      │
└──────────────────────────────────────┘
decision: (M₁ high) AND (M₂[tag=b] high) → cause is B
          (M₁ high) AND (M₂[tag=a] high) → cause is A
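The decision rule above can be sketched as a small function. This is a minimal illustration, not part of the source: the name `diagnose`, the threshold parameters, and the choice to return `None` when zero or several tags are elevated are all assumptions.

```python
from typing import Mapping, Optional


def diagnose(
    m1: float,
    m1_threshold: float,
    m2_by_tag: Mapping[str, float],
    m2_threshold: float,
) -> Optional[str]:
    """Return the tag that names the cause, or None if no single cause stands out.

    Hypothetical helper: thresholds and the "exactly one elevated tag" rule
    are assumptions made for illustration.
    """
    if m1 <= m1_threshold:
        return None  # headline metric is not elevated; nothing to explain
    elevated = [tag for tag, value in m2_by_tag.items() if value > m2_threshold]
    # Exactly one elevated tag identifies the cause; zero or several is ambiguous.
    return elevated[0] if len(elevated) == 1 else None


if __name__ == "__main__":
    print(diagnose(10.0, 5.0, {"a": 1.0, "b": 9.0}, 5.0))
```

Returning `None` for the ambiguous cases keeps the rule honest: M₁ alone never produces a diagnosis.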
Canonical instance — Netflix noisy-neighbor detection¶
The run queue latency metric runq.latency rises under two
distinct causes:
- Noisy neighbor — a different cgroup consumes CPU, preempting this container's tasks.
- Self CPU-quota throttling — this cgroup is over its CFS CPU limit, the scheduler throttles it, queue grows.
Netflix could not act on runq.latency alone: the signal looks
identical under both causes, yet the two causes call for very different
remediations (platform-level vs tenant-level). So they emit a second
metric, sched.switch.out, a counter of preemption events tagged with
the category of the preempting cgroup:
| Tag value | Meaning |
|---|---|
| same_cgroup | this container's own task preempted this container's task (quota boundary, internal contention) |
| different_container | a different container's task preempted this one |
| system_service | a kernel thread or host systemd service preempted this one |
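The tagging in the table can be sketched as a classifier over one preemption event. This is an illustrative assumption about how the tag might be derived, not Netflix's implementation: the function names, the string cgroup identifiers, and the rule that any cgroup not known to be a container counts as system_service are all hypothetical.

```python
from collections import Counter
from typing import AbstractSet, Iterable, Tuple


def preemption_tag(
    preempted_cgroup: str,
    preempting_cgroup: str,
    container_cgroups: AbstractSet[str],
) -> str:
    """Map one preemption event to a sched.switch.out tag value."""
    if preempting_cgroup == preempted_cgroup:
        return "same_cgroup"
    if preempting_cgroup in container_cgroups:
        return "different_container"
    # Assumption: anything that is not a known container cgroup is a
    # kernel thread or host service.
    return "system_service"


def count_switch_outs(
    events: Iterable[Tuple[str, str]],
    container_cgroups: AbstractSet[str],
) -> Counter:
    """Aggregate (preempted, preempting) cgroup pairs into a per-tag counter."""
    return Counter(
        preemption_tag(preempted, preempting, container_cgroups)
        for preempted, preempting in events
    )
```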
Pair-wise interpretation (see table in concepts/cpu-throttling-vs-noisy-neighbor):
- High runq.latency + sched.switch.out[same_cgroup] dominates → self-throttling, platform team is not the owner.
- High runq.latency + sched.switch.out[different_container] or [system_service] dominates → noisy neighbor confirmed; actionable at the platform.

"It's important to highlight that both the runq.latency metric and the sched.switch.out metrics are needed to determine if a container is affected by noisy neighbors... simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
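The pair-wise reading can be sketched in a few lines. The source does not define "dominates" numerically, so this sketch assumes it means the larger preemption count when same_cgroup is compared against different_container and system_service combined; the function name and return labels are hypothetical.

```python
from typing import Mapping


def interpret(runq_latency_high: bool, switch_out: Mapping[str, int]) -> str:
    """Combine the headline signal with tagged preemption counts."""
    if not runq_latency_high:
        return "healthy"
    own = switch_out.get("same_cgroup", 0)
    external = (
        switch_out.get("different_container", 0)
        + switch_out.get("system_service", 0)
    )
    if own > external:
        return "self_throttling"  # tenant-level remediation
    if external > own:
        return "noisy_neighbor"  # platform-level remediation
    return "inconclusive"  # tie, or no preemptions recorded
```

A production rule would likely also require the spikes to coincide in time, as the quoted passage stresses; that is omitted here for brevity.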
Why this is more than just "emit more metrics"¶
The two metrics are not independent. The second metric's tag values are chosen from the failure-mode taxonomy of the first metric: each tag value corresponds to a distinct cause of M₁'s elevation. A plain preemption-count metric without cause tagging would not disambiguate, because both pathologies involve preemption. Only a preemption metric sliced by the preempting actor's identity carries the disambiguating information.
The pattern is therefore: design M₂'s tag dimension by enumerating the causal hypotheses that elevate M₁, then pick a tag value per hypothesis that is elevated only under that hypothesis.
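One way to sanity-check a candidate tag dimension is to write down, per hypothesis, which tag values are expected to rise, and confirm that every hypothesis has at least one tag value that rises under it alone. The sketch below is an assumed design aid, not something the source describes; `unique_tags`, `disambiguates`, and the example signature maps are hypothetical.

```python
from typing import Dict, Mapping, Set


def unique_tags(signatures: Mapping[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each hypothesis, the tag values elevated under it and no other."""
    result: Dict[str, Set[str]] = {}
    for hypothesis, tags in signatures.items():
        others: Set[str] = set()
        for other, other_tags in signatures.items():
            if other != hypothesis:
                others |= other_tags
        result[hypothesis] = set(tags) - others
    return result


def disambiguates(signatures: Mapping[str, Set[str]]) -> bool:
    """True if every hypothesis owns at least one tag value exclusively."""
    return all(unique_tags(signatures).values())


# An untagged preemption count rises under both causes, so it cannot separate them.
untagged = {"noisy_neighbor": {"preempted"}, "self_throttling": {"preempted"}}

# Tagging by the preempting actor gives each cause its own tag values.
tagged = {
    "noisy_neighbor": {"different_container", "system_service"},
    "self_throttling": {"same_cgroup"},
}

if __name__ == "__main__":
    print(disambiguates(untagged), disambiguates(tagged))
```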
Alternatives and why they lose¶
- Thresholds + alerting rules on M₁ alone. No amount of threshold engineering on runq.latency will distinguish the two pathologies; they produce identical signals. The problem is structural, not a matter of tuning.
- Application-level latency. Tells you there's a problem but not at which layer; still can't tell noisy-neighbor from self-throttling because the application just sees slow CPU.
- A proxy metric (CPU steal, cgroup throttled_time). Works as a partial fix for the self-throttling side (cpu.stat exposes throttling counts directly), but you still need a separate signal for cross-cgroup preemption, and pairing it consistently with runq.latency is exactly the dual-metric pattern.
- Per-cause bespoke dashboards. Encodes the same pair-wise interpretation socially instead of structurally. Doesn't scale to new on-calls; the pair should emit as a pair.
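To illustrate the proxy-metric alternative, the sketch below parses the text of a cgroup cpu.stat file and computes the share of CFS periods in which the cgroup was throttled between two samples. It relies on the nr_periods and nr_throttled counters that the kernel reports when CPU bandwidth limits are in use; the helper names and the sampling approach are assumptions. It covers only the self-throttling side and says nothing about cross-cgroup preemption.

```python
from typing import Dict, Mapping


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(before: Mapping[str, int], after: Mapping[str, int]) -> float:
    """Share of enforcement periods between two samples that were throttled."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


if __name__ == "__main__":
    # Sample values are made up for illustration.
    first = parse_cpu_stat("nr_periods 100\nnr_throttled 10\nthrottled_time 5000\n")
    second = parse_cpu_stat("nr_periods 200\nnr_throttled 60\nthrottled_time 9000\n")
    print(throttled_fraction(first, second))
```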
Generalisation axis¶
Wherever a single metric covers multiple failure modes, the same disambiguation design applies:
| Headline metric | Causes that elevate it | Disambiguating second metric |
|---|---|---|
| runq.latency | noisy neighbor / self-throttling | sched.switch.out tagged by preempting cgroup class |
| p99 request latency | backend slow / queue depth / GC pause | per-span breakdown + GC-pause counter |
| disk p99 read latency | noisy neighbor on shared media / own I/O burst | per-tenant IO-depth gauge + device-busy tag |
| connection pool wait | app slow / pool undersized | pool-acquire-time + pool-in-use gauge |
Seen in¶
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf (canonical instance): runq.latency paired with preempt-cause-tagged sched.switch.out to distinguish cross-cgroup noisy neighbor from self-imposed CPU-quota throttling.