
PATTERN

Dual-metric disambiguation

When a single top-line metric is elevated by two different underlying pathologies, emit a second cause-tagged metric whose values uniquely identify each pathology. Pair-wise interpretation of the two metrics distinguishes the root cause; neither alone can.

Shape

       ┌──────────────────────────────────────┐
       │  Headline metric M₁                   │
       │  rises under BOTH pathology A and B   │
       │  — cannot distinguish them alone      │
       └─────────────────┬────────────────────┘
                         │ pair with
       ┌──────────────────────────────────────┐
       │  Second metric M₂                     │
       │  tagged by cause dimension:           │
       │      tag=a → only elevated under A    │
       │      tag=b → only elevated under B    │
       └──────────────────────────────────────┘

    decision:  (M₁ high) AND (M₂[tag=b] high)  →  cause is B
               (M₁ high) AND (M₂[tag=a] high)  →  cause is A
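The decision rule can be sketched in a few lines. This is an illustrative sketch, not an implementation from the source; the function name and the tag labels `a`/`b` are hypothetical.

```python
def diagnose(m1_high, m2_by_tag):
    """Pair-wise interpretation of headline metric M1 and a
    cause-tagged second metric M2 (tags 'a'/'b' are illustrative).
    Returns the inferred cause, or None if M1 is not elevated."""
    if not m1_high:
        return None
    # Whichever cause-tag dominates M2 identifies the pathology,
    # since each tag is elevated under exactly one cause.
    dominant = max(m2_by_tag, key=m2_by_tag.get)
    return {"a": "cause A", "b": "cause B"}[dominant]
```

For example, `diagnose(True, {"a": 5, "b": 120})` attributes the elevation of M₁ to cause B; neither input alone could.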

Canonical instance — Netflix noisy-neighbor detection

The run queue latency metric runq.latency rises under two distinct causes:

  1. Noisy neighbor — a different cgroup consumes CPU, preempting this container's tasks.
  2. Self CPU-quota throttling: this cgroup is over its CFS CPU limit, so the scheduler throttles it and the run queue grows.

Netflix could not act on runq.latency alone: the signal is identical for both causes, yet the remediations are very different (platform-level vs tenant-level). So they emit a second metric, sched.switch.out, a counter of preemption events tagged with the category of the preempting cgroup:

Tag value            Meaning
same_cgroup          this container's own task preempted this container's task (quota boundary, internal contention)
different_container  a different container's task preempted this one
system_service       a kernel thread or host systemd service preempted this one

Pair-wise interpretation (see table in concepts/cpu-throttling-vs-noisy-neighbor):

  • High runq.latency + sched.switch.out[same_cgroup] dominates → self-throttling, platform team is not the owner.
  • High runq.latency + sched.switch.out[different_container] or [system_service] dominates → noisy neighbor confirmed; actionable at the platform level.
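A minimal sketch of that interpretation, assuming counter snapshots per tag. The metric names and tag values follow the post; the dominance threshold and function name are illustrative assumptions, not Netflix's implementation.

```python
def classify(runq_latency_high, switch_out_by_tag):
    """Diagnose from runq.latency plus sched.switch.out
    sliced by the preempting-cgroup tag."""
    if not runq_latency_high:
        return "healthy"
    total = sum(switch_out_by_tag.values()) or 1
    same = switch_out_by_tag.get("same_cgroup", 0) / total
    if same > 0.5:                  # own tasks preempt each other
        return "self-throttling"    # tenant-level remediation
    # different_container / system_service preemptions dominate
    return "noisy-neighbor"         # platform-level remediation
```

Note the threshold only decides which tag class "dominates"; the structural point is that the same runq.latency spike routes to different owners depending on M₂'s tag breakdown.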

"It's important to highlight that both the runq.latency metric and the sched.switch.out metrics are needed to determine if a container is affected by noisy neighbors... simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)

Why this is more than just "emit more metrics"

The two metrics are not independent. The second metric's tag values are chosen from the failure-mode taxonomy of the first metric: each tag value corresponds to a distinct cause of M₁'s elevation. A plain preemption-count metric without cause tagging would not disambiguate — both cases preempt — only a preemption metric sliced by the preempting actor's identity has the disambiguating information.

The pattern is therefore: design M₂'s tag dimension by enumerating the causal hypotheses that elevate M₁, then pick a tag value per hypothesis that is elevated only under that hypothesis.
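One way to sanity-check such a tag design, under illustrative names: write down which tags each causal hypothesis elevates, and verify no two hypotheses share a signature.

```python
# Illustrative design check: each causal hypothesis that can elevate
# M1 must map to a tag-elevation signature no other hypothesis shares.
elevates = {  # hypothesis -> set of M2 tags elevated under it
    "noisy_neighbor":  {"different_container", "system_service"},
    "self_throttling": {"same_cgroup"},
}

signatures = [frozenset(s) for s in elevates.values()]
# Distinct signatures mean pair-wise interpretation can disambiguate.
assert len(signatures) == len(set(signatures)), \
    "two hypotheses share a signature: M2's tags cannot disambiguate"
```

If two hypotheses collapse onto the same signature, the tag dimension needs a finer slice (as with a plain preemption count, which both causes elevate).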

Alternatives and why they lose

  • Thresholds + alerting rules on M₁ alone. No amount of threshold engineering on runq.latency will distinguish the two pathologies; they produce identical signals. The problem is structural, not tuning.
  • Application-level latency. Tells you there's a problem but not at which layer; still can't tell noisy-neighbor from self-throttling because the application just sees slow CPU.
  • A proxy metric (CPU steal, cgroup throttled_time). Works as a partial fix for the self-throttling side — cpu.stat exposes throttling counts directly — but you still need a separate signal for cross-cgroup preemption, and pairing it consistently with runq.latency is exactly the dual-metric pattern.
  • Per-cause bespoke dashboards. Encodes the same pair-wise interpretation socially instead of structurally. Doesn't scale to new on-calls; the pair should emit as a pair.

Generalisation axis

Wherever a single metric covers multiple failure modes, the same disambiguation design applies:

Headline metric        Causes that elevate it                          Disambiguating second metric
runq.latency           noisy neighbor / self-throttling                sched.switch.out tagged by preempting cgroup class
p99 request latency    backend slow / queue depth / GC pause           per-span breakdown + GC-pause counter
disk p99 read latency  noisy neighbor on shared media / own I/O burst  per-tenant IO-depth gauge + device-busy tag
connection pool wait   app slow / pool undersized                      pool-acquire-time + pool-in-use gauge
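The connection-pool row plays out the same way. A sketch under illustrative names and thresholds (none of these are from the source):

```python
def diagnose_pool(acquire_time_high, in_use, pool_size):
    """Pair pool-acquire-time (M1) with the pool-in-use gauge (M2)."""
    if not acquire_time_high:
        return "healthy"
    if in_use >= pool_size:        # pool saturated: structural fix
        return "pool undersized"
    return "app slow"              # pool has headroom: look upstream
```

As in the runq.latency case, the gauge's value under each hypothesis is what routes the same headline symptom to different remediations.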
