
PATTERN

Dual-metric disambiguation

When a single top-line metric is elevated by two different underlying pathologies, emit a second cause-tagged metric whose values uniquely identify each pathology. Pair-wise interpretation of the two metrics distinguishes the root cause; neither alone can.

Shape

       ┌──────────────────────────────────────┐
       │  Headline metric M₁                   │
       │  rises under BOTH pathology A and B   │
       │  — cannot distinguish them alone      │
       └─────────────────┬────────────────────┘
                         │ pair with
       ┌──────────────────────────────────────┐
       │  Second metric M₂                     │
       │  tagged by cause dimension:           │
       │      tag=a → only elevated under A    │
       │      tag=b → only elevated under B    │
       └──────────────────────────────────────┘

    decision:  (M₁ high) AND (M₂[tag=b] high)  →  cause is B
               (M₁ high) AND (M₂[tag=a] high)  →  cause is A
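The decision rule can be sketched in a few lines. This is an illustrative sketch, not an implementation from the source; the function name and the tag labels `a`/`b` are hypothetical.

```python
def diagnose(m1_high, m2_by_tag):
    """Pair-wise interpretation of headline metric M1 and a
    cause-tagged second metric M2 (tags 'a'/'b' are illustrative).
    Returns the inferred cause, or None if M1 is not elevated."""
    if not m1_high:
        return None
    # Whichever cause-tag dominates M2 identifies the pathology,
    # since each tag is elevated under exactly one cause.
    dominant = max(m2_by_tag, key=m2_by_tag.get)
    return {"a": "cause A", "b": "cause B"}[dominant]
```

For example, `diagnose(True, {"a": 5, "b": 120})` attributes the elevation of M₁ to cause B; neither input alone could.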

Canonical instance — Netflix noisy-neighbor detection

The run queue latency metric runq.latency rises under two distinct causes:

  1. Noisy neighbor — a different cgroup consumes CPU, preempting this container's tasks.
  2. Self CPU-quota throttling: this cgroup is over its CFS CPU limit, so the scheduler throttles it and the run queue grows.

Netflix could not act on runq.latency alone: the signal is identical for both causes, yet the remediations are very different (platform-level vs tenant-level). So they emit a second metric, sched.switch.out, a counter of preemption events tagged with the category of the preempting cgroup:

Tag value            Meaning
same_cgroup          this container's own task preempted this container's task (quota boundary, internal contention)
different_container  a different container's task preempted this one
system_service       a kernel thread or host systemd service preempted this one

Pair-wise interpretation (see table in concepts/cpu-throttling-vs-noisy-neighbor):

  • High runq.latency + sched.switch.out[same_cgroup] dominates → self-throttling, platform team is not the owner.
  • High runq.latency + sched.switch.out[different_container] or [system_service] dominates → noisy neighbor confirmed; actionable at the platform level.
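A minimal sketch of that interpretation, assuming counter snapshots per tag. The metric names and tag values follow the post; the dominance threshold and function name are illustrative assumptions, not Netflix's implementation.

```python
def classify(runq_latency_high, switch_out_by_tag):
    """Diagnose from runq.latency plus sched.switch.out
    sliced by the preempting-cgroup tag."""
    if not runq_latency_high:
        return "healthy"
    total = sum(switch_out_by_tag.values()) or 1
    same = switch_out_by_tag.get("same_cgroup", 0) / total
    if same > 0.5:                  # own tasks preempt each other
        return "self-throttling"    # tenant-level remediation
    # different_container / system_service preemptions dominate
    return "noisy-neighbor"         # platform-level remediation
```

Note the threshold only decides which tag class "dominates"; the structural point is that the same runq.latency spike routes to different owners depending on M₂'s tag breakdown.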

"It's important to highlight that both the runq.latency metric and the sched.switch.out metrics are needed to determine if a container is affected by noisy neighbors... simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)

Why this is more than just "emit more metrics"

The two metrics are not independent. The second metric's tag values are chosen from the failure-mode taxonomy of the first metric: each tag value corresponds to a distinct cause of M₁'s elevation. A plain preemption-count metric without cause tagging would not disambiguate — both cases preempt — only a preemption metric sliced by the preempting actor's identity has the disambiguating information.

The pattern is therefore: design M₂'s tag dimension by enumerating the causal hypotheses that elevate M₁, then pick a tag value per hypothesis that is elevated only under that hypothesis.
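One way to sanity-check such a tag design, under illustrative names: write down which tags each causal hypothesis elevates, and verify no two hypotheses share a signature.

```python
# Illustrative design check: each causal hypothesis that can elevate
# M1 must map to a tag-elevation signature no other hypothesis shares.
elevates = {  # hypothesis -> set of M2 tags elevated under it
    "noisy_neighbor":  {"different_container", "system_service"},
    "self_throttling": {"same_cgroup"},
}

signatures = [frozenset(s) for s in elevates.values()]
# Distinct signatures mean pair-wise interpretation can disambiguate.
assert len(signatures) == len(set(signatures)), \
    "two hypotheses share a signature: M2's tags cannot disambiguate"
```

If two hypotheses collapse onto the same signature, the tag dimension needs a finer slice (as with a plain preemption count, which both causes elevate).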

Alternatives and why they lose

  • Thresholds + alerting rules on M₁ alone. No amount of threshold engineering on runq.latency will distinguish the two pathologies; they produce identical signals. The problem is structural, not tuning.
  • Application-level latency. Tells you there's a problem but not at which layer; still can't tell noisy-neighbor from self-throttling because the application just sees slow CPU.
  • A proxy metric (CPU steal, cgroup throttled_time). Works as a partial fix for the self-throttling side — cpu.stat exposes throttling counts directly — but you still need a separate signal for cross-cgroup preemption, and pairing it consistently with runq.latency is exactly the dual-metric pattern.
  • Per-cause bespoke dashboards. Encodes the same pair-wise interpretation socially instead of structurally. Doesn't scale to new on-calls; the pair should emit as a pair.

Generalisation axis

Wherever a single metric covers multiple failure modes, the same disambiguation design applies:

Headline metric        Causes that elevate it                          Disambiguating second metric
runq.latency           noisy neighbor / self-throttling                sched.switch.out tagged by preempting cgroup class
p99 request latency    backend slow / queue depth / GC pause           per-span breakdown + GC-pause counter
disk p99 read latency  noisy neighbor on shared media / own I/O burst  per-tenant IO-depth gauge + device-busy tag
connection pool wait   app slow / pool undersized                      pool-acquire-time + pool-in-use gauge
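The connection-pool row plays out the same way. A sketch under illustrative names and thresholds (none of these are from the source):

```python
def diagnose_pool(acquire_time_high, in_use, pool_size):
    """Pair pool-acquire-time (M1) with the pool-in-use gauge (M2)."""
    if not acquire_time_high:
        return "healthy"
    if in_use >= pool_size:        # pool saturated: structural fix
        return "pool undersized"
    return "app slow"              # pool has headroom: look upstream
```

As in the runq.latency case, the gauge's value under each hypothesis is what routes the same headline symptom to different remediations.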
