
CONCEPT Cited by 1 source

Adaptive paging

Adaptive paging is an alert-routing technique that uses live distributed-tracing causality data at alert time to page the team most likely responsible for the root cause, instead of paging the static owner of the alerting rule. A single alerting rule fans out into per-incident paging decisions driven by heuristics over the current trace graph.

Definition

Zalando describes it directly (2020):

"Distributed tracing has been a game-changer for us. We have leveraged tracing data to reduce alert fatigue of our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner." (Koutsiaris, How Zalando prepares for Cyber Week, sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week)

Three defining properties:

  1. Single alerting rule → N possible paging targets. The rule itself is symptom-level ("p99 of checkout high", "error rate on shop breached"); the paging target is computed per-firing.
  2. Trace causality is the input. The decision uses parent-child span relationships, failed-downstream signals, and OpenTracing semantic conventions (http.status_code, error, db.statement) from the spans active during the alert window.
  3. Heuristics, not rules. The paging decision is a probabilistic "most probable cause" output, not a deterministic lookup. Accepted tradeoff: imperfect routing beats on-call team fatigue from routing every alert to the same generalist team.
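
The three properties can be sketched in a few lines of Python. This is an illustrative sketch, not Zalando's implementation: the `Span` type, the "follow failed spans from the root to the deepest failed span" heuristic, the majority vote across traces, and the names `probable_cause` and `paging_target` are all assumptions made for the example. Only the tag names (`error`, `http.status_code`) come from the OpenTracing semantic conventions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, minimal span record for the sketch."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    tags: dict = field(default_factory=dict)


def is_failed(span: Span) -> bool:
    """Failure signal read from OpenTracing-style tags."""
    if span.tags.get("error") is True:
        return True
    status = span.tags.get("http.status_code")
    return isinstance(status, int) and status >= 500


def probable_cause(trace: list[Span]) -> Optional[Span]:
    """Follow failed spans from the root down; return the deepest one reached."""
    children: dict[Optional[str], list[Span]] = {}
    for span in trace:
        children.setdefault(span.parent_id, []).append(span)
    current = next((s for s in children.get(None, []) if is_failed(s)), None)
    while current is not None:
        failed_child = next(
            (c for c in children.get(current.span_id, []) if is_failed(c)), None
        )
        if failed_child is None:
            return current  # no failed child: closest span to the problem
        current = failed_child
    return None  # the root span did not fail


def paging_target(traces: list[list[Span]],
                  service_to_team: dict[str, str],
                  alert_owner: str) -> str:
    """One alert firing -> one team, by majority vote over the alert window."""
    votes: Counter = Counter()
    for trace in traces:
        cause = probable_cause(trace)
        if cause is not None and cause.service in service_to_team:
            votes[service_to_team[cause.service]] += 1
    if not votes:
        return alert_owner  # assumed fallback: the static owner of the rule
    return votes.most_common(1)[0][0]
```

The same alerting rule can therefore page different teams on different firings; when no trace yields a usable signal, this sketch falls back to the alert owner, which is an assumed design choice rather than something the source states.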

Why this matters

Classical paging routes an alert to either (a) the owner of the alerting rule (usually a platform / SRE team) or (b) a statically-mapped service-to-team table. Both fail at scale:

  • Static ownership of alert rules routes every symptom to the SRE / platform team, who then have to re-page the correct service team → re-wake pattern → burnout.
  • Static service-to-team tables don't know that this particular slow checkout is caused by an inventory-service degradation three hops downstream; they page the checkout team, which has nothing to fix.

Adaptive paging uses the live trace as the root-cause oracle: the span with the earliest / deepest / highest-cost failure signal in the trace graph is a far better predictor of the responsible team than any static mapping.
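
One way to combine the earliest / deepest / highest-cost signals is a simple lexicographic ranking of the failed spans in a trace. The sketch below assumes a hypothetical `FailedSpan` record and an arbitrary priority order (depth first, then start time, then duration); the source does not say how the real heuristics weigh these signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedSpan:
    """Hypothetical summary of one failed span in a trace."""
    service: str
    depth: int        # hops from the root span
    start_us: int     # start time in microseconds
    duration_us: int  # span duration, used here as the "cost"


def rank_candidates(candidates: list[FailedSpan]) -> list[FailedSpan]:
    """Order failed spans from most to least likely root cause.

    Assumed priority: deepest first, then earliest start, then longest duration.
    """
    return sorted(
        candidates,
        key=lambda s: (-s.depth, s.start_us, -s.duration_us),
    )
```

The first element of the ranked list is the span whose service would then be mapped to a team; a weighted score would be an equally plausible alternative to this fixed ordering.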

Prerequisites

  • Distributed tracing deployed broadly enough that a given request's failure path lands within the traced fleet. Gaps in instrumentation cause the heuristic to point at the nearest-traced team instead of the actual root cause. See Zalando's tier-gated rollout as one sequencing strategy.
  • OpenTracing or OpenTelemetry semantic conventions applied consistently. Without uniform error=true, http.status_code, and other standard tag names, the heuristic has nothing to score.
  • A service-to-team map covering all services whose spans can appear in a trace. The heuristic outputs "this span, on this service" — a lookup is still needed from there.
  • Low-enough trace-sampling bias on error paths that the trace whose causality drives paging is actually captured. Tail-sampling or error-biased sampling is typical.
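
Two of these prerequisites are easy to illustrate in code. The sketch below shows an error-biased tail-sampling decision and a team lookup with a fallback; `keep_trace`, `resolve_team`, the `base_rate` parameter, and the fallback behaviour are hypothetical names and choices for this example, not part of any tracing library or of Zalando's tooling.

```python
import random


def keep_trace(span_tags: list[dict], base_rate: float, rng: random.Random) -> bool:
    """Tail-sampling decision made once the whole trace is known.

    Traces containing any span tagged error=true are always kept;
    all other traces are kept with probability base_rate.
    """
    if any(tags.get("error") is True for tags in span_tags):
        return True
    return rng.random() < base_rate


def resolve_team(service: str, service_to_team: dict[str, str],
                 fallback_team: str) -> str:
    """Map the service picked by the heuristic to a team.

    A service missing from the map falls back to fallback_team
    (assumed here to be the static owner of the alerting rule).
    """
    return service_to_team.get(service, fallback_team)
```

With error traces always retained, the trace that drives the paging decision is not lost to sampling, and the fallback keeps a gap in the service-to-team map from leaving an alert unrouted.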

Contrast: alert routing vs root-cause analysis

Adaptive paging is on the alert path, not the post-incident analysis path. It makes a paging-time decision on imperfect information; root-cause analysis happens later with full context. The heuristics can be wrong. The point is that even an imperfect routing signal (say, 70% correct, an illustrative figure rather than a measured one) is better than routing every alert to one team.

Related approaches:

  • Correlation-based paging in APM tools: Datadog, New Relic, and Honeycomb all ship similar primitives under different names ("intelligent alerting", "anomaly correlation"). The Zalando post predates many of these.
  • LLM-assisted incident triage: 2024+ systems that post a GenAI suggestion of the responsible team. Adaptive paging is the heuristic-based ancestor of that approach; both consume the same trace + telemetry substrate.

Seen in
