CONCEPT Cited by 1 source
Adaptive paging¶
Adaptive paging is an alert-routing technique that uses live distributed-tracing causality data at alert time to page the team most likely responsible for the root cause, instead of paging the static owner of the alerting rule. A single alerting rule fans out into per-incident paging decisions driven by heuristics over the current trace graph.
Definition¶
Zalando describes it directly (2020):
"Distributed tracing has been a game-changer for us. We have leveraged tracing data to reduce alert fatigue of our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner." — Koutsiaris, How Zalando prepares for Cyber Week (sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week).
Three defining properties:
- Single alerting rule → N possible paging targets. The rule itself is symptom-level ("p99 of checkout high", "error rate on shop breached"); the paging target is computed per-firing.
- Trace causality is the input. The decision uses parent-child span relationships, failed-downstream signals, and OpenTracing semantic conventions (`http.status_code`, `error`, `db.statement`) from the spans active during the alert window.
- Heuristics, not rules. The paging decision is a probabilistic "most probable cause" output, not a deterministic lookup. Accepted tradeoff: imperfect routing beats on-call team fatigue from routing every alert to the same generalist team.
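The three properties above can be illustrated with a minimal sketch. It is not Zalando's implementation: the `Span` shape, the `probable_cause` and `page_target` helpers, and the "deepest failed span with no failed children" heuristic are all assumptions chosen for illustration. Only the tag names `error` and `http.status_code` come from the OpenTracing semantic conventions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span shape; real tracers carry far more fields.
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: dict = field(default_factory=dict)


def is_failed(span: Span) -> bool:
    """Failure signal based on the `error` and `http.status_code` tags."""
    if span.tags.get("error") is True:
        return True
    status = span.tags.get("http.status_code")
    return isinstance(status, int) and status >= 500


def depth(span: Span, by_id: dict) -> int:
    """Number of ancestors present in the trace (assumes an acyclic trace)."""
    hops = 0
    while span.parent_id in by_id:
        span = by_id[span.parent_id]
        hops += 1
    return hops


def probable_cause(spans: list) -> Optional[Span]:
    """Assumed heuristic: the deepest failed span that has no failed child."""
    by_id = {s.span_id: s for s in spans}
    failed = [s for s in spans if is_failed(s)]
    parents_of_failed = {s.parent_id for s in failed}
    frontier = [s for s in failed if s.span_id not in parents_of_failed]
    if not frontier:
        return None
    return max(frontier, key=lambda s: depth(s, by_id))


def page_target(spans: list, team_of: dict, alert_owner: str) -> str:
    """One alerting rule, many possible targets; fall back to the rule owner."""
    cause = probable_cause(spans)
    if cause is None:
        return alert_owner
    return team_of.get(cause.service, alert_owner)
```

Falling back to the alert owner when no failed span is found, or when the service has no team mapping, keeps the behaviour no worse than static routing in the cases where the trace offers no signal.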
Why this matters¶
Classical paging routes an alert to either (a) the owner of the alerting rule (usually a platform / SRE team) or (b) a statically-mapped service-to-team table. Both fail at scale:
- Static ownership of alert rules routes every symptom to the SRE / platform team, who then have to re-page the correct service team → re-wake pattern → burnout.
- Static service-to-team tables don't know that this particular slow checkout is caused by an inventory-service degradation three hops downstream; they page the checkout team, which has nothing to fix.
Adaptive paging uses the live trace as the root-cause oracle: the span with the earliest / deepest / highest-cost failure signal in the trace graph is a far better predictor of the responsible team than any static mapping.
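One way to combine the "earliest / deepest / highest-cost" signals is a weighted score over the failed spans of a trace. The sketch below is an assumption-laden illustration: the `FailedSpan` fields, the `WEIGHTS` values, and the min-max normalisation are hypothetical choices, not anything the source specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedSpan:
    # Hypothetical summary of one failed span in the trace.
    service: str
    failed_at_ms: float  # when the failure was recorded, relative to trace start
    depth: int           # hops from the root span
    duration_ms: float   # used here as a stand-in for "cost"


# Illustrative weights only; a real system would tune these against incidents.
WEIGHTS = {"earliest": 0.3, "deepest": 0.5, "costliest": 0.2}


def _normalise(values, invert=False):
    """Min-max scale to [0, 1]; `invert` makes smaller raw values score higher."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def rank_candidates(failed):
    """Return (score, span) pairs, most probable cause first."""
    if not failed:
        return []
    early = _normalise([s.failed_at_ms for s in failed], invert=True)
    deep = _normalise([s.depth for s in failed])
    cost = _normalise([s.duration_ms for s in failed])
    scored = [
        (WEIGHTS["earliest"] * e + WEIGHTS["deepest"] * d + WEIGHTS["costliest"] * c, s)
        for e, d, c, s in zip(early, deep, cost, failed)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

The top-ranked span's service would then be resolved to a team. Returning the full ranking rather than a single winner leaves room for escalating to the second candidate if the first team rejects the page.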
Prerequisites¶
- Distributed tracing deployed broadly enough that a given request's failure path lands within the traced fleet. Gaps in instrumentation cause the heuristic to point at the nearest-traced team instead of the actual root cause. See Zalando's tier-gated rollout as one sequencing strategy.
- OpenTracing or OpenTelemetry semantic conventions applied consistently. Without uniform `error=true`, `http.status_code`, and standard tag names, the heuristic has nothing to score.
- A service-to-team map covering all services whose spans can appear in a trace. The heuristic outputs "this span, on this service"; a lookup is still needed from there.
- Low-enough trace-sampling bias on error paths that the trace whose causality drives paging is actually captured. Tail-sampling or error-biased sampling is typical.
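The error-biased sampling prerequisite can be sketched as a tail-sampling decision made once a trace is complete. This is a minimal illustration under stated assumptions: `keep_trace`, its `baseline_rate` parameter, and the representation of a trace as a list of tag dictionaries are hypothetical, not the API of any particular tracer or collector.

```python
import random


def keep_trace(span_tags, baseline_rate=0.01, rng=random.random):
    """Decide whether to retain a completed trace.

    span_tags: one tag dict per span in the trace (hypothetical shape).
    Any trace containing a span tagged `error=true` is always kept, so the
    causality needed at paging time is not sampled away; healthy traces are
    kept with probability `baseline_rate`.
    """
    if any(tags.get("error") is True for tags in span_tags):
        return True
    return rng() < baseline_rate
```

Passing `rng` as a parameter is only a convenience for deterministic testing; the important design point is that the keep decision looks at the whole trace, which head-based sampling cannot do.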
Contrast: alert routing vs root-cause analysis¶
Adaptive paging is on the alert path, not the post-incident analysis path. It makes a paging-time decision on imperfect information; root-cause analysis happens later with full context. The heuristics can be wrong — the point is that even a 70%-correct routing signal is dramatically better than routing every alert to one team.
Related mechanisms not named in the source¶
- Correlation-based paging in APM tools — Datadog, New Relic, and Honeycomb all ship similar primitives under different names ("intelligent alerting", "anomaly correlation"). The Zalando post predates many of these.
- LLM-assisted incident triage — systems from 2024 onward that post a GenAI suggestion of the responsible team. Adaptive paging is the hand-written-heuristic ancestor of that approach; both consume the same trace and telemetry substrate.
Seen in¶
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — canonical named description. Used by Zalando to reduce on-call alert fatigue during Cyber Week.
Related¶
- concepts/alert-fatigue — the problem adaptive paging is engineered to solve.
- concepts/observability
- systems/opentracing · systems/opentelemetry
- concepts/traffic-source-tagging-in-traces