CONCEPT Cited by 2 sources

Adaptive paging

Adaptive paging is an alert-routing technique that uses live distributed-tracing causality data at alert time to page the team most likely responsible for the root cause, instead of paging the static owner of the alerting rule. A single alerting rule fans out into per-incident paging decisions driven by heuristics over the current trace graph.

Definition

Zalando describes it directly (2020):

"Distributed tracing has been a game-changer for us. We have leveraged tracing data to reduce alert fatigue of our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner." (Koutsiaris, How Zalando prepares for Cyber Week)

Three defining properties:

  1. Single alerting rule → N possible paging targets. The rule itself is symptom-level ("p99 of checkout high", "error rate on shop breached"); the paging target is computed per-firing.
  2. Trace causality is the input. The decision uses parent-child span relationships, failed-downstream signals, and OpenTracing semantic conventions (http.status_code, error, db.statement) from the spans active during the alert window.
  3. Heuristics, not rules. The paging decision is a probabilistic "most probable cause" output, not a deterministic lookup. Accepted tradeoff: imperfect routing beats on-call team fatigue from routing every alert to the same generalist team.
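
The three properties above can be sketched in a few lines of Python. This is a minimal illustration, not Zalando's implementation: the Span type, the "follow the first failed child" heuristic, and the names probable_cause, paging_target, and team_of are all hypothetical. The only details taken from the text are the parent-child span relationships, the error and http.status_code tags, and the fallback to the alert owner.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not a real tracing-library type)."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    tags: dict = field(default_factory=dict)  # e.g. {"error": True}


def is_failed(span: Span) -> bool:
    """Treat a span as failed if it carries error=true or a 5xx status code."""
    if span.tags.get("error") is True:
        return True
    status = span.tags.get("http.status_code")
    return isinstance(status, int) and status >= 500


def probable_cause(spans: list[Span]) -> Optional[Span]:
    """Follow failed children down from the root; return the deepest failed span."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    roots = children.get(None, [])
    if not roots or not is_failed(roots[0]):
        return None  # no failure signal to follow
    current = roots[0]
    while True:
        failed = [c for c in children.get(current.span_id, []) if is_failed(c)]
        if not failed:
            return current
        current = failed[0]  # assumed simplification: follow the first failed child


def paging_target(spans: list[Span], team_of: dict[str, str], alert_owner: str) -> str:
    """One alerting rule, many possible targets: fall back to the alert owner."""
    cause = probable_cause(spans)
    if cause is None:
        return alert_owner
    return team_of.get(cause.service, alert_owner)


trace = [
    Span("a", None, "checkout", {"error": True}),
    Span("b", "a", "cart", {"http.status_code": 200}),
    Span("c", "a", "payment", {"error": True}),
    Span("d", "c", "inventory", {"http.status_code": 503}),
]
teams = {"checkout": "team-checkout", "payment": "team-payment",
         "inventory": "team-inventory"}
print(paging_target(trace, teams, alert_owner="team-sre"))  # -> team-inventory
```

A production heuristic would likely score several candidate spans rather than follow a single path, but the shape is the same: the alert fires once, and the target is computed from the trace captured during that firing.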

Why this matters

Classical paging routes an alert to either (a) the owner of the alerting rule (usually a platform / SRE team) or (b) a statically-mapped service-to-team table. Both fail at scale:

  • Static ownership of alert rules routes every symptom to the SRE / platform team, who then have to re-page the correct service team → re-wake pattern → burnout.
  • Static service-to-team tables don't know that this particular slow checkout is caused by an inventory-service degradation three hops downstream; they page the checkout team, which has nothing to fix.

Adaptive paging uses the live trace as the root-cause oracle: the span with the earliest / deepest / highest-cost failure signal in the trace graph is a far better predictor of the responsible team than any static mapping.

Prerequisites

  • Distributed tracing deployed broadly enough that a given request's failure path lands within the traced fleet. Gaps in instrumentation cause the heuristic to point at the nearest-traced team instead of the actual root cause. See Zalando's tier-gated rollout as one sequencing strategy.
  • OpenTracing or OpenTelemetry semantic conventions applied consistently. Without uniform use of error=true, http.status_code, and the other standard tag names, the heuristic has nothing to score.
  • A service-to-team map covering all services whose spans can appear in a trace. The heuristic outputs "this span, on this service" — a lookup is still needed from there.
  • Low-enough trace-sampling bias on error paths that the trace whose causality drives paging is actually captured. Tail-sampling or error-biased sampling is typical.
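
The sampling prerequisite can be illustrated with a small error-biased tail-sampling decision. This is a sketch under assumptions: the keep_trace function, the dict-shaped spans, and the 1% baseline rate are hypothetical and do not correspond to any specific tracing backend's API. The idea is only that the keep/drop decision is made after the whole trace is collected, so traces on an error path are always retained.

```python
import random


def keep_trace(spans: list[dict], baseline_rate: float = 0.01,
               rng=random.random) -> bool:
    """Tail-sampling decision, made once every span of the trace is available."""
    for span in spans:
        tags = span.get("tags", {})
        status = tags.get("http.status_code")
        if tags.get("error") is True or (isinstance(status, int) and status >= 500):
            return True  # always keep traces that contain a failure signal
    # Healthy traces are kept only at the (assumed) low baseline rate.
    return rng() < baseline_rate


failing = [{"tags": {"http.status_code": 200}}, {"tags": {"error": True}}]
print(keep_trace(failing))  # -> True
```

Head-based sampling decides before the outcome is known, so it can drop exactly the trace the paging heuristic needs; that is why the bias toward error paths matters here.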

Contrast: alert routing vs root-cause analysis

Adaptive paging is on the alert path, not the post-incident analysis path. It makes a paging-time decision on imperfect information; root-cause analysis happens later with full context. The heuristics can be wrong — the point is that even a 70%-correct routing signal is dramatically better than routing every alert to one team.

  • Correlation-based paging in APM tools: Datadog, New Relic, and Honeycomb all ship similar primitives under different names ("intelligent alerting", "anomaly correlation"). The Zalando post predates many of these.
  • LLM-assisted incident triage: systems from 2024 onward that post a GenAI suggestion of the responsible team. Adaptive paging is the deterministic-heuristic ancestor of that approach; both consume the same trace + telemetry substrate.

Seen in

  • — canonical named description. Used by Zalando as a key reduction in on-call alert fatigue during Cyber Week.
  • — expands Adaptive Paging's design: monitors Critical Business Operation error rate, traverses trace graph to identify root-cause service, routes page to owning team. Positions Adaptive Paging as the routing primitive that enables Symptom-Based Alerting — one CBO alert → per-firing paging decision via trace-graph traversal. Presented at SRECon'19 EMEA (Mineiro). Canonical system page: systems/zalando-adaptive-paging.
  • sources/2022-04-27-zalando-operation-based-slos: canonicalises the multi-window, multi-burn-rate (MWMBR) upgrade. Adaptive Paging's trigger predicate evolved from "CBO error rate > SLO threshold" to "CBO error-budget burn rate exceeds MWMBR thresholds", following the Google SRE Workbook strategy. This eliminated false positives from short-lived spikes and the need for per-alert fine-tuning (time-of-day gates, throughput guards). The routing logic (trace-graph traversal → team closest to root cause) is unchanged; only the trigger became MWMBR. See concepts/multi-window-multi-burn-rate for the threshold strategy. Dogfood numbers (3-month SRE-department trial): false-positive rate 56% → 0%, alerts 2 → 0.14 per day.
  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii: narrates why the MWMBR upgrade was needed. The first iteration used the SLO target as the paging threshold; false positives on short-lived spikes forced engineers to add per-alert hacks (time-of-day, throughput, and error-rate-length criteria), re-creating the tuning burden the SLO was meant to eliminate. MWMBR replaced SLO-as-threshold; window lengths and thresholds are now derived automatically from the SLO. See patterns/slo-derived-alert-rule-generation.
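
The MWMBR trigger described in the entries above can be sketched as follows. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m) are the paging values the Google SRE Workbook gives for a 30-day SLO period; whether Zalando uses exactly these values is an assumption, and the function names and the error_ratio_by_window input are hypothetical.

```python
# (long window, short window, burn-rate threshold) for paging, as given in the
# Google SRE Workbook for a 30-day SLO period. Assumed here, not confirmed
# as Zalando's configuration.
PAGE_WINDOWS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(error_ratio_by_window: dict[str, float], slo_target: float) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    for long_w, short_w, threshold in PAGE_WINDOWS:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo_target)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo_target)
        if long_burn > threshold and short_burn > threshold:
            return True
    return False


# A short-lived spike: the 5m window burns fast, but the 1h window does not.
spike = {"5m": 0.05, "1h": 0.005, "30m": 0.01, "6h": 0.001}
# A sustained problem: both the 5m and the 1h window burn about 20x budget.
sustained = {"5m": 0.02, "1h": 0.02, "30m": 0.02, "6h": 0.004}

print(should_page(spike, slo_target=0.999))      # -> False
print(should_page(sustained, slo_target=0.999))  # -> True
```

Because the thresholds are expressed as burn rates relative to the error budget, the same rule applies to any SLO target without per-alert tuning, which is the property the MWMBR upgrade was adopted for.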