Zalando Adaptive Paging¶
Adaptive Paging is Zalando's in-house alert-routing primitive: an alert handler that monitors the error rate of Critical Business Operations (CBOs) and, when a CBO alert fires, traverses the live distributed-tracing graph to identify the team closest to the root cause — paging them instead of the alert owner. A single symptom-level alerting rule fans out into per-firing paging decisions driven by trace-graph heuristics.
Origin¶
Built by the Zalando SRE team in 2019 as the first novel capability on top of the fleet-wide OpenTracing rollout. Presented at SRECon'19 EMEA by Vasco Mineiro (usenix.org/conference/srecon19emea/presentation/mineiro). First named in a blog post on 2020-10-07; the design and motivation were canonicalised in Part II of the SRE journey series (2021-09-20).
How it works¶
- Single CBO alerting rule — e.g., "checkout error rate > 0.5% for 5 minutes."
- On alert fire, pull the subset of traces matching the CBO during the alerting window.
- Traverse each trace's span graph to identify spans carrying error=true, non-2xx status, or downstream-dependency failures using OpenTracing semantic conventions.
- Score services by likelihood of being the root cause. Heuristic combinations include: deepest failed span, earliest failed timestamp, span carrying the originating DB or RPC error, etc.
- Look up the team owning the identified service via an internal service-to-team map.
- Page that team — not the alert owner, not a generic SRE rotation.
The heuristic is probabilistic, so routing can be wrong. The accepted tradeoff: imperfect routing that is roughly 70% or more accurate still vastly beats routing every symptom alert to the SRE team and forcing them to re-page.
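The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Zalando's implementation: the `Span` shape, the function names, and the specific ranking (deepest failed span first, ties broken by earliest start) are assumptions chosen to mirror two of the heuristics listed above. The real scoring combines more signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OpenTracing API)."""
    span_id: str
    parent_id: Optional[str]          # None for the root span
    service: str
    start_us: int                     # start timestamp in microseconds
    error: bool = False               # mirrors the `error` tag
    http_status: Optional[int] = None


def is_failed(span: Span) -> bool:
    """A span counts as failed if tagged error=true or it has a non-2xx status."""
    if span.error:
        return True
    return span.http_status is not None and not 200 <= span.http_status < 300


def depth(span: Span, by_id: dict[str, Span]) -> int:
    """Number of ancestors between this span and the trace root."""
    hops = 0
    seen = {span.span_id}             # guards against malformed parent cycles
    while span.parent_id in by_id and span.parent_id not in seen:
        span = by_id[span.parent_id]
        seen.add(span.span_id)
        hops += 1
    return hops


def rank_candidates(trace: list[Span]) -> list[Span]:
    """Failed spans, deepest first; ties broken by earliest start time."""
    by_id = {s.span_id: s for s in trace}
    failed = [s for s in trace if is_failed(s)]
    return sorted(failed, key=lambda s: (-depth(s, by_id), s.start_us))


def choose_team(trace: list[Span], service_to_team: dict[str, str],
                fallback_team: str) -> str:
    """Map the top-ranked failed span's service to its owning team."""
    ranked = rank_candidates(trace)
    if not ranked:
        return fallback_team          # no failure evidence in this trace
    return service_to_team.get(ranked[0].service, fallback_team)
```

The explicit `fallback_team` reflects the point that the heuristic can miss: when a trace carries no failure evidence, or the service has no owner on record, the page still needs a destination.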
Why it matters¶
Adaptive Paging is the routing primitive that makes symptom-based alerting operationally feasible at scale. Without adaptive paging, a symptom-level CBO alert has no obvious pager target — every team that touches checkout might get paged, or the SRE team gets paged and re-pages, or a static service-owner mapping misroutes when the root cause is deep in the trace.
With adaptive paging, one CBO alert produces one paging decision to the team whose service is most likely responsible, extracted from the actual trace of the actual failed requests.
Contents of a paging-target decision¶
- CBO identity: the name of the firing CBO, e.g., "checkout_v2"
- Candidate failed spans: ranked list of spans + their failure evidence
- Chosen service: the highest-ranked candidate
- Chosen team: looked up from service-to-team map
- Page context: link into the trace for the on-call to investigate
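As a rough illustration, the fields listed above could be carried in a small immutable record. The class names, field names, the `decide` helper, the numeric `score`, and the example URL used in the test are all hypothetical; the sources do not publish the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSpan:
    """One failed span plus the evidence that made it a candidate."""
    service: str
    operation: str
    evidence: tuple[str, ...]         # e.g. ("error=true", "http_status=502")
    score: float                      # higher means more likely root cause


@dataclass(frozen=True)
class PagingDecision:
    cbo: str                              # CBO identity, e.g. "checkout_v2"
    candidates: tuple[CandidateSpan, ...] # ranked, best candidate first
    chosen_team: str                      # from the service-to-team map
    trace_url: str                        # page context for the on-call

    @property
    def chosen_service(self) -> str:
        """The chosen service is always the highest-ranked candidate."""
        return self.candidates[0].service


def decide(cbo: str, candidates: list[CandidateSpan],
           service_to_team: dict[str, str], trace_url: str,
           fallback_team: str = "sre-on-call") -> PagingDecision:
    """Rank candidates by score and resolve the owning team."""
    if not candidates:
        raise ValueError("a paging decision needs at least one candidate span")
    ranked = tuple(sorted(candidates, key=lambda c: c.score, reverse=True))
    team = service_to_team.get(ranked[0].service, fallback_team)
    return PagingDecision(cbo, ranked, team, trace_url)
```

Keeping the full ranked list in the decision, rather than only the winner, gives the paged engineer the runner-up services to check if the first guess turns out to be wrong.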
Prerequisites¶
- Fleet-wide distributed tracing — OpenTracing (or OpenTelemetry) deployed on every service in scope, with consistent error-tagging semantic conventions.
- A named CBO catalogue — concepts/critical-business-operation.
- Per-request trace sampling that captures error paths — tail-sampling or error-biased sampling typically required so the routing decision has actual failed traces to look at.
- A service-to-team map — can be a registry, labels in Kubernetes, or a config file — must be kept current.
- A Tracing API to pull trace graphs by CBO + time range at alert-fire time — Zalando's is the explicit load-bearing primitive enabling Adaptive Paging. "The standardized data model [...] and an API to consume the tracing data allowed the SRE team to build additional value from it."
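The error-biased sampling prerequisite can be illustrated with a small tail-sampling decision: keep every trace that contains a failed span, and keep only a deterministic fraction of healthy ones. This is a generic sketch, not Zalando's sampler; the function name, the `healthy_keep_ratio` parameter, and the hash-based bucketing are assumptions.

```python
import hashlib


def keep_trace(trace_id: str, span_errors: list[bool],
               healthy_keep_ratio: float = 0.01) -> bool:
    """Tail-sampling decision made once the whole trace has been seen.

    span_errors holds one flag per span: True if that span failed.
    """
    if any(span_errors):
        return True                   # always keep traces with a failure
    # Hash the trace id so the decision for a given trace is stable
    # across collectors and restarts.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")        # uniform in [0, 2**64)
    return bucket < int(healthy_keep_ratio * 2**64)
```

With head-based uniform sampling alone, a low-volume error path can be dropped entirely, leaving the router with no failed traces to traverse when the alert fires.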
Evolution at Zalando¶
- 2018: Distributed Tracing (OpenTracing) rolled out fleet-wide as the SRE Program's flagship 2018 initiative; Zalando-specific Semantic Conventions published.
- 2019: Adaptive Paging becomes the first Zalando-proprietary ops primitive built on top of the tracing API. Released with the Zalando CBO taxonomy.
- 2019: Used to drive the rollout of the Symptom-Based Alerting strategy.
- SRECon'19 EMEA: public talk by Mineiro detailing the heuristics and internal implementation.
- 2020-10-07: first published externally in Koutsiaris's Cyber Week retrospective blog post, framed as a key alert-fatigue reduction.
- 2021-09-20: Part II of the SRE journey series canonicalises Adaptive Paging as one of four tracing-derived platform capabilities (alongside Throughput Calculator, SLO reporting, and Operation-Based SLOs).
- 2022-04-27: MWMBR (multi-window multi-burn-rate) upgrade canonicalised. In the Operation-Based SLOs post (sources/2022-04-27-zalando-operation-based-slos), Adaptive Paging's trigger predicate evolves from "CBO error rate > SLO threshold" (v1) to "CBO error-budget burn rate exceeds multi-window multi-burn-rate thresholds" (v2). Motivating failure mode: short-lived error spikes were paging on-call, and teams responded by adding ad-hoc fine-tuning guards (time-of-day, throughput, duration). MWMBR eliminates both the false positives and the need for fine-tuning in one change: "engineering teams required no effort to set up and manage these alerts." The routing logic (trace-graph traversal → team closest to root cause) is unchanged. Alert thresholds are now derived automatically from the SLO's error budget via the Service Level Management Tool. Dogfood numbers: false-positive (FP) rate 56% → 0%, alerts 2 → 0.14/day.
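To make the v2 trigger concrete, here is a minimal sketch of a multi-window multi-burn-rate check. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and a rule fires only when both its long and its short window exceed the threshold. The window pairs and thresholds below are the commonly cited values from the Google SRE Workbook and are assumptions here; the sources do not state which values Zalando uses. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float                  # burn-rate multiple that triggers a page


# Assumed defaults (Google SRE Workbook style), not confirmed Zalando values.
RULES = (
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
    BurnRateRule("3d", "6h", 1.0),
)


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_rate / budget


def should_page(error_rates: dict[str, float], slo_target: float,
                rules: tuple[BurnRateRule, ...] = RULES) -> bool:
    """error_rates maps a window label (e.g. "5m") to its error ratio."""
    return any(
        burn_rate(error_rates[r.long_window], slo_target) > r.threshold
        and burn_rate(error_rates[r.short_window], slo_target) > r.threshold
        for r in rules
    )
```

Requiring the long window to agree is what suppresses short-lived spikes: a five-minute burst raises the short-window burn rate but barely moves the one-hour figure, so no rule fires. Requiring the short window to agree lets the alert clear promptly once the errors stop.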
Public references¶
- Blog post (2020-10-07): Zalando Engineering — How Zalando prepares for Cyber Week.
- SRECon'19 EMEA talk (2019-10): Mineiro, Adaptive Paging: Paging the Right Team Based on Tracing Data.
- Blog post (2021-09-20): Zalando Engineering — Tracing SRE's Journey Part II — places Adaptive Paging in the context of the CBO / Symptom-Based Alerting / Operation-Based SLO stack.
- Blog post (2022-04-27): Zalando Engineering — Operation Based SLOs — canonicalises the MWMBR evolution and publishes the dogfood numbers.
Seen in¶
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — first public mention; framed as alert-fatigue reduction.
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii — canonicalises the full design: CBO-level alerting + trace-graph traversal + SRECon'19 reference; explicitly positions Adaptive Paging as the enabler for Symptom-Based Alerting.
- sources/2022-04-27-zalando-operation-based-slos — canonicalises the MWMBR upgrade. Trigger predicate evolves from raw error rate > SLO to error-budget burn-rate multi-window thresholds. Dogfood numbers (3 months, SRE department): FP rate 56% → 0%, alerts 2 → 0.14/day, 30+ cause-based alerts disabled.
Related¶
- concepts/adaptive-paging — the generalised concept.
- concepts/critical-business-operation — the alertable primitive this system consumes.
- concepts/symptom-based-alerting — the alerting-strategy this system enables.
- concepts/alert-fatigue — the problem this system is designed to reduce.
- concepts/error-budget — the primitive the 2022-era MWMBR trigger consumes.
- concepts/multi-window-multi-burn-rate — the 2022-era trigger strategy.
- systems/opentracing · systems/opentelemetry
- systems/zalando-service-level-management-tool — provides the SLO + error-budget data Adaptive Paging's MWMBR trigger consumes.
- companies/zalando