
SYSTEM Cited by 3 sources

Zalando Adaptive Paging

Adaptive Paging is Zalando's in-house alert-routing primitive: an alert handler that monitors the error rate of Critical Business Operations (CBOs) and, when a CBO alert fires, traverses the live distributed-tracing graph to identify the team closest to the root cause — paging them instead of the alert owner. A single symptom-level alerting rule fans out into per-firing paging decisions driven by trace-graph heuristics.

Origin

Built by the Zalando SRE team in 2019 as the first novel capability on top of the fleet-wide OpenTracing rollout. Presented at SRECon'19 EMEA by Vasco Mineiro (usenix.org/conference/srecon19emea/presentation/mineiro). First named publicly in a blog post on 2020-10-07; the design and motivation were canonicalised in Part II of the SRE journey series, published 2021-09-20.

How it works

  1. Single CBO alerting rule — e.g., "checkout error rate > 0.5% for 5 minutes."
  2. On alert fire, pull the subset of traces matching the CBO during the alerting window.
  3. Traverse each trace's span graph to identify spans carrying error=true, non-2xx status, or downstream-dependency failures using OpenTracing semantic conventions.
  4. Score services by likely-root-cause — heuristic combinations include: deepest failed span, earliest failed timestamp, span carrying originating DB or RPC error, etc.
  5. Look up the team owning the identified service via an internal service-to-team map.
  6. Page that team — not the alert owner, not a generic SRE rotation.

The heuristic is probabilistic — routing can be wrong. Accepted tradeoff: imperfect routing that's ~70%+ accurate vastly beats routing every symptom-alert to the SRE team and forcing them to re-page.
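
Steps 3 to 6 above can be sketched roughly as follows. This is a minimal illustration, not Zalando's implementation: the `Span` shape, the `blame_service` and `paging_target` helpers, the service and team names, and the fallback rotation are all hypothetical. It uses only one of the heuristics listed above (deepest failed span, with ties broken by earliest start time) and lets each failed trace cast one vote.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_us: int             # start timestamp in microseconds
    error: bool = False       # stands in for the OpenTracing `error` tag


def depth(span: Span, by_id: Dict[str, Span]) -> int:
    """Count the ancestors between this span and the trace root."""
    d = 0
    while span.parent_id is not None and span.parent_id in by_id:
        span = by_id[span.parent_id]
        d += 1
    return d


def blame_service(traces: List[List[Span]]) -> Optional[str]:
    """Return the service blamed by the most traces, or None if no span failed."""
    votes: Counter = Counter()
    for spans in traces:
        by_id = {s.span_id: s for s in spans}
        failed = [s for s in spans if s.error]
        if not failed:
            continue
        # Deepest failed span wins; among equals, the earliest start wins.
        culprit = max(failed, key=lambda s: (depth(s, by_id), -s.start_us))
        votes[culprit.service] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical service-to-team map (step 5).
SERVICE_TO_TEAM = {"payment-db": "team-payments", "checkout": "team-checkout"}


def paging_target(traces: List[List[Span]], fallback_team: str = "sre-on-call") -> str:
    """Map the blamed service to its team; the fallback is an assumption."""
    service = blame_service(traces)
    return SERVICE_TO_TEAM.get(service, fallback_team)
```

When an error propagates up the call chain, every span on the path is marked failed, so preferring the deepest failed span pushes the page toward the service where the failure started rather than the one where it was observed.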

Why it matters

Adaptive Paging is the routing primitive that makes symptom-based alerting operationally feasible at scale. Without Adaptive Paging, a symptom-level CBO alert has no obvious pager target: every team that touches checkout might get paged, or the SRE team gets paged and re-pages, or a static service-owner mapping misroutes when the root cause is deep in the trace.

With Adaptive Paging, one CBO alert produces one paging decision, addressed to the team whose service is most likely responsible and derived from the traces of the requests that actually failed.

Contents of a paging-target decision

  • CBO identity: e.g., "checkout_v2"
  • Candidate failed spans: ranked list of spans + their failure evidence
  • Chosen service: the highest-ranked candidate
  • Chosen team: looked up from service-to-team map
  • Page context: link into the trace for the on-call to investigate
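
The fields above map naturally onto a small record type. The sketch below is one possible shape, assuming hypothetical names (`Candidate`, `PagingDecision`, `build_decision`) and a placeholder trace URL; the source does not describe the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    service: str
    span_id: str
    evidence: str   # e.g. "error=true" or "HTTP 503"
    score: float    # higher means more likely root cause


@dataclass(frozen=True)
class PagingDecision:
    cbo: str                            # CBO identity
    candidates: Tuple[Candidate, ...]   # ranked, best first
    chosen_service: str                 # highest-ranked candidate
    chosen_team: str                    # from the service-to-team map
    trace_url: str                      # page context for the on-call


def build_decision(
    cbo: str,
    candidates: Iterable[Candidate],
    service_to_team: Dict[str, str],
    trace_url: str,
) -> PagingDecision:
    """Rank candidates by score and resolve the owning team."""
    ranked = tuple(sorted(candidates, key=lambda c: c.score, reverse=True))
    if not ranked:
        raise ValueError("no failed spans to route on")
    top = ranked[0]
    return PagingDecision(cbo, ranked, top.service, service_to_team[top.service], trace_url)
```

Keeping the full ranked candidate list in the record, rather than only the winner, gives the on-call engineer the runner-up services to check when the first guess turns out to be wrong.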

Prerequisites

  • Fleet-wide distributed tracing: OpenTracing (or OpenTelemetry) deployed on every service in scope, with consistent error-tagging semantic conventions.
  • A named CBO catalogue (concepts/critical-business-operation).
  • Per-request trace sampling that captures error paths — tail-sampling or error-biased sampling typically required so the routing decision has actual failed traces to look at.
  • A service-to-team map (a registry, Kubernetes labels, or a config file) that is kept current.
  • A Tracing API to pull trace graphs by CBO and time range at alert-fire time. Zalando's Tracing API is the load-bearing primitive that enables Adaptive Paging: "The standardized data model [...] and an API to consume the tracing data allowed the SRE team to build additional value from it."
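
The sampling prerequisite in the list above can be illustrated with a tail-sampling decision that always keeps traces containing an error and keeps only a small deterministic fraction of the rest. This is a generic sketch, not Zalando's sampler; `keep_trace` and the default `baseline_rate` are assumptions.

```python
import zlib


def keep_trace(trace_id: str, has_error: bool, baseline_rate: float = 0.01) -> bool:
    """Decide, after a trace completes, whether to retain it.

    Error traces are always kept so that routing has evidence to work with.
    Healthy traces are kept at `baseline_rate`, using a hash of the trace ID
    so that the decision is stable across collectors.
    """
    if has_error:
        return True
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < baseline_rate * 10_000
```

With plain head-based sampling at 1%, a CBO failing at a 0.5% error rate would leave very few failed traces in the alerting window, which is why error-biased retention matters here.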

Evolution at Zalando

  • 2018: Distributed Tracing (OpenTracing) rolled out fleet-wide as the SRE Program's flagship initiative for the year; Zalando-specific Semantic Conventions published.
  • 2019: Adaptive Paging built as the first Zalando-proprietary ops primitive on top of the tracing API. Released with the Zalando CBO taxonomy.
  • 2019: Used to drive the rollout of the Symptom-Based Alerting strategy.
  • SRECon'19 EMEA: public talk by Mineiro detailing the heuristics and internal implementation.
  • 2020-10-07: first published externally in Koutsiaris's Cyber Week retrospective blog post, framed as a key alert-fatigue reduction.
  • 2021-09-20: Part II of the SRE journey series canonicalises Adaptive Paging as one of four tracing-derived platform capabilities (alongside Throughput Calculator, SLO reporting, and Operation-Based SLOs).
  • 2022-04-27: multi-window multi-burn-rate (MWMBR) upgrade canonicalised. In the Operation-Based SLOs post (sources/2022-04-27-zalando-operation-based-slos), Adaptive Paging's trigger predicate evolves from "CBO error rate > SLO threshold" (v1) to "CBO error-budget burn rate exceeds multi-window multi-burn-rate thresholds" (v2). The motivating failure mode: short-lived error spikes were paging on-call, and teams were adding ad-hoc fine-tuning guards (time of day, throughput, duration) to suppress them. MWMBR removes both the false positives and the need for fine-tuning in one change: "engineering teams required no effort to set up and manage these alerts." The routing logic (trace-graph traversal to find the team closest to the root cause) is unchanged. Alert thresholds are now derived automatically from the SLO's error budget via the Service Level Management Tool. Dogfood numbers: false-positive rate 56% → 0%, alerts 2 → 0.14 per day.
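
The v2 trigger in the 2022-04-27 entry can be sketched as a multi-window multi-burn-rate check. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a page fires only when both a long window and its paired short window exceed the same threshold. The window pairs and thresholds below are the commonly cited examples from the Google SRE Workbook, used here as assumptions; the source does not state Zalando's values, and `burn_rate` and `should_page` are hypothetical names.

```python
from typing import Dict

# (long window, short window, burn-rate threshold): assumed example values.
WINDOW_PAIRS = [
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(error_ratio_by_window: Dict[str, float], slo_target: float) -> bool:
    """Page only if a long window and its short window both burn too fast."""
    for long_w, short_w, threshold in WINDOW_PAIRS:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo_target)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return True
    return False
```

The long window filters out short-lived spikes (the false positives described above), while the short window lets the alert stop firing soon after the error rate recovers.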
