Zalando — Operation Based SLOs¶
Summary¶
João Oliveirinha (Zalando SRE, 2022-04-27) publishes the technical deep-dive companion to the 2021-09-20 Tracing SRE's Journey — Part II retrospective. Part II announced Zalando's pivot away from service-based SLOs toward Operation-Based SLOs; this post is the full operational write-up — how they are defined, what tooling evaluates them, how they're embedded in the development process, and how the SRE team itself dogfooded the framework for 3 months to prove it. The article frames SLOs with the formula SLO = Symptom + Target, so a correctly chosen CBO is simultaneously the symptom-level alert and the SLO, making the symptom-based alerting strategy trivially derivable from the SLO. Zalando ships the framework inside a new Service Level Management tool (systems/zalando-service-level-management-tool) — the operation-based successor to the earlier DX-scoped SLO Reporting Tool — and evolves Adaptive Paging to trigger on multi-window multi-burn-rate error-budget thresholds rather than a raw error rate, eliminating both short-lived-spike false positives and the need for per-alert fine-tuning. The measured dogfood result (SRE department, Q4 2021, 3 months): false-positive rate 56% → 0%, 2 → 0.14 alerts/day, 30+ alerts disabled, zero user-facing incidents missed. The decisive governance change is top-down ownership: each CBO's SLO is signed off by a senior manager (Director / VP) owning the customer experience the CBO realises, not by the team operating any single component service — which gives the symptom-based alert threshold political backing to survive when individual teams push back.
Key takeaways¶
- SLO = Symptom + Target is the load-bearing formula. Zalando derives its whole ops-alerting stack from one identity: "if we capture high level signals (or symptoms) that represented customer interactions [...] and couple that with an SLO, we get our alert threshold implicitly." A CBO is the symptom; its SLO target is the alert threshold. This closes the loop between CBOs, Operation-Based SLOs, and Symptom-Based Alerting: three concepts that Zalando deliberately co-designs as a single stack (Source: sources/2022-04-27-zalando-operation-based-slos).
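The "alert threshold falls out of the SLO implicitly" identity can be sketched in a few lines. This is an illustrative reading of the formula, not Zalando's actual implementation; the function names and the 99.9% target are assumptions for the example:

```python
# Sketch of SLO = Symptom + Target: the symptom-alert threshold for a CBO is
# not tuned by hand, it is simply the error rate the SLO target tolerates.
# (Illustrative only; not Zalando's actual API or thresholds.)

def alert_threshold(slo_target: float) -> float:
    """The implied symptom-alert threshold: the error rate the SLO tolerates."""
    return 1.0 - slo_target

def should_alert(errors: int, total: int, slo_target: float) -> bool:
    """Fire when the CBO's observed error rate exceeds what the SLO allows."""
    if total == 0:
        return False
    return errors / total > alert_threshold(slo_target)

# A hypothetical "Place Order" CBO with a 99.9% availability SLO: the alert
# threshold is implicitly a 0.1% error rate, with no per-alert fine-tuning.
print(should_alert(errors=5, total=1000, slo_target=0.999))  # → True (0.5% > 0.1%)
```

The point the article makes is exactly this coupling: pick the CBO (symptom) and the target, and the alert threshold is derived, not negotiated.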
- Service-Based SLOs failed at three distinct scales — all documented. Zalando ran service-based SLOs from 2016 with the SLO Reporting tool (and later the Tier-classified version from 2018). The three failure modes are named verbatim: (a) high number of microservices → high number of SLOs to monitor, review, fine tune; (b) mapping microservice SLOs to products and their expectations — SLOs easily conflict with each other when products share services; (c) SLOs on a fine grained level made it challenging for management to align on them — "management support beyond the team level is difficult to get", while team-level adoption requires costly cross-team alignment. These are the generalisable dead-ends other orgs hit; worth quoting in any SLO-design doc (Source: this article).
- The list of User Functions (later renamed CBOs) came from Cyber-Week load-testing work, not SLO work. The same list of business-critical operations Zalando's load-testers had been generating ordered-by-revenue-impact for Cyber Week prep became the seed for (a) which operations to instrument with Distributed Tracing first; and (b) which operations became the first CBOs. The article's explicit framing: "the criticality argument was also valid to guide our instrumentation efforts." Canonicalises the pattern of Cyber-Week being the capability-forcing function for downstream SRE primitives.
- Graceful degradation exposes the limits of HTTP-status-based availability SLIs. The article walks through the "first fallback / second fallback" example: a response that was successful from the HTTP perspective (200 OK, the second fallback returned a reduced-quality response) can still be a conceptual SLO failure. "Even though the response was successful from the client's perspective, we still count it as an error." The pivot: availability SLOs stop being 5xx-rate and start being OpenTracing error-tag-rate — a transport-agnostic signal set by application code whenever it took the poor-quality fallback. Crucial enabler for operation-based SLOs that cross protocol boundaries (Source: this article).
- The Adaptive Paging alert handler uses the trace causality graph at alert time to route the page. "When this alert handler is triggered, it reads the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem." Solves the "if the symptom-alert fires on 15-service checkout, who do we page?" gap without either (a) overloading one team with every symptom, or (b) adding a triage hop through SRE. The canonical Zalando phrasing: "by taking Adaptive Paging, and having it monitor an edge operation, we achieved a viable and sustainable implementation of Symptom Based Alerting."
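The transport-agnostic error-tag SLI from the graceful-degradation bullet above can be sketched as follows. The `Span` class is a minimal stand-in for an OpenTracing span, and the fetch helpers, catalogue, and cache are hypothetical; only the convention of tagging the span as an error on the degraded path comes from the article:

```python
# Sketch: application code sets the OpenTracing-style `error` tag when it
# serves the reduced-quality second fallback, even though the HTTP response
# is a 200 OK. All names and data here are illustrative stand-ins.

CATALOG = {"sku-1": {"id": "sku-1", "details": "full"}}  # hypothetical primary store
CACHE = {}                                                # hypothetical first fallback

class Span:
    """Minimal stand-in for an OpenTracing span; only tag-setting is modelled."""
    def __init__(self):
        self.tags = {}

    def set_tag(self, key, value):
        self.tags[key] = value

def fetch_full_details(product_id):
    if product_id not in CATALOG:
        raise LookupError(product_id)
    return CATALOG[product_id]

def fetch_cached_details(product_id):
    if product_id not in CACHE:
        raise LookupError(product_id)
    return CACHE[product_id]

def view_product_details(span, product_id):
    """Serve the operation; mark the span as an error only on the degraded path."""
    try:
        return fetch_full_details(product_id)        # primary path
    except LookupError:
        try:
            return fetch_cached_details(product_id)  # first fallback: full quality
        except LookupError:
            # Second fallback: reduced-quality response. HTTP-wise this is
            # still a 200, but for the availability SLI it counts as an error.
            span.set_tag("error", True)
            return {"id": product_id, "details": None}
```

In a real service the span would come from the tracer, and `set_tag("error", True)` is the OpenTracing semantic-conventions error tag; the SLI then counts error-tagged spans rather than 5xx responses, which is what lets it cross protocol boundaries.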
- CBO SLO ownership is top-down by Director/VP, not bottom-up by team. Zalando explicitly inverted the SLO-ownership model: a CBO's SLO is owned by the senior manager responsible for the customer-experience domain (e.g., a senior manager owning Checkout + Sales Orders for the "Place Order" CBO) — who has budget authority over the teams implementing the operation. "This also ensured the SLO had management support." Without this the symptom-alert is politically unstable: teams push back when paged on "not our service" symptoms. Top-down ownership converts the CBO-level alert from an operational nuisance into a reliability directive with executive air cover.
- Multi-Window Multi-Burn-Rate alerting eliminated fine-tuning and short-spike false positives in one change. The initial CBO rollout produced complaints about sensitivity — short-lived error spikes paged on-call, leading to ad-hoc workarounds (time-of-day guards, throughput gates, duration thresholds — classic false-positive spaghetti). Zalando adopted the Google SRE Workbook's MWMBR strategy: fire on error-budget burn rate across multiple time windows simultaneously (fast-burn catches outages; slow-burn catches silent regression). Net outcome: "engineering teams required no effort to set up and manage these alerts" — because the threshold is derived purely from the SLO's error budget and standard multi-window fractions, not team intuition.
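The MWMBR condition can be sketched as below. The post does not disclose Zalando's actual (window, burn-rate) pairs, so this uses the well-known example pairs from the Google SRE Workbook (14.4x over 1h/5m, 6x over 6h/30m); everything else is illustrative:

```python
# Sketch of multi-window multi-burn-rate paging for a 99.9% availability SLO.
# Burn rate = observed error rate divided by the error rate the budget allows.
# (Window/burn-rate pairs are the SRE Workbook's examples, not Zalando's.)

ERROR_BUDGET = 1 - 0.999  # 99.9% SLO => 0.1% of requests may fail

def burn_rate(error_rate: float, budget: float = ERROR_BUDGET) -> float:
    """How many multiples of the allowed error rate we are currently burning."""
    return error_rate / budget

def should_page(rate_1h, rate_5m, rate_6h, rate_30m):
    """Fast burn (1h + 5m) catches outages; slow burn (6h + 30m) catches
    quieter regressions. The short window in each pair confirms the burn is
    still ongoing, which suppresses short-lived spikes that already ended."""
    fast = burn_rate(rate_1h) >= 14.4 and burn_rate(rate_5m) >= 14.4
    slow = burn_rate(rate_6h) >= 6 and burn_rate(rate_30m) >= 6
    return fast or slow

# A 2% error rate sustained over the last hour AND the last 5 minutes: page.
print(should_page(0.02, 0.02, 0.004, 0.004))  # → True (fast-burn fires)
# The same spike after it has subsided (5m and 30m windows clean): no page.
print(should_page(0.02, 0.0, 0.004, 0.0))     # → False
```

This is why the article can claim zero per-alert tuning: once the SLO target is set, every number in the condition is either the error budget or a standard multi-window fraction.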
- The dogfood result is the headline number. For 3 months Zalando's SRE department applied the Operation-Based SLO framework to its own services (CBOs like "Ingest Metrics", "Query Traces"). Measured before/after in a weekly operational review that curated which cause-based alerts could be safely retired: false-positive rate 56% → 0%, alert workload 2 → 0.14 alerts/day, 30+ cause-based alerts disabled, zero user-facing incidents missed. On-call became so quiet they kept up Wheel-of-Misfortune sessions to preserve on-call muscle memory. Proves the framework's claim numerically — critical because earlier rhetorical justifications had failed to get teams to disable their cause-based alerts (Source: this article).
- Longevity argument: operation names outlive service architectures. "View Product Details is something that has always existed in the company's history, but as a feature it has gone through different services and architectures implementing it." A service-based SLO evaporates at every re-architecture; an operation-based SLO survives. Concrete argument for preferring operations as the persistent SLO axis even when the microservice landscape churns.
- Impact communication is easier at operation altitude. "50% error rate in Service Foo is not easily translatable to customer or business impact, without deep understanding of the service landscape. A 50% error rate on 'Add to cart' is much clearer to communicate and derive urgency." Operation-based numbers are directly legible to product, finance, and executive leadership. Underappreciated organisational benefit of the pivot — reliability discussions no longer require translating service metrics through an ops glossary.
Systems named¶
- systems/zalando-service-level-management-tool — the new operation-based SLO tool built to succeed the 2018 SLO Reporting Tool. Tracks CBO-level SLOs, displays error-budget consumption across multiple 28-day windows, drives Adaptive Paging's MWMBR thresholds. First named canonically in this post; screenshot shown with operation names "Place Order" / "View Catalog" rather than service names.
- systems/zalando-adaptive-paging — evolved from the 2019 first-generation (fire when error rate > SLO) to the 2022-era MWMBR version (fire when error-budget burn rate exceeds multi-window thresholds). Still routes via trace-graph traversal to the team closest to the root cause.
- systems/zalando-slo-reporting-tool — the earlier service-based tool, DX-scoped since 2018. This post is where Zalando explicitly canonicalises the supersession: the new tool's operation-based model replaces the old one's service + Tier-classification model.
- systems/opentracing — named again as the substrate whose error tag enables transport-agnostic SLI measurement; semantic conventions make availability SLIs survive the move beyond 5xx-counting.
- systems/google-sre-book — explicitly cited as the source of both the SLI/SLO primitive and the MWMBR alerting strategy; Zalando's adoption is presented as a faithful instantiation of the Google SRE philosophy, not a novel invention.
Concepts named¶
- Operation-Based SLO — the post's central subject. This article is the canonical technical deep-dive for the concept on the wiki (the 2021-09 Part II only announces the pivot).
- Critical Business Operation (CBO) — renamed from internal "User Functions" to encompass more-than-strictly-user operations; the alertable unit the whole stack is built on.
- Symptom-Based Alerting — positioned as *derivable* from CBO + SLO via SLO = Symptom + Target, not as a separate policy pick.
- Adaptive Paging — the routing primitive whose MWMBR evolution this post canonicalises.
- Service Tier classification — the 2018 structure that pre-dated and coexisted with operation-based SLOs; post explains how the Tier axis is subsumed by operation-altitude SLOs.
- SLO + [[concepts/service-level-indicator|SLI]] — Google SRE-book primitives Zalando inherits; the article's contribution is the shift from service-keyed to operation-keyed SLIs.
- Error Budget — canonicalised by this post as the load-bearing alert-threshold-driver once MWMBR replaces raw-rate thresholds. First canonical Zalando wiki instance.
- Multi-Window Multi-Burn-Rate (MWMBR) alerting — the Google SRE Workbook alerting strategy Zalando adopted in ~2021 to eliminate short-spike false positives and per-alert fine-tuning. First canonical wiki instance.
- Alert fatigue — the problem adaptive-paging-plus-MWMBR is targeted at. Dogfood numbers are a direct before/after measurement.
- Graceful degradation — the "first fallback / second fallback" scenario justifies transport-agnostic SLIs via OpenTracing tags. First canonical wiki instance as a named concept.
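The error-budget bookkeeping these concepts rest on (and which the Service Level Management tool displays over 28-day windows) can be sketched with simple arithmetic. This is an illustration of the standard error-budget model, not the tool's actual implementation:

```python
# Sketch of error-budget accounting over the standard 28-day window.
# The budget is the failure volume (or downtime) the SLO target tolerates;
# MWMBR thresholds are then expressed as multiples of its burn rate.

WINDOW_DAYS = 28

def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """How many failed requests the SLO tolerates across the window."""
    return (1 - slo_target) * total_requests

def error_budget_minutes(slo_target: float, window_days: int = WINDOW_DAYS) -> float:
    """The same budget expressed as tolerable full-outage minutes."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_consumed(failed: int, slo_target: float, total: int) -> float:
    """Fraction of the window's error budget already burned (1.0 = exhausted)."""
    return failed / error_budget_requests(slo_target, total)

# Hypothetical numbers: a 99.9% SLO over 10M requests tolerates 10,000
# failures (about 40.3 minutes of full outage in 28 days); 2,500 failures
# so far means roughly a quarter of the budget is gone.
print(budget_consumed(2_500, 0.999, 10_000_000))  # ≈ 0.25
```

Once budgets are framed this way, "alert threshold" stops being a raw error rate and becomes "how fast is the budget being consumed", which is the shift the MWMBR bullet above describes.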
Patterns named¶
- patterns/dogfood-as-adoption-proof — the 3-month SRE-department dogfood is not just validation; it's the adoption lever that finally got other teams to disable their cause-based alerts. Named pattern: when a new ops framework stalls because teams won't trust it, the owning team dogfoods, measures, and publishes the before/after numbers as the adoption lever of last resort. First canonical wiki instance.
- patterns/unified-sre-team-over-federated — the single-team structure that's the organisational prerequisite for doing a multi-quarter dogfood at all. Federated SRE cannot coordinate a cross-department adoption campaign with dogfood measurements.
Operational numbers¶
- Dogfood duration: 3 months (SRE department Q4 2021).
- False-positive rate: 56% → 0% over the trial.
- Alert workload: 2 → 0.14 alerts/day (≈93% reduction).
- Alerts disabled: 30+ cause-based alerts retired across SRE's own services.
- Alerts missed during trial: 0 user-facing incidents.
- Operation-Based SLO screenshot: Zalando's new Service Level Management tool displays error-budget over three 28-day windows (standard Google SRE-book window).
- Service-landscape scale (Zalando 2022): >4,000 microservices (from Part II; still the frame of reference).
- SRE team size (2019 post-merger, still current in 2022): 7 people in the unified DF SRE team.
Caveats¶
- Latency SLOs are still unreleased. Zalando explicitly calls this out: "Right now, CBOs only set Availability targets. We also want CBO owners to define latency targets." Held off because adaptive paging's current algorithm cannot route latency alerts without burdening the edge team. Open problem.
- Event-based systems are not covered. The whole MWMBR + Adaptive Paging stack presumes RPC causality (request→span tree). Event-driven paths lose the causality property, so "the loss of the causality property reduces the usefulness of our Adaptive Paging algorithm." Major unaddressed gap for the non-synchronous parts of Zalando's stack (e.g. Nakadi consumers).
- Non-edge customer operations are not covered. CBOs are still restricted to edge operations. Some customer-relevant operations live deep in call chains; adding them naively would balloon the CBO catalogue. "Well defined criteria needs to be in place to properly identify and onboard these operations." No such criteria yet published.
- No numbers on CBO taxonomy size. How many CBOs does Zalando have after 2+ years of rollout? Authoring + retiring process not disclosed. (Same gap as Part II.)
- SRE-department dogfood was small. The trial was inside the same org that built the framework — motivated dogfood, not independent validation. The post itself flags the wider rollout is still "not done yet": other departments still run cause-based alerts.
- MWMBR thresholds are not reproduced. The post links the Google SRE Workbook chapter but does not reproduce the specific (window, burn_rate) pairs Zalando uses. Treated as a well-known strategy, but the exact thresholds chosen at Zalando are left implicit.
- Service-based SLO tool is not turned off. Even teams that adopt CBOs still run the older SLO Reporting Tool and cause-based alerts in parallel. "Even teams that did adopt CBOs, weren't disabling their cause based alerts." The dogfood numbers come from convincing those teams. So "superseded" is aspirational in the tooling layer.
- Top-down ownership's political costs are not discussed. The article treats VP/Director sign-off as straightforward; in practice this is a governance intervention that requires SRE-program political capital to negotiate. No disclosure of how many VP conversations it took or what pushback looked like.
- No Adaptive Paging heuristic disclosures. Same as Part II: the trace-graph-traversal heuristics for "team closest to problem" remain opaque. The SRECon'19 EMEA Mineiro talk is the authoritative reference; the blog posts don't reproduce it.
- Post is framed as Zalando-specific, not prescriptive. The closing paragraph explicitly declines to claim operation- based is universally better: "operation based SLOs and service based SLOs as different implementations of SLOs. Depending on your organization, and/or architecture, one implementation or the other may work better for you." Fairly rare humility in an SRE-advocacy post.
Source¶
- Original: https://engineering.zalando.com/posts/2022/04/operation-based-slos.html
- Raw markdown: raw/zalando/2022-04-27-operation-based-slos-2eb6ecda.md
Related¶
- companies/zalando
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii — the retrospective this post deep-dives.
- sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — where the list-of-User-Functions originated.
- concepts/operation-based-slo · concepts/critical-business-operation · concepts/symptom-based-alerting · concepts/adaptive-paging · concepts/service-tier-classification · concepts/service-level-objective · concepts/service-level-indicator · concepts/error-budget · concepts/multi-window-multi-burn-rate · concepts/alert-fatigue · concepts/graceful-degradation
- systems/zalando-service-level-management-tool · systems/zalando-adaptive-paging · systems/zalando-slo-reporting-tool · systems/opentracing · systems/google-sre-book
- patterns/dogfood-as-adoption-proof · patterns/unified-sre-team-over-federated