CONCEPT Cited by 2 sources

Operation-based SLO

An Operation-Based SLO is a Service Level Objective defined per business operation / user journey (e.g., "checkout success rate ≥ 99.9%", "search latency p99 ≤ 500ms") rather than per individual service. It is a direct consequence of combining a Critical Business Operation (CBO) taxonomy with fleet-wide distributed tracing: once the root span of a user-facing operation carries success/failure attribution across N services, its SLO can be measured directly, without combining per-service availabilities.

Definition

"Finally, through our use of Distributed Tracing, and Adaptive Paging, we made a significant change in our SLO strategy. We moved away from service based SLOs, and started rolling out Operation based SLOs."

Three defining properties:

  1. Scope is the user-facing operation, not the service — "checkout", "item view", "add to cart", "sign in", each with its own error-budget and target.
  2. SLI is measured at the root span of the traced request, so a failure anywhere in the compound trace that reaches the root span counts as a CBO failure, even when it originates in a downstream service considered non-critical.
  3. Ownership is the operation's orchestrating team, not a single service team, though fault routing uses adaptive paging to reach the service actually responsible.

The service-based SLO failure mode

Service-based SLOs are the SRE-book default but collapse at microservice scale:

  • Compound availability gap. If a user journey touches 15 services at 99.9% each, and every one of them must succeed, compound availability is 0.999^15 ≈ 98.5%, roughly 15 times the failure rate any single SLO suggests. Users experience a 1.5% failure rate while every dashboard shows green.
  • Fan-out attribution. A per-service SLO gets "credit" for being available even when its response made the upstream fail (e.g., returns stale data, takes too long, returns 200-wrapping-500). Service X can be 99.99% per its SLO and simultaneously responsible for the CBO's 1% failure rate.
  • Dependency hell for fault budgeting. If a compound user journey's error budget is consumed by service X's slow response, neither the upstream service's SLO nor X's own SLO flags it, so the organisation's error-budget framework fails to attribute the fault or guide the fix.
  • Reliability theater. Teams optimise their own SLO without regard to the user journey. A green per-service SLO dashboard is compatible with a user revolt.
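
The compound-availability arithmetic in the first bullet can be checked with a short sketch. It assumes the journey needs every service to succeed and that failures are independent, a simplification the source does not state; the function names are illustrative.

```python
def compound_availability(per_service: float, n_services: int) -> float:
    """Availability of a journey that needs all n services to succeed.

    Assumption: failures are independent and every service is on the
    critical path of the journey.
    """
    return per_service ** n_services


def required_per_service(journey_target: float, n_services: int) -> float:
    """Per-service availability needed to reach a journey-level target."""
    return journey_target ** (1 / n_services)


journey = compound_availability(0.999, 15)
print(f"journey availability: {journey:.2%}")      # about 98.5%
print(f"journey failure rate: {1 - journey:.2%}")  # about 1.5%
print(f"per-service need for a 99.9% journey: {required_per_service(0.999, 15):.4%}")
```

Under these assumptions, each of the 15 services would need roughly 99.993% availability for the journey itself to reach 99.9%, which is why a wall of green per-service dashboards says little about the user's experience.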

How operation-based SLOs fix it

  • The SLI is "was the user's journey successful end-to-end?" measured at the root span. All compound failures surface.
  • The SLO target is set at the user-experience level (99.9% checkout availability) rather than inherited arbitrarily from per-service targets.
  • Error-budget depletion triggers an org-wide response, not just one from the team whose service happens to sit at the tail of the trace.
  • Service-tier classification + operation-based SLOs together give an org the SLO portfolio it needs: the Tier-1 CBOs (checkout, payments, sign-in) have tight SLOs; Tier-3 CBOs have looser ones; individual services no longer need bespoke SLOs.
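
A minimal sketch of the root-span SLI described in this list, assuming each traced request has already been reduced to one record per root span. The `RootSpan` type, its fields, and the sample data are hypothetical; a real pipeline would read them from the tracing backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootSpan:
    operation: str  # hypothetical CBO name, e.g. "checkout"
    error: bool     # True when the root span carries the error tag


def operation_sli(spans: list[RootSpan], operation: str) -> float:
    """Fraction of successful root spans for one operation."""
    relevant = [s for s in spans if s.operation == operation]
    if not relevant:
        raise ValueError(f"no root spans recorded for {operation!r}")
    successes = sum(1 for s in relevant if not s.error)
    return successes / len(relevant)


def meets_slo(sli: float, target: float) -> bool:
    """The SLO holds when the measured SLI is at or above the target."""
    return sli >= target


# Illustrative data: 1,000 checkout journeys, 2 of which failed end-to-end.
spans = (
    [RootSpan("checkout", error=False)] * 998
    + [RootSpan("checkout", error=True)] * 2
    + [RootSpan("search", error=False)] * 500
)
sli = operation_sli(spans, "checkout")
print(f"checkout SLI: {sli:.3%}, meets 99.9% target: {meets_slo(sli, 0.999)}")
```

The measurement never looks at individual services: only the outcome recorded on the root span matters, so any downstream failure that breaks the journey lowers the SLI.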

Prerequisites

  • Distributed tracing deployed on the CBO's full path, with error and status attribution on the root span (OpenTracing / OpenTelemetry with canonical semantic conventions).
  • CBO taxonomy — named catalogue of user-facing operations with clear failure definitions. See concepts/critical-business-operation.
  • Service Tier classification — to prioritise which CBOs get SLOs and what the targets should be.
  • Error-budget policy agreed org-wide — feature velocity vs reliability tradeoffs only work if someone enforces the budget across teams touching the operation.
  • Multi-service ownership model: a single service team owning an operation's SLO doesn't work when the operation spans 15 teams; typically a platform / reliability team owns the SLO and uses adaptive paging to route alerts. Zalando's canonical answer is top-down governance: a senior manager (Director / VP) who owns the customer experience the CBO realises signs off the SLO and is accountable for it, which gives the symptom alert political backing when teams push back (sources/2022-04-27-zalando-operation-based-slos).
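
The CBO taxonomy, tiering, and ownership prerequisites can be pictured as one small catalogue. The sketch below is illustrative only: the `OperationSLO` fields, the entries, the owners, and the tier numbers are assumptions, not Zalando's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationSLO:
    operation: str  # name of the user-facing operation (CBO)
    tier: int       # 1 = most critical (hypothetical numbering)
    symptom: str    # what counts as a failure for this operation
    target: float   # objective expressed as a success ratio
    owner: str      # accountable role, not a single service team


# Hypothetical catalogue entries for illustration.
CATALOGUE = [
    OperationSLO("checkout", 1, "root span tagged as error", 0.999, "VP Checkout"),
    OperationSLO("sign-in", 1, "root span tagged as error", 0.999, "Director Identity"),
    OperationSLO("wishlist-view", 3, "root span tagged as error", 0.99, "Director Discovery"),
]


def slos_up_to_tier(catalogue: list[OperationSLO], max_tier: int) -> list[OperationSLO]:
    """Select the operations critical enough to get an SLO first."""
    selected = [slo for slo in catalogue if slo.tier <= max_tier]
    return sorted(selected, key=lambda slo: (slo.tier, slo.operation))


for slo in slos_up_to_tier(CATALOGUE, max_tier=1):
    print(f"{slo.operation}: target {slo.target:.1%}, owner {slo.owner}")
```

Keeping symptom, target, and owner in one record makes the prerequisites auditable: an operation without a failure definition or an accountable owner is visibly incomplete.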

Interaction with adaptive paging

Operation-based SLOs raise a question service-based SLOs did not: "when the SLO is breached, who gets paged?" Service-based SLOs have an implicit answer (page the service team). Operation-based SLOs need a routing layer, which is precisely the role of concepts/adaptive-paging. Zalando co-evolved Operation-based SLOs, Adaptive Paging, and Symptom-Based Alerting as three parts of one design.
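
To make the routing question concrete, here is one possible heuristic, sketched under loud assumptions: the source does not describe Adaptive Paging's actual algorithm, and the `Span` type, service names, and team names are hypothetical. The sketch walks down the errored path of a trace and pages the owner of the deepest errored span.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    service: str
    owner: str  # team that would be paged for this service
    error: bool = False
    children: list["Span"] = field(default_factory=list)


def team_to_page(root: Span) -> Optional[str]:
    """Follow errored child spans from the root; return the deepest owner.

    Assumption: the deepest errored span on the path is the most likely
    cause. Returns None when the operation itself did not fail.
    """
    if not root.error:
        return None
    current = root
    while True:
        failing = [child for child in current.children if child.error]
        if not failing:
            return current.owner
        current = failing[0]  # simplification: follow the first errored child


trace = Span("checkout", "team-checkout", error=True, children=[
    Span("cart", "team-cart"),
    Span("payment", "team-payment", error=True, children=[
        Span("fraud-check", "team-risk", error=True),
    ]),
])
print(team_to_page(trace))
```

In this example the alert is raised on the checkout operation but routed to the team behind the fraud-check service, not to the orchestrating team.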

Interaction with the error budget

Each CBO gets its own error budget. Budget-depletion policies (a feature freeze on the CBO's owner teams, etc.) apply at operation granularity. This gives the organisation a feature-velocity throttle grounded in user experience, not a per-service one. Teams cannot ship new features whose integration would further consume a red CBO's budget.
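
A per-operation error budget can be sketched as follows, assuming a request-based SLI over a fixed window. The function names, the request counts, and the freeze rule are illustrative, not a policy taken from the source.

```python
def error_budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    The budget is the number of failures the target tolerates in the window:
    (1 - target) * total requests.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - target) * total
    return 1 - failed / allowed_failures


def feature_freeze(target: float, total: int, failed: int) -> bool:
    """Hypothetical policy: freeze feature work once the budget is spent."""
    return error_budget_remaining(target, total, failed) <= 0


# Illustrative window: 1,000,000 checkout requests against a 99.9% target,
# so about 1,000 failures are tolerated.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"budget remaining: {remaining:.0%}")          # about 60%
print(feature_freeze(0.999, 1_000_000, 1_200))       # budget overspent
```

Because the counts come from the operation's root spans, the throttle applies to every team contributing to that operation, whichever service caused the failures.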

Zalando's 2022 evolution (sources/2022-04-27-zalando-operation-based-slos) makes the budget the driver of the alert threshold, not just a post-hoc metric: Adaptive Paging moved from firing when the raw error rate exceeded the SLO threshold to firing on the Multi-Window Multi-Burn-Rate (MWMBR) budget burn rate, eliminating short-spike false positives and per-alert fine-tuning in one change. See concepts/error-budget for the canonical treatment.
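
A minimal sketch of a multi-window, multi-burn-rate check may help show why short spikes stop paging. The window pairs and thresholds below (14.4 over 1h/5m, 6 over 6h/30m) are the commonly cited values from Google's SRE Workbook, used here as an assumption; the source does not state Zalando's parameters, and the rule and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float  # burn rate that must be exceeded in both windows


# Assumed rule set, not Zalando's configuration.
RULES = [BurnRule("1h", "5m", 14.4), BurnRule("6h", "30m", 6.0)]


def burn_rate(error_rate: float, target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1 - target)


def should_page(error_rates: dict[str, float], target: float) -> bool:
    """Page only if some rule is exceeded in both its long and short window."""
    return any(
        burn_rate(error_rates[rule.long_window], target) > rule.threshold
        and burn_rate(error_rates[rule.short_window], target) > rule.threshold
        for rule in RULES
    )


# A short spike: high error rate over 5 minutes, diluted over longer windows.
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.005, "6h": 0.001}
# A sustained problem: 2% errors across every window.
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02}

print(should_page(spike, target=0.999))      # False: long windows stay calm
print(should_page(sustained, target=0.999))  # True: burn rate about 20 in 1h and 5m
```

The long window filters out brief spikes, and the short window lets the alert clear quickly once the problem stops, which matches the reduction in false positives the source reports.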

Contrast with SRE-book SLOs

Google's SRE book has always described SLOs keyed on user-facing SLIs, such as "end-to-end request success rate for the checkout service". The explicit operation-based framing Zalando adopts generalises this to microservice scale, where "the checkout service" is a fiction replaced by "the checkout operation" and the SLI is measured at the root span of the trace rather than at any single service. Same underlying philosophy, scaled-up implementation.

Seen in

  • Zalando names the pivot as a significant SLO-strategy change and positions it as a consequence of Distributed Tracing + Adaptive Paging. The technical deep-dive is referenced as a 2022-04 follow-up post, "Operation based SLOs".
  • sources/2022-04-27-zalando-operation-based-slos: canonical technical deep-dive for the concept on the wiki. Names the load-bearing formula SLO = Symptom + Target; documents the three service-based-SLO failure modes Zalando hit (high SLO count, product↔service mapping conflict, fine-grained alignment pain); introduces the top-down SLO ownership model (senior manager / VP owns the CBO's SLO, not the team); covers the pivot to transport-agnostic SLIs via the OpenTracing error tag (beyond 5xx rate, letting graceful-degradation states count correctly); describes the Adaptive Paging evolution from raw-rate trigger to MWMBR burn-rate trigger; ships as the new Service Level Management Tool. Dogfood numbers (3-month SRE-department trial): false-positive rate 56%→0%, alerts 2→0.14/day, 30+ cause-based alerts disabled, zero user-facing incidents missed. Headline supplementary benefits named: longevity (the operation outlives service architectures), impact communication at CBO altitude ("50% errors on Add-to-Cart" is legible in a way "50% errors on Service Foo" isn't), out-of-the-box alerts (once the CBO has an SLO, the alert is derived).
  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii: documents that the 2020 rollout of Operation-Based SLOs continued "by working closely with the senior management of several departments and agreeing on their respective SLOs." Reinforces that operation-based SLOs are a product-management conversation, not just an engineering one; they are also the input that MWMBR-based alert rule generation depends on.