CONCEPT Cited by 1 source
Operation-based SLO
An Operation-Based SLO is a Service Level Objective defined per business operation or user journey (e.g., "checkout success rate ≥ 99.9%", "search latency p99 ≤ 500ms") rather than per individual service. It is a direct consequence of combining a Critical Business Operation (CBO) taxonomy with fleet-wide distributed tracing: once the root span of a user-facing operation carries success/failure attribution across N services, its SLO can be measured directly, without multiplying together per-service availabilities.
Definition
"Finally, through our use of Distributed Tracing, and Adaptive Paging, we made a significant change in our SLO strategy. We moved away from service based SLOs, and started rolling out Operation based SLOs." — sources/2021-09-20-zalando-tracing-sres-journey-part-ii
Three defining properties:
- Scope is the user-facing operation, not the service — "checkout", "item view", "add to cart", "sign in", each with its own error-budget and target.
- SLI is measured at the root span of the traced request, so a failure anywhere in the compound trace (including non-critical downstream) counts as a CBO failure.
- Ownership is the operation's orchestrating team, not a single service team — though fault routing uses adaptive paging to reach the actually-responsible service.
The service-based SLO failure mode
Service-based SLOs are the SRE-book default but collapse at microservice scale:
- Compound availability gap. If a user journey touches 15 services at 99.9% each, and every service must succeed with failures independent of one another, compound availability is 0.999^15 ≈ 98.5%. That is a failure rate roughly 15 times higher than any single SLO suggests. Users experience 1.5% failure; dashboards show all green.
- Fan-out attribution. A per-service SLO gets "credit" for being available even when its response made the upstream fail (e.g., returns stale data, takes too long, returns 200-wrapping-500). Service X can be 99.99% per its SLO and simultaneously responsible for the CBO's 1% failure rate.
- Dependency hell for fault budgeting. If a compound user journey's error budget is consumed by service X's slow response, neither the upstream service nor X's service-level SLO flags it — the organisation's error-budget framework fails to allocate blame / guide fixing.
- Reliability theater. Teams optimise their own SLO without regard to the user journey. A green per-service SLO dashboard is compatible with a user revolt.
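The compound availability figure above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a strictly serial dependency chain with independent failures; the function name `compound_availability` is illustrative and does not come from the source.

```python
def compound_availability(per_service: float, n_services: int) -> float:
    """Availability of a journey that needs all n_services to succeed.

    Assumes independent failures and that every service is on the
    critical path (a simplification; real journeys have retries,
    fallbacks, and correlated outages).
    """
    return per_service ** n_services


# The example from the text: 15 services, each meeting a 99.9% SLO.
journey = compound_availability(0.999, 15)
failure_rate = 1 - journey

print(f"journey availability: {journey:.2%}")       # roughly 98.5%
print(f"journey failure rate: {failure_rate:.2%}")  # roughly 1.5%
```

Every per-service dashboard in this scenario is green, yet the journey misses a 99.9% target by more than an order of magnitude.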
How operation-based SLOs fix it
- The SLI is "was the user's journey successful end-to-end?" measured at the root span. All compound failures surface.
- The SLO target is set at the user-experience level (99.9% checkout availability) rather than arbitrarily inherited per-service.
- Error-budget depletion triggers org-wide response — not just the service that happens to be at the tail of the trace.
- Service-tier classification + operation-based SLOs together give an org the SLO portfolio it needs: the Tier-1 CBOs (checkout, payments, sign-in) have tight SLOs; Tier-3 CBOs have looser ones; individual services no longer need bespoke SLOs.
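As an illustration of measuring the SLI at the root span, the sketch below computes a per-operation success rate from a batch of span records. The `Span` record and its fields (`operation`, `parent_id`, `error`) are hypothetical stand-ins for whatever the tracing backend exports; they are not the OpenTracing or OpenTelemetry data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not a real tracing API)."""
    operation: str            # e.g. "checkout"
    parent_id: Optional[str]  # None marks the root span of a trace
    error: bool               # True if the span ended in failure


def operation_sli(spans: Iterable[Span]) -> Dict[str, float]:
    """Success rate per operation, counting root spans only."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            continue  # child spans do not count towards the SLI
        totals[span.operation] += 1
        if span.error:
            failures[span.operation] += 1
    return {op: 1 - failures[op] / totals[op] for op in totals}


spans = [
    Span("checkout", None, False),
    Span("checkout", None, False),
    Span("checkout", None, True),     # failed journey
    Span("checkout", None, False),
    Span("payment-auth", "a1", True), # child span, ignored here
]
sli = operation_sli(spans)
print(sli)  # one entry, for "checkout"
```

The child span's error only affects the SLI if it propagated to the root span, which is why error attribution on the root is listed as a prerequisite.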
Prerequisites
- Distributed tracing deployed on the CBO's full path, with error and status attribution on the root span, for example via OpenTracing or OpenTelemetry and their semantic conventions.
- CBO taxonomy — named catalogue of user-facing operations with clear failure definitions. See concepts/critical-business-operation.
- Service Tier classification — to prioritise which CBOs get SLOs and what the targets should be.
- Error-budget policy agreed org-wide — feature velocity vs reliability tradeoffs only work if someone enforces the budget across teams touching the operation.
- Multi-service ownership model — a single service team owning an operation's SLO doesn't work when the operation spans 15 teams; typically a platform / reliability team owns the SLO and uses adaptive paging to route alerts.
Interaction with adaptive paging
Operation-based SLOs raise a question service-based SLOs did not: "when the SLO is breached, who gets paged?" Service-based SLOs have an implicit answer (page the service team). Operation-based SLOs need a routing layer, which is precisely the role of concepts/adaptive-paging. Zalando co-evolved Operation-based SLOs, Adaptive Paging, and Symptom-Based Alerting as three parts of one design.
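One way to picture such a routing layer is a walk down the failing trace, following error-tagged spans until no failing child remains, and paging the owner of that last span. The sketch below is a simplified heuristic under that assumption, not Zalando's actual Adaptive Paging algorithm; `SpanNode`, `owner_team`, and the team names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpanNode:
    """Hypothetical trace-tree node; not a real tracing API."""
    service: str
    owner_team: str
    error: bool
    children: List["SpanNode"] = field(default_factory=list)


def team_to_page(root: SpanNode) -> str:
    """Follow error-tagged children from the root and return the team
    owning the deepest failing span on that path.

    Simplification: when several children failed, the first one is
    followed. A real router would need a policy for that case.
    """
    node = root
    while True:
        failing = [child for child in node.children if child.error]
        if not failing:
            return node.owner_team
        node = failing[0]


trace = SpanNode("checkout-api", "team-checkout", True, [
    SpanNode("cart", "team-cart", False),
    SpanNode("payment", "team-payments", True, [
        SpanNode("fraud-check", "team-risk", False),
    ]),
])
print(team_to_page(trace))
```

Here the checkout operation breached, but the page goes to the payments team, because its span is the deepest one carrying the error.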
Interaction with the error budget
Each CBO gets its own error budget. Budget depletion policies (a feature freeze on the CBO's owner teams, for example) apply at operation granularity. This gives the organisation a feature-velocity throttle grounded in user experience rather than in any single service. Teams cannot ship new features whose integration would further consume a red CBO's budget.
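A per-operation error budget follows from the SLO target and the request volume in the measurement window. The sketch below is a minimal illustration; the function name, the request counts, and the "freeze when the budget is exhausted" rule are assumptions for the example, not a policy described in the source.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the window.

    1.0 means untouched, 0.0 means exactly exhausted, and negative
    values mean the SLO has been breached.
    """
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        # No traffic, or a 100% target: any failure exhausts the budget.
        return 1.0 if failed == 0 else 0.0
    return 1 - failed / allowed_failures


# Hypothetical window: 1,000,000 checkout attempts against a 99.9% SLO,
# so roughly 1,000 failures are allowed.
healthy = error_budget_remaining(0.999, 1_000_000, 400)
breached = error_budget_remaining(0.999, 1_000_000, 1_500)

# Assumed policy for illustration: freeze features once the budget is gone.
freeze_features = breached <= 0

print(f"healthy window:  {healthy:.0%} of budget left")
print(f"breached window: {breached:.0%} of budget left, freeze={freeze_features}")
```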
Contrast with SRE-book SLOs
Google's SRE book has always described SLOs keyed on user-facing SLIs, such as "end-to-end request success rate for the checkout service". The explicit operation-based framing Zalando adopts generalises this to microservice scale, where "the checkout service" is a fiction replaced by "the checkout operation", and the SLI is measured at the root span of the tracing graph rather than at any single service. Same underlying philosophy, scaled-up implementation.
Seen in
- sources/2021-09-20-zalando-tracing-sres-journey-part-ii: Zalando names the pivot as a significant SLO-strategy change and positions it as a consequence of Distributed Tracing + Adaptive Paging. A technical deep-dive is referenced as a 2022-04 follow-up post, "Operation based SLOs".