Zalando Service Level Management Tool¶
Service Level Management Tool is Zalando's operation-based SLO reporting and management system, the explicit successor to the 2018 DX-scoped SLO Reporting Tool. It tracks SLOs defined per Critical Business Operation (CBO) rather than per service, displays per-CBO error-budget burn over rolling 28-day windows, and drives Adaptive Paging's multi-window multi-burn-rate (MWMBR) alert thresholds.
Origin¶
Built by the unified Zalando Digital Foundation SRE team in 2021–2022, after the 2019 pivot from service-based to operation-based SLOs. First publicly named in the 2022-04-27 Operation Based SLOs post:
"As more and more CBOs were defined, we needed to improve the reporting capabilities of our tooling, and developed a new Service Level Management tool that catered to this operation based approach."
What it does¶
Three responsibilities parallel to the predecessor tool, but keyed on the CBO axis rather than the service axis:
- Per-CBO SLO dashboard. Each CBO ("Place Order", "View Catalog", "Add to Cart") has its own page showing the target and current compliance. The screenshot in the post shows operation-named rows rather than service-named rows.
- Error-budget visualisation across three 28-day windows. A four-week window is the general-purpose interval suggested in Google's SRE Workbook. The post's figure is titled "Error Budget over three 28 day periods." Three consecutive windows span 84 days, roughly a quarter, which lets leadership see budget-depletion trends and distinguish long-term reliability drift from short-term outages.
- Drives Adaptive Paging's MWMBR thresholds. The tool is the system-of-record for the SLO; Adaptive Paging reads from it to compute burn rates and fire alerts. "Because the Error Budget is derived from the SLO, it is still the SLO that made it possible to derive the alert threshold automatically."
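The post does not show how Adaptive Paging turns an SLO into alert thresholds, so the following is only a minimal sketch of the general multi-window multi-burn-rate technique. The names `BurnRateRule`, `burn_rate`, and `should_page` are hypothetical, and the `(long window, short window, burn rate)` tuples are borrowed from the example in Google's SRE Workbook, not from Zalando's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One (long window, short window, burn-rate threshold) tuple."""
    long_window_min: int
    short_window_min: int
    threshold: float  # 1.0 means "spending the budget exactly on schedule"


# Illustrative values only (assumption): taken from the SRE Workbook example,
# not disclosed by Zalando.
RULES = (
    BurnRateRule(long_window_min=60, short_window_min=5, threshold=14.4),
    BurnRateRule(long_window_min=360, short_window_min=30, threshold=6.0),
)


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent relative to the allowed rate."""
    return error_rate / (1.0 - slo_target)


def should_page(slo_target: float, error_rate_by_window_min: dict, rules=RULES) -> bool:
    """Page only if both the long and the short window exceed a rule's threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so alerts reset quickly after recovery.
    """
    for rule in rules:
        long_burn = burn_rate(error_rate_by_window_min[rule.long_window_min], slo_target)
        short_burn = burn_rate(error_rate_by_window_min[rule.short_window_min], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return True
    return False
```

Because every threshold is expressed as a multiple of the budget `1 - slo_target`, changing the SLO target in the system-of-record changes the effective error-rate threshold without touching any alert rule, which is the property the quote above describes.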
Relationship to the older SLO Reporting Tool¶
| Attribute | SLO Reporting Tool (2018) | Service Level Management Tool (2022) |
|---|---|---|
| SLO altitude | Service | Operation (CBO) |
| Scope | DX department only | Rolled out org-wide as CBO catalogue grows |
| SLI primitive | Canonical per-service SLIs | CBO-level SLIs (root-span error/status) |
| Tier classification | Service tiers (3-tier) | Still applies, but to operations: Tier-1 CBOs → tight SLOs |
| Alert threshold | Raw error-rate breach | MWMBR error-budget burn |
| Budget visualisation | Per-service | Per-CBO, three 28-day windows |
| Feeds | Shown in dashboards only | Drives Adaptive Paging at alert time |
The post does not say that the older tool was turned off. It does note that "even teams that did adopt CBOs, weren't disabling their cause based alerts", which suggests that service-level and operation-level tooling coexisted as the rollout proceeded.
Why it matters¶
The Service Level Management Tool is the platform-team deliverable that makes operation-based SLOs operational:
- Without an operation-keyed dashboard, VPs can't see their CBO's SLO status (they'd have to combine per-service dashboards manually).
- Without error-budget burn visualisation, MWMBR alerting has no user-facing view — engineers can't tell why their CBO is alerting without drilling into burn-rate state.
- Without an SLO system-of-record, Adaptive Paging can't derive its alert thresholds automatically — it would need per-rule configuration.
Prerequisites¶
- A defined CBO catalogue — the tool's rows are CBOs.
- Per-CBO SLI pipeline — distributed-tracing root-span aggregation for CBO error rate. See OpenTracing.
- Per-CBO SLO targets with executive ownership — a senior manager (Director / VP) owns each CBO's target.
- Error-budget arithmetic — derived from the SLO target and rolling window.
- Integration with systems/zalando-adaptive-paging for burn-rate alerting.
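As a rough illustration of how the SLI pipeline and the error-budget arithmetic fit together, the sketch below aggregates root spans into a per-CBO error budget for three consecutive 28-day periods. The `RootSpan` and `BudgetReport` types and both function names are hypothetical; the post does not describe the real data model or storage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)


@dataclass(frozen=True)
class RootSpan:
    """Hypothetical minimal view of a tracing root span for one CBO request."""
    cbo: str
    finished_at: datetime
    error: bool


@dataclass(frozen=True)
class BudgetReport:
    total: int
    failed: int
    allowed_failures: float  # (1 - SLO target) * total requests
    budget_consumed: float   # fraction of the budget spent; > 1.0 means overspent


def error_budget_report(spans, cbo, slo_target, window_end, window=WINDOW):
    """Compute error-budget consumption for one CBO over [window_end - window, window_end)."""
    window_start = window_end - window
    in_window = [
        s for s in spans
        if s.cbo == cbo and window_start <= s.finished_at < window_end
    ]
    total = len(in_window)
    failed = sum(s.error for s in in_window)
    allowed = (1.0 - slo_target) * total
    consumed = failed / allowed if allowed > 0 else 0.0
    return BudgetReport(total, failed, allowed, consumed)


def last_three_windows(spans, cbo, slo_target, now):
    """Reports for three back-to-back 28-day periods, oldest first."""
    return [
        error_budget_report(spans, cbo, slo_target, now - i * WINDOW)
        for i in (2, 1, 0)
    ]
```

For example, a "Place Order" CBO with a 99.9% target and one million requests in a period would have an allowance of about 1,000 failed requests for that period; 250 failures would mean roughly a quarter of the budget is spent.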
What the post doesn't disclose¶
- Internal architecture. Built in-house; no disclosure of whether it's a custom Go/Java service, Grafana plus a custom backend, or something else.
- Storage substrate. No disclosure of how SLI time-series are stored (Prometheus, VictoriaMetrics, Druid, a bespoke service).
- Current scale. Number of tracked CBOs, number of teams onboarded, MWMBR alert volume — all undisclosed.
- UI details beyond the one figure. A single error-budget view is shown; the rest of the UI surface is not described.
- MWMBR threshold configuration. Whether engineers can tune the (short, long, burn-rate) tuples or whether they are platform-fixed is not stated.
Seen in¶
- sources/2022-04-27-zalando-operation-based-slos — canonical first public mention. Figure caption "Our Service Level Management Tool (operation based - not actual data)." Second figure "Error Budget over three 28 day periods."
Related¶
- systems/zalando-slo-reporting-tool — predecessor (service-based, 2018, DX-scoped).
- systems/zalando-adaptive-paging — downstream consumer of SLO data for MWMBR alerting.
- concepts/operation-based-slo — the SLO altitude the tool is keyed on.
- concepts/critical-business-operation — the alertable primitive.
- concepts/error-budget — the visualisation axis.
- concepts/multi-window-multi-burn-rate — the alerting strategy the tool's data powers.
- concepts/service-tier-classification — still applies at operation altitude.
- companies/zalando