
ZALANDO 2021-10-14 Tier 2


Zalando — Tracing SRE's journey in Zalando - Part III

Summary

Third and final installment of Zalando's SRE retrospective (Koutsiaris + Koutsouraki, 2021-10-14), covering the 2020 transition from a single SRE team to an SRE department. The late-2019 Central Functions reorg folded the existing SRE Enablement team together with the teams building Zalando's monitoring services and infrastructure, and with Incident Management, into a single SRE department; the shared purpose — "reduce the impact of incidents while supporting all builders at Zalando to deliver innovation to their users reliably and confidently" — was codified as the SRE Strategy published in 2020. The strategy anchored on Observability as the standardization target: one common understanding of Observability across the company, SDKs for the major programming languages (per Zalando's Tech Radar), and reduced overhead from running multiple observability services.

Four concrete product / process moves followed: (1) a new incident process that separates anomalies from incidents, driven by starting to measure incident count, MTTR, false positive rate, and customer impact; (2) continued rollout of Operation-Based SLOs guarded by Adaptive Paging, which was upgraded from SLO-threshold paging to Multi-Window Multi-Burn-Rate threshold calculation so that alert rules for operations can be derived automatically from the SLO without engineer trial-and-error; (3) a new Service Level Management tool that reports the SLO and remaining Error Budget per operation, to make the budget remainder a prioritization signal; (4) the SRE Curriculum, a pandemic-era pivot from in-person training sessions (incident response, distributed tracing, alerting strategies) to video + quiz modules co-produced with the company Tech Academy and folded into Zalando onboarding.
The department also spawned an Embedded SRE team for the Checkout product area — requested by Checkout senior management rather than pitched by SRE — reporting to the SRE department but aligned on dual KPIs (Availability via SLOs + On-Call Health measured by paging rate and individual on-call frequency) with the product area. Post ends the series with a candid "Were we 100% successful? No" and stakes the claim that postmortem-driven strategy evolution is how the program will keep moving.

Key takeaways

  1. SRE graduates from team to department when monitoring / observability / incident-management teams need a shared charter. The 2019 reorg, "centered around a set of principles, chief among them were 'Customer Focus', 'Purpose' and 'Vision'", folded SRE Enablement together with the monitoring-services and monitoring-infrastructure teams and Incident Management under a single department.

  2. The department was not created because someone pitched "we need an SRE department" but because those teams were already collaborating closely (Incident Commander, Postmortems, and Distributed Tracing all crossed the team boundary), and a shared reporting chain unlocked the synergies. This is the canonical Phase-2 → Phase-3 completion data point for concepts/sre-organizational-evolution. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

  3. An SRE department needs a Strategy document, or it drifts into ad-hoc projects. Zalando names this directly: "Before, with a single team we could be (and occasionally had to be) more flexible, picking ad hoc projects. But now we had teams with a better defined purpose. And we wanted to have all teams working together towards a common goal." The 2020 SRE Strategy anchored on Observability as the standardization target because it is simultaneously the product of the monitoring teams, the driver of SLO-based work for SRE, and the input to Incident Management: a single target that each of the three teams owns a piece of. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

  4. SLO-derived alert rules eliminate engineer tuning (eventually). The first Adaptive Paging iteration used the SLO target as the paging threshold and immediately hit the same false-positive pathology as any static-threshold alert: "it made our alerts too sensitive to occasional short lived spikes, similar to any other non-Adaptive Paging alert." Engineers were back to tuning per-alert criteria (time of day, throughput, error-rate length) — the exact thing the SLO was supposed to eliminate. Upgrading the handler to Multi-Window Multi-Burn-Rate threshold calculation derived the window lengths and thresholds automatically from the SLO, returning to the hands-off promise. See patterns/slo-derived-alert-rule-generation. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
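The derivation the takeaway describes can be sketched directly. This is a minimal illustration of the Multi-Window Multi-Burn-Rate recipe from the Google SRE Workbook that the post cites; the burn rates and windows below are the Workbook's recommended defaults, not Zalando's disclosed values.

```python
# Derive Multi-Window Multi-Burn-Rate paging rules automatically from an SLO.
# Recipe defaults per the SRE Workbook: burning 2% of a 30-day budget in 1h,
# 5% in 6h, or 10% in 3 days each warrants an alert.

def mwmbr_rules(slo: float):
    """Derive (long window, short window, error-rate threshold) paging rules."""
    budget = 1.0 - slo  # total error budget, as an error-rate fraction
    # (burn rate, long window in minutes)
    recipe = [(14.4, 60), (6.0, 360), (1.0, 4320)]
    rules = []
    for burn_rate, long_min in recipe:
        rules.append({
            "long_min": long_min,
            "short_min": long_min // 12,  # Workbook: short window = long / 12
            "error_rate_threshold": burn_rate * budget,
        })
    return rules

for r in mwmbr_rules(slo=0.999):
    print(f"page if error rate > {r['error_rate_threshold']:.3%} "
          f"over both {r['long_min']}min and {r['short_min']}min")
```

The point of the hands-off promise: the only input an engineer supplies is the SLO; windows and thresholds fall out mechanically.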

  5. Error Budget as the paging primitive replaces SLO as the paging primitive. A direct quote: "Deciding whether to page someone or not was no longer whether the SLO was breached or not, but rather whether the Error Budget was in risk of being depleted or not." This is a material semantic shift: an SLO target at 99.9% means a momentary dip below 99.9% still leaves plenty of budget over a 28-day window; paging on momentary SLO breach pages the wrong signal. Paging on burn-rate-against-budget is the whole point of concepts/error-budget. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
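A toy calculation (hypothetical numbers, not from the post) makes the semantic shift concrete: a short error spike breaches a 99.9% target instantaneously while barely denting the 28-day budget.

```python
# Why "SLO breached right now" and "budget at risk" are different signals.
SLO = 0.999
WINDOW_MIN = 28 * 24 * 60            # 28-day rolling window, in minutes
BUDGET_MIN = WINDOW_MIN * (1 - SLO)  # ~40.3 minutes of full outage allowed

def budget_consumed(error_rate: float, duration_min: float) -> float:
    """Fraction of the window's error budget burned by an error spike."""
    return (error_rate * duration_min) / BUDGET_MIN

# A 5-minute spike at a 2% error rate: far above the 0.1% SLO threshold
# (so threshold paging fires), yet it burns only ~0.25% of the budget.
spike = budget_consumed(error_rate=0.02, duration_min=5)
print(f"budget consumed by spike: {spike:.2%}")
```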

  6. A new Service Level Management tool exposes Error Budget, not just SLOs — because Error Budget is what drives prioritization. Zalando's earlier SLO Reporting tool (SLR) reported SLOs only. "As we evolved the concept of SLOs, so too did we evolve the tooling that supported it. Other than reporting SLOs for the different operations, we also gave a view on the Error Budget." Why this matters: "Knowing how much Error Budget is left makes it easier to use it to steer prioritization of development work." If the UI doesn't show remaining budget, teams can't make the feature-velocity-vs-reliability trade the budget is designed to enable. See systems/zalando-service-level-management-tool. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
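A minimal sketch of the budget view such a tool reports: per operation, SLO attainment plus the fraction of budget remaining. Operation names and counts are invented ("not actual data", as the post's own screenshot says).

```python
# Per-operation error-budget report: attainment and remaining budget fraction.

def budget_report(operations, slo: float):
    """operations: {name: (good_events, total_events)} over the SLO window."""
    budget = 1.0 - slo  # allowed failure fraction
    report = {}
    for name, (good, total) in operations.items():
        spent = 1.0 - good / total  # actual failure fraction
        report[name] = {
            "attainment": good / total,
            "budget_remaining": max(0.0, 1.0 - spent / budget),
        }
    return report

ops = {"create_order": (999_700, 1_000_000), "get_cart": (999_950, 1_000_000)}
for name, row in budget_report(ops, slo=0.999).items():
    print(f"{name}: {row['attainment']:.4%} attained, "
          f"{row['budget_remaining']:.0%} of error budget left")
```

The "budget_remaining" column, not the attainment column, is the prioritization signal the takeaway describes.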

  7. The incident process should separate Anomalies from Incidents. Zalando's reframe: things that page aren't all incidents — many are anomalies (momentary signal, not user impact). A distinct process for anomalies (investigate, possibly tune or ticket) vs incidents (engage responders, mitigate, postmortem) reduces false-positive paging pressure on on-call engineers without losing the signal. This complements concepts/symptom-based-alerting: symptom alerting reduces noise; anomaly/incident separation reduces ceremony on the noise that remains. See concepts/anomaly-vs-incident-separation. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
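The split can be sketched as a triage step in front of the incident process. The record fields and routing rule here are hypothetical, not Zalando's actual process definition.

```python
# Triage every paging signal; only confirmed customer impact enters the
# incident process (responders, mitigation, postmortem). Everything else is
# an anomaly: investigate offline, tune the alert, or file a ticket.

from dataclasses import dataclass

@dataclass
class PagingSignal:
    description: str
    customer_impact: bool  # confirmed user-facing impact?

def triage(signal: PagingSignal) -> str:
    if signal.customer_impact:
        return "incident"  # full process: engage, mitigate, postmortem
    return "anomaly"       # lightweight process; re-triage if impact emerges

print(triage(PagingSignal("checkout 5xx spike", customer_impact=True)))
print(triage(PagingSignal("brief latency blip", customer_impact=False)))
```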

  8. Measure the incident-response pipeline itself, early and explicitly. "One of the first things we did after creating the department was to define the KPIs that would guide our work, make sure they were being measured, and facilitate the reporting of those KPIs." Zalando names four: incident count, MTTR, false positive rate, and customer impact. Without these, claims like "incident process is better now" are anecdotal. With these, changes like anomaly/incident separation become measurable. See concepts/sre-kpi-portfolio. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
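Three of the four named KPIs can be computed mechanically from a log of paging events, as sketched below; customer impact needs its own measurement and is omitted. The record shape and numbers are hypothetical.

```python
# Incident count, MTTR, and false positive rate from a paging log.

from datetime import datetime, timedelta

pages = [
    # (opened, resolved, turned out to be a real incident?)
    (datetime(2020, 1, 3, 10, 0), datetime(2020, 1, 3, 10, 40), True),
    (datetime(2020, 1, 9, 2, 15), datetime(2020, 1, 9, 2, 20), False),  # false positive
    (datetime(2020, 1, 20, 18, 0), datetime(2020, 1, 20, 19, 20), True),
]

incidents = [(opened, resolved) for opened, resolved, real in pages if real]
incident_count = len(incidents)
mttr = sum((resolved - opened for opened, resolved in incidents),
           timedelta()) / incident_count
false_positive_rate = 1 - incident_count / len(pages)

print(incident_count, mttr, f"{false_positive_rate:.0%}")
```

With these in place, a change like anomaly/incident separation shows up as a measurable drop in false positive rate rather than an anecdote.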

  9. Embedded SRE as a customer-pulled team shape. Zalando's Embedded SRE for Checkout was not an SRE-side roadmap item: "the senior management of that department officially pitched for the creation of an Embedded SRE team." Structural properties: (a) reports to the SRE department, not the product area; (b) dual KPIs agreed between SRE and product area management — Availability (driven by the product area's SLOs) and On-Call Health (paging rate + individual on-call frequency); (c) explicitly more hands-on than Enablement — can touch product code / tooling; (d) feeds back to the SRE department as a listening post. This is a distinct shape from the lone-SRE-per-team anti-pattern in Part I — it's a full embedded SRE team, not an individual. See patterns/embedded-sre-team-from-customer-pull and concepts/embedded-sre-team. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

  10. On-Call Health is a first-class KPI, not a soft metric. Zalando names it as a KPI the embedded team will be measured on alongside Availability. Definition: "On-call Health will be measured taking into account paging alerts and how often an individual is on-call." The post explicitly names the motivation: "Pager fatigue is something that should not be dismissed, and can hurt a team through lower productivity and employee attrition." Treating on-call health as a measurable outcome of the reliability program — rather than a soft concern — is load-bearing for sustainable SRE. See concepts/on-call-health-metric. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)
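One way to make the metric concrete from the two inputs the post names (paging rate and individual on-call frequency). The formula and budgets below are invented for illustration; the post discloses no formula.

```python
# Assumed On-Call Health score: 1.0 = healthy, 0.0 = at/over budget on the
# worst dimension. Budgets (2 pages/week, 3 on-call weeks/quarter) are
# hypothetical placeholders.

def on_call_health(pages_per_week: float, weeks_on_call_per_quarter: float,
                   page_budget: float = 2.0, rotation_budget: float = 3.0) -> float:
    paging_load = min(1.0, pages_per_week / page_budget)
    rotation_load = min(1.0, weeks_on_call_per_quarter / rotation_budget)
    return 1.0 - max(paging_load, rotation_load)  # worst dimension dominates

print(on_call_health(pages_per_week=1.0, weeks_on_call_per_quarter=3.0))  # → 0.0
print(on_call_health(pages_per_week=0.5, weeks_on_call_per_quarter=1.5))  # → 0.5
```

Taking the max rather than an average reflects the pager-fatigue point: a tolerable average hides an individual who is paged constantly or on call every other week.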

  11. Observability standardization requires language-specific SDKs, not shared dashboards. "The concrete step for making this possible was to develop SDKs for the major programming languages at use in Zalando." Shared dashboards are downstream; the upstream lever is: every language emits observability signals the same way. This is the lesson that generalized Zalando's earlier OpenTracing rollout — get the instrumentation substrate standardized first, then the tools built on top are easier to maintain. See patterns/standardize-observability-sdk-per-language. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

  12. Async training curriculum survives remote work and scales onboarding. The pandemic forced Zalando's ad-hoc in-person SRE trainings off the calendar. Rebuild: video + quiz modules co-produced with the company Tech Academy, content reviewed by subject-matter experts, folded into new-hire onboarding. Three topics explicitly named: incident response, distributed tracing, alerting strategies. Outcome: "any engineer joining Zalando would get an introduction to some of the SRE practices we were rolling out." Async curriculum scales to hundreds of teams; in-person sessions don't. See concepts/sre-curriculum. (Source: sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

Systems mentioned

  • systems/zalando-service-level-management-tool

Concepts extracted

  • concepts/sre-organizational-evolution
  • concepts/error-budget
  • concepts/symptom-based-alerting
  • concepts/anomaly-vs-incident-separation
  • concepts/sre-kpi-portfolio
  • concepts/on-call-health-metric
  • concepts/embedded-sre-team
  • concepts/sre-curriculum

Patterns extracted

  • patterns/slo-derived-alert-rule-generation
  • patterns/embedded-sre-team-from-customer-pull
  • patterns/standardize-observability-sdk-per-language

Operational numbers

  • 2019 Central Functions reorg: folded 3 team groupings (SRE Enablement, monitoring services + infrastructure, Incident Management) into 1 department.
  • 2020: SRE Strategy published.
  • Observability SDKs: for the major programming languages on Zalando's Tech Radar; exact list not disclosed (typically JVM languages, Go, Python, JavaScript based on other Zalando posts).
  • Embedded SRE team: 1 instance (Checkout); reports to SRE department; dual KPIs (Availability + On-Call Health).
  • SRE Curriculum topics named explicitly: 3 (incident response, distributed tracing, alerting strategies).
  • KPIs named for incident pipeline: 4 (incident count, MTTR, false positive rate, customer impact).
  • On-Call Health metric inputs: 2 (paging rate + individual on-call frequency).

Caveats

  • Single-company retrospective. No quantitative comparison of incident KPIs before vs after the 2020 changes — "false positive rate dropped" is stated as expected, not measured, in this post.
  • The Embedded SRE team for Checkout is presented as an exciting new development "still figuring out most things as we go along". Durability not yet proven at publication time (2021-10); whether it scaled to more product areas is not disclosed.
  • The Error Budget UI claim is load-bearing but not illustrated with real numbers — the screenshot in the post is labeled "not actual data".
  • Multi-Window Multi-Burn-Rate is attributed to Google (with a direct citation to the SRE Workbook). This post doesn't innovate on the algorithm; it describes Zalando's adoption.
  • The three-phase model in concepts/sre-organizational-evolution presents the phases as organic. Zalando's own history shows Phase 3 arrived via a top-down reorg, not a continuation of Phase 2. The reorg was the forcing function, not the grassroots efforts.
  • Hiring question raised but not resolved: "Now we're planning to bootstrap another team, so that cannot be making things any easier." How they actually staffed it is hinted at ("candidates with potential to join one of the teams in the department, and from there grow into the SRE role") but not elaborated with numbers.
  • Scope note: this post does not revisit or re-evaluate the "you build it, you run it" ownership model from Part I. Service teams still run their own services; the department is a platform + enablement function per concepts/sre-organizational-evolution.

Contradictions

None with existing wiki claims. Part III extends and reinforces the Part II claims about Adaptive Paging + Operation-Based SLOs + Multi-Window Multi-Burn-Rate, and adds new primitives (anomaly/incident separation, on-call health KPI, embedded SRE team shape, SRE Curriculum).

Source

sources/2021-10-14-zalando-tracing-sres-journey-part-iii