ZALANDO 2021-09-12 Tier 2

Zalando: Tracing SRE's journey in Zalando - Part I

Summary

First installment of Zalando's three-part retrospective on adopting Site Reliability Engineering, covering 2016 and the first (failed) attempt. The story: Zalando is mid-migration from monoliths to microservices on AWS, headcount hits 1,000+, the existing 5-team on-call rotation cannot scale, and Google's newly published SRE book arrives as inspiration. A grassroots coalition of engineers pitches SRE to management, wins initial buy-in, and debates the structural question: central team vs. one SRE per team vs. one SRE team per Product Cluster (the chosen shape). They survey the org for an "SRE profile" (software engineering + operational mindset + systems engineering + software architecture + troubleshooting), roll out SLOs and SLIs as the baseline reliability-measurement primitive, build the first SLO reporting tool (SLR), and run Reliability Workshops for Cyber Week on retries, circuit breakers, and fallbacks. The attempt stalls: SLOs stay engineer-driven and don't influence product decisions; Product Managers can't link SLOs to product outcomes; management is kept informed but not engaged. The resolution is a pivot: instead of ramping up SRE teams, senior management decides every delivery team goes on-call for its own critical services, crystallising the "you build it, you run it" mentality that still defines Zalando today. Part I ends mid-journey; Parts II and III cover how SRE re-emerges under the new constraint.

Key takeaways

Grassroots SRE is the typical first phase, and it can fail. Zalando's 2016 attempt is the canonical counter-example: management bought in, SLOs rolled out, workshops ran — but the program stalled when SLOs never became a product-management primitive. This is Phase 1 of concepts/sre-organizational-evolution with a negative outcome, a useful data point against the phase model being universally upward. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
The structural question is the hardest one. Zalando explicitly debated three SRE team shapes: (a) one central team — rejected because the org was already 1,000+ engineers so no central team could cover the surface; (b) one SRE per delivery team — rejected because scope would be too wide and the lone SRE would degrade into the team's Ops engineer; (c) one SRE team per Product Cluster — chosen, gives end-to-end domain responsibility without too-wide scope. See patterns/sre-team-per-product-cluster. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
Reporting chain matters — SRE must be separated from product delivery. Zalando followed the Google SRE workbook guidance to treat reliability work as a specialised role with its own reporting chain, not as a function inside product teams. The alternative (embed in delivery teams) has a known failure mode: SRE gets re-absorbed as Ops. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
The SRE "profile" as a hiring filter. Zalando's internal SRE-interest survey named five skills: software engineering, operational mindset, systems engineering, software architecture, troubleshooting. The survey served both as a self-selection signal for internal mobility and as a way to gauge the talent pool. The post lists the five skills verbatim. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
SLOs without product-manager buy-in don't change behaviour. The failure mode Zalando names directly: "The vast majority of SLOs were defined through initiatives from Engineers. But in a microservice architecture, a product is implemented by multiple services. Product Managers had a hard time establishing a link between the different SLOs and their own expectations for the products they are responsible for. Management was kept in the loop, but not directly involved, so there was no real motivation for management to uphold the SLOs." SLO adoption is a socio-technical problem, not a technical one. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
Reliability Workshops as Cyber Week prep. The grassroots SRE team ran workshops for the most critical services covering Retry Strategies, Circuit Breakers, and Fallbacks, using the annual Cyber Week peak event as the forcing function to drive adoption. This is an early instance of patterns/annual-peak-event-as-capability-forcing-function at Zalando, later expanded in the 2020 Cyber Week post. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
"You build it, you run it" was a consequence of SRE failing, not an alternative to SRE. The ultimate resolution: "The way that was chosen to kickstart that capability building was by putting each delivery team on-call for the critical services they owned. This decision was fundamental to properly establish the 'you build it, you run it' mentality we still have today." Zalando's ownership model was not ideologically chosen — it emerged when the grassroots SRE rollout stalled and senior management needed an alternative. (Source: sources/2021-09-12-zalando-tracing-sres-journey-in-zalando-part-i)
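
The gap between per-service SLOs and a Product Manager's product-level expectation can be illustrated with a little arithmetic. The following is a minimal sketch, not anything from the post: it assumes a hypothetical product whose request path needs three services in series, with independent failures. The service names, targets, and the 30-day window are all invented for illustration.

```python
from math import prod

# Hypothetical per-service availability SLOs (illustrative, not from the post).
service_slos = {
    "catalog": 0.999,
    "cart": 0.999,
    "checkout": 0.995,
}


def composite_availability(slos: dict[str, float]) -> float:
    """Availability of a product that needs every listed service.

    Assumes the services sit in series on the request path and fail
    independently; real dependency graphs are rarely this simple.
    """
    return prod(slos.values())


def error_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability within the window for a target."""
    return (1.0 - availability) * window_days * 24 * 60


product_level = composite_availability(service_slos)
print(f"product-level availability: {product_level:.4%}")
print(f"implied 30-day error budget: {error_budget_minutes(product_level):.1f} min")
```

Under these assumptions the product-level figure is lower than any single service's target, which is one reason a list of service SLOs does not translate directly into what a product owner can expect.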

Systems mentioned

Google SRE book (2016) — foundational reference. Its publication is what triggered the Zalando attempt.
Google SRE workbook — cited for the specific guidance to separate reliability work from product delivery in the reporting chain.
SLO Reporting tool (SLR) — Zalando's first in-house SLO reporting tool, built during this period. No further detail disclosed; not a named product.

Concepts extracted

concepts/sre-organizational-evolution — adds the failed-grassroots data point to the three-phase model.
concepts/service-level-objective — first wiki page.
concepts/service-level-indicator — first wiki page.
concepts/you-build-it-you-run-it — first wiki page. The ownership model gets its canonical origin story here.
concepts/on-call-rotation — first wiki page. Notes the scaling break point: 5 teams covered a monolith fleet; the same 5 teams could not cover a microservice fleet.
concepts/microservices-migration — first wiki page (minimum viable). Zalando's 2016 migration is the backdrop.

Patterns extracted

patterns/grassroots-sre-rollout — first wiki pattern page. Bottom-up coalition pitches SRE to management, secures charter, rolls out SLOs + workshops.
patterns/sre-team-per-product-cluster — first wiki pattern page. The middle-ground team structure Zalando chose. End-to-end domain responsibility; no lone-SRE anti-pattern; no impossible central team.

Operational numbers

1,000+ engineers at Zalando in 2016 when the SRE initiative started.
5 on-call teams covered the monolith fleet pre-cloud. Explicitly named as the number that could not scale into the microservice era.
Hubs at the time of the survey: Helsinki, Dublin, Dortmund (plus Berlin HQ, implied).
Reliability Workshop scope: "the more critical services" — exact count not disclosed.
Patterns covered in workshops: Retry Strategies, Circuit Breakers, Fallbacks (three named).
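
The post names the three workshop patterns but does not show how Zalando implemented them. As a rough illustration of how they fit together, here is a minimal sketch under stated assumptions: the `CircuitBreaker` class, the `retry` and `with_fallback` helpers, the thresholds, and the `fetch_recommendations` call are all hypothetical and are not taken from the post or from any Zalando library.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Opens after consecutive failures; allows one trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry fn with exponential backoff and full jitter; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has cut off
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def with_fallback(fn, fallback):
    """Return fn(), or a degraded-but-valid answer if fn fails."""
    try:
        return fn()
    except Exception:
        return fallback()


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)


def fetch_recommendations():
    # Hypothetical downstream call, hard-coded to fail to simulate an outage.
    raise ConnectionError("downstream unavailable")


def get_recommendations():
    return with_fallback(
        lambda: retry(lambda: breaker.call(fetch_recommendations), sleep=lambda s: None),
        fallback=lambda: [],  # an empty list stands in for a degraded response
    )


print(get_recommendations())
```

The layering is a design choice in this sketch: the breaker wraps the raw call, retries wrap the breaker so that an open circuit short-circuits them, and the fallback sits outermost so the caller always gets a usable answer.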

Caveats

Single company, single retrospective written 5 years after the fact. Memoir, not measurement.
No post-mortem metrics on the failed attempt — Zalando doesn't disclose SLO adoption %, workshop attendance, or incident-rate change. The "it failed" conclusion is narrative, not quantitative.
Part I of III. Some claims (e.g. that "you build it, you run it" stuck) are promised to be elaborated in Parts II and III.
The post uses "SRE" somewhat loosely — sometimes the role, sometimes the team structure, sometimes the practice. Read generously.
Zalando's conclusion (delivery-team on-call as the resolution) is not a universal recommendation. It worked in a culture explicitly described as one that "does not shy away from change". Other cultures may need different interventions.

Contradictions

vs. concepts/sre-organizational-evolution (original Cyber Week source): The three-phase model presents SRE adoption as monotonically upward (Phase 1 → 2 → 3). This post shows that Phase 1 can fail and rewind: Zalando's 2016 attempt stalled, the ownership model changed to delivery-team on-call, and SRE only re-emerged later under the new constraint. The phase model should be read as the usual trajectory, not the only one.

Source

Original: https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html
Raw markdown: raw/zalando/2021-09-12-tracing-sres-journey-in-zalando-part-i-e9093c3f.md