ZALANDO 2021-09-20

Zalando: Tracing SRE's Journey in Zalando Part II

Summary

Christos Koutsiaris + Tanya Koutsouraki (Zalando, 2021-09-20) narrate Zalando's second SRE iteration (2018–2019) after the 2016 grassroots cohort wound down. Two sister SRE teams bootstrap in Q1 2018 — SRE Enablement in Digital Foundation (DF, central functions) and Digital Experience SRE (DX, customer-facing Fashion Store) — each with only 2 engineers, united under an SRE Program for 2018. In 2019 the two teams merge into a single DF team of 7. The post canonicalises Zalando's rollout of Distributed Tracing as the fleet-wide observability substrate, and four novel platform capabilities built on top of it: Adaptive Paging, a Throughput Calculator for load-test capacity planning, Symptom-Based Alerting strategy, and a pivot from service-based to Operation-Based SLOs. Source page for the Zalando SRE evolution's Phase 2 → Phase 3 transition and the Critical-Business-Operation primitive that underlies multiple downstream tools.

Key takeaways

Q1 2018 two sister SRE teams bootstrap simultaneously — SRE Enablement (DF) reimagined top-down by management to raise the bar on monitoring / incident response / chaos engineering / resilience; Digital Experience SRE (DX) formed from grassroots initiative. Each had 2 engineers, shared goals, differing scopes (DF = company-wide, DX = one department). Teams united under an SRE Program (1 Lead + 1 Program Manager + 4 Engineers = 6 people total) to align efforts while belonging to different reporting chains (Source: sources/2021-09-20-zalando-tracing-sres-journey-part-ii).
SRE is Enablement, not On-call-takeover — the new name ("SRE Enablement") differentiates from the 2016 iteration. The team does not take on-call from service teams (2017 decision to grow on-call teams stuck); instead it ships cross-cutting capability. "The challenge that gave purpose to the Enablement team was raising the bar on our operational practices."
Carefully picking battles > accepting everything — with 6 people across two teams and a list of "all topics that were SRE relevant," they dropped many topics they wanted to work on. "That careful selection contributed significantly to the success of the Program, and the reputation we built for the SRE name within the company." The three dimensions used: likelihood of success + company priorities + enablement value (does the partnering team learn / does the SRE team have to do it all themselves?).
2018 SRE Program initiative list: rollout of Distributed Tracing, Page Load Time improvements, staffing the newly created Incident Commander role (on-site during Black Friday in a dedicated Situation Room), Cyber Week prep / Load Tests, and efficiency projects with significant cloud-cost savings while preserving reliability.
SLOs have existed at Zalando since 2016 but were not used for prioritization: "Despite the growing number of SLOs, they were still not used to help the teams strike a balance between feature development and operational improvements." Two problems: (a) with more than 4,000 services, not all are equally important; (b) service-based SLOs don't map cleanly to user journeys.
Service Tier definitions published — service-tier classification by criticality to structure the SLO portfolio; scope-limited to DX department (no company-wide mandate; rolling out for 4,000+ services infeasible).
New SLO reporting tool with canonical SLIs — built for the Service Tier rollout with standardised Service Level Indicators; Tier-aware reporting. Scope: DX department services only. (Zalando systems/zalando-slo-reporting-tool.)
SRE Guild as cultural-change agent — SRE Program takes ownership of the dormant-since-2016 SRE Guild; self-organized knowledge-exchange group with regular sessions, talks by Program members and other engineers, Postmortems as a standing topic. Format still in place in 2021.
2018 program's organizational inconsistency problem — teams in Zalando seeking SRE guidance "not knowing which team to reach out to, or even that there were 2 separate teams." The DX team could ship department-specific solutions; the DF team had to ship company-wide → the two sometimes gave inconsistent guidance on the same question. Drives the 2019 merger decision.
2019 — two SRE teams officially merge into single DF team — "With this merger, SRE now had a single voice in the company." patterns/unified-sre-team-over-federated — Phase-2 → Phase-3 transition in concepts/sre-organizational-evolution terminology.
Distributed tracing as post-2018 platform substrate — four value-adds beyond debugging: (1) faster incident response ("a fundamental tool for incident response because it allowed for quicker insights, saving time from incidents"); (2) growing coverage; (3) Zalando-specific Semantic Conventions (in addition to OpenTracing standardised conventions); (4) API to consume tracing data — which "allowed the SRE team to build additional value from it." The API is the load-bearing primitive: without it, Adaptive Paging / Throughput Calculator / Operation-based SLOs would all be infeasible.
Adaptive Paging as first tracing-driven ops primitive — "an Alert Handler called Adaptive Paging which monitors the error rate of what we call Critical Business Operations (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem." Presented at SRECon'19 EMEA (Mineiro presentation). Post explicitly positions this as "a game changer in our push for a different alerting strategy: Symptom Based Alerting."
Critical Business Operations (CBO) as the symptom-level primitive — the alertable unit is not a service's error rate but a business operation's (checkout, order placement, item view) error rate, measured at the top-level span of the traced request. One CBO error-rate alert fans out into a per-firing paging decision via trace-graph traversal. See concepts/critical-business-operation.
Throughput Calculator: built on tracing data for Cyber Week load-test capacity planning. "By applying the expected throughput for a CBO, we could estimate the impact on all the components that are part of the same journey, usually through cascading remote procedure calls." Input: expected CBOs/min; output: projected per-downstream-service RPS fan-out. Load-bearing because Cyber Week traffic is driven by top-of-funnel user behavior (browsing → add-to-cart → checkout ratios), not directly by individual-service load. (Zalando systems/zalando-throughput-calculator.)
2019 pivot from service-based to Operation-Based SLOs: "we made a significant change in our SLO strategy. We moved away from service based SLOs, and started rolling out Operation based SLOs." Teaser reference to a 2022-04 follow-up blog post. The SLO now measures the user-visible business operation's success rate end-to-end, not the individual service's availability. Tracing + CBO taxonomy make this computable. See concepts/operation-based-slo.
Hiring stays uncompromising, team size caps natural growth — team grew through internal + external hiring to 7 SREs by 2019. "The combination of the required skill set for an SRE at Zalando and the different definitions of the SRE role across the industry, means many candidates do not meet the bar." Explicitly rejects growing-engineers-into-SRE at this scale: "with our reduced size we could not provide an effective mentorship. Any engineers we would hire needing that mentorship would not be set up for success."
2018 vs 2019 difference: "In 2018 we worked exclusively on topics that SRE did not own. We were a mix of a consulting team and a kitchen sink team." In 2019 the team starts working on its own products and, with that, starts owning a roadmap.
Success brings demand-side scaling problem — "the team became increasingly more in demand from different parts of the organization. Our help was requested to improve Operational Excellence in departments, to assist in the roll out of major launches, to review Technical Design Documents, to help in PostMortem investigations, Cyber Week preparations, Production Readiness Reviews…" Continued careful battle-picking required. "Accepting every challenge with our reduced capacity meant that we would likely do a poor job in all of them. And anything in our backlog that we had promised and wouldn't deliver would also affect our reputation."
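
The trace-driven paging decision described in the Adaptive Paging and CBO takeaways can be sketched roughly as follows. This is only an illustration under stated assumptions, not Zalando's implementation: the post does not describe the traversal heuristics, so the rule used here (keep descending into failing child spans and page the owner of the deepest one) is an assumption, and the `Span` type, its fields, and all operation and team names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical, simplified span: one operation inside a traced request."""
    operation: str
    owner_team: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def team_closest_to_problem(root: Span) -> str:
    """Return the owner of the deepest failing span below a failing root span.

    Assumed heuristic: an error that merely propagated upward from a failing
    child is attributed to that child, so we keep descending while any child
    span is also marked as failed.
    """
    current = root
    while True:
        failing = [child for child in current.children if child.error]
        if not failing:
            return current.owner_team
        current = failing[0]  # simplification: follow the first failing child


# A failing "place_order" CBO whose error originates two hops downstream.
trace = Span("place_order", "checkout-team", error=True, children=[
    Span("reserve_stock", "stock-team"),
    Span("authorize_payment", "payment-team", error=True, children=[
        Span("call_psp_gateway", "psp-integration-team", error=True),
    ]),
])

print(team_closest_to_problem(trace))  # psp-integration-team
```

In this sketch the alert fires on the top-level `place_order` span, but the page goes to the owner of the deepest failing span rather than to the team that owns the entry point. A real handler would need a rule for several failing siblings and for spans without a known owner; neither is covered here.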

Systems named

Zalando Adaptive Paging — alert handler monitoring Critical Business Operations error rate; traverses trace to page team closest to root cause. Built on top of the Distributed Tracing API. Presented at SRECon'19 EMEA.
Zalando Throughput Calculator — estimates per-downstream-service RPS fan-out from expected CBOs/min; used for Cyber Week load-test capacity planning.
Zalando SLO Reporting Tool — DX-scoped SLO reporting with canonical SLIs keyed by Service Tier classification.
OpenTracing — the tracing substrate; Zalando extended it with Zalando-specific Semantic Conventions.
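
The Throughput Calculator's input/output relationship (expected CBO executions per minute in, projected per-service request rates out) can be illustrated with a small sketch. It assumes that an average calls-per-CBO ratio for each downstream service has already been derived from tracing data; the table `CALLS_PER_CBO`, the function name, the service names, and every number below are hypothetical and not taken from the post.

```python
from typing import Dict

# Hypothetical fan-out ratios: average number of calls each downstream
# service receives per execution of a CBO, as might be derived from traces.
CALLS_PER_CBO: Dict[str, Dict[str, float]] = {
    "place_order": {
        "cart-service": 1.0,
        "payment-service": 1.5,
        "stock-service": 3.0,
    },
}


def projected_rps(cbo: str, expected_per_minute: float) -> Dict[str, float]:
    """Project requests per second per service from expected CBO executions per minute."""
    per_second = expected_per_minute / 60.0
    return {
        service: per_second * ratio
        for service, ratio in CALLS_PER_CBO[cbo].items()
    }


estimate = projected_rps("place_order", expected_per_minute=6000)
for service, rps in estimate.items():
    print(f"{service}: {rps:.0f} RPS")
```

With 6,000 expected order placements per minute (100 per second), the assumed ratios project 100, 150, and 300 requests per second for the three services. A linear model like this ignores caching, retries, and batching, so it is at best a first estimate for sizing load tests.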

Concepts named

Critical Business Operation (CBO) — the symptom-level alertable primitive; a user-facing business operation (e.g. checkout) whose error rate is monitored at the top-level span.
Symptom-Based Alerting: alert on user-visible symptoms (CBO error rate), not on individual-service cause metrics; pairs with adaptive paging to resolve the "but then who do we page" gap that symptom-based alerting opens.
Operation-Based SLO — SLO defined per business operation (end-to-end user journey), not per service; made computable by distributed tracing + CBO taxonomy.
Service Tier classification — criticality-based tiering of services to structure SLO priorities.
SRE Program — multi-team coordination structure; aligns SRE efforts across different reporting chains.
Adaptive Paging — canonical already; extended by this source.
SRE organizational evolution — Phase 2 → Phase 3 transition (two-team federated → unified department) narrated directly.
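
As a rough illustration of how an Operation-Based SLO differs from a service-based one, the sketch below computes a success ratio for one CBO from the top-level spans of traced requests and compares it with a target. The post does not explain how Zalando computes these SLOs, so the data shape, the function names, and the 99.5% target are assumptions chosen only for the example.

```python
from typing import Iterable, Tuple


def operation_sli(root_spans: Iterable[Tuple[str, bool]], cbo: str) -> float:
    """Success ratio of one CBO, counted on top-level spans only.

    Each item is (operation_name, failed) for the root span of one trace,
    so a request counts once no matter how many services it touched.
    """
    total = 0
    failed = 0
    for operation, is_error in root_spans:
        if operation == cbo:
            total += 1
            failed += int(is_error)
    if total == 0:
        raise ValueError(f"no traces observed for {cbo!r}")
    return (total - failed) / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured success ratio is at or above the target."""
    return sli >= slo_target


# Hypothetical window: 1,000 order placements, 4 of which failed end-to-end,
# plus unrelated traffic that must not influence the result.
window = [("place_order", False)] * 996 + [("place_order", True)] * 4
window += [("view_item", True)] * 50

sli = operation_sli(window, "place_order")
print(f"place_order SLI: {sli:.3f}, meets 99.5% target: {meets_slo(sli, 0.995)}")
```

The same ratio, evaluated over a short window and inverted into an error rate, is the kind of signal a symptom-based alert on a CBO could watch.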

Patterns named

Unified SRE team over federated — named pattern: federated SRE teams in different reporting chains produce inconsistent guidance and identity confusion; merging under one management chain resolves both costs while keeping the team small.
Annual peak event as capability forcing function — extended with 2018 Cyber Week's Distributed Tracing rollout + Incident Commander staffing + Situation Room.
Situation room for peak event — Black Friday 2018 dedicated Situation Room staffed by SREs in Incident Commander role.

Operational numbers

Two 2-engineer SRE teams bootstrap simultaneously in Q1 2018 (SRE Enablement in DF, Digital Experience SRE in DX).
6 SRE Program members in 2018: 1 Lead + 1 Program Manager + 4 Engineers.
7 SREs in the unified 2019 team (DF department).
>4,000 services in Zalando's service landscape — reason Service Tier + SLO reporting rollout scoped to DX department only.
100+ on-call teams by 2020 (from prior 2020-10-07 post and axis-5 context) — informs the scale at which Adaptive Paging / Symptom-Based Alerting become load-bearing.
2018 initiatives list: Distributed Tracing rollout, Page Load Time, Incident Commander staffing, Cyber Week Load Tests, efficiency projects with significant cloud-cost savings.
2019 initiatives list: Adaptive Paging (SRECon'19 presentation), Throughput Calculator, Operation-Based SLO rollout; team grows through hiring.

Caveats

Service Tier + SLO reporting tool scope limited to DX department: "Attempting to roll this out for the entire company (>4000 services) would not be feasible." The company-wide counterpart is not solved in Part II; it remains a classification gap that the SRE team acknowledges but cannot staff.
Part-of-a-series: Part I (2021-09-12) covers the 2014–2017 grassroots arc; Part III (2021-10-14) covers 2020-onward. This Part II is 2018–2019 exclusively. Earlier and later context is outside scope here.
2022-04 Operation-Based SLO post is the technical deep-dive companion — this Part II only announces the strategy change, not how operation-based SLOs are computed from tracing data or how they interact with service-level ownership. Left as a forward reference.
No numbers on CBO taxonomy size — how many CBOs does Zalando define? What's the catalogue authoring process? Unaddressed.
SRECon'19 talk (Mineiro) is the authoritative Adaptive Paging technical reference: the blog post names it but doesn't detail the heuristics inside the trace-graph traversal that scores "team closest to the problem." See concepts/adaptive-paging for the canonical heuristic-properties coverage.
Symptom-Based Alerting strategy slides at github.com/zalando/public-presentations — referenced as "slides of one of the talks we did on this topic"; not reproduced in the blog post.
"Why did the 2016 SRE initiative wind down" — not covered here beyond "the plans for SRE in Zalando having to change." Part I presumably covers the 2017 retrospective.
Team-demand explosion risk acknowledged, not solved — Part II ends with the SRE team being "increasingly more in demand" and having to continue careful prioritization; the structural answer (platform-ify the recurring requests, hire liaison SREs per department, etc.) is not in scope.

Source

Original: https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html
Raw markdown: raw/zalando/2021-09-20-tracing-sres-journey-in-zalando-part-ii-2e148c98.md

companies/zalando
concepts/adaptive-paging · concepts/sre-organizational-evolution · concepts/critical-business-operation · concepts/symptom-based-alerting · concepts/operation-based-slo · concepts/service-tier-classification · concepts/sre-program · concepts/production-readiness-review · concepts/alert-fatigue · concepts/observability · concepts/traffic-source-tagging-in-traces
patterns/unified-sre-team-over-federated · patterns/annual-peak-event-as-capability-forcing-function · patterns/situation-room-for-peak-event
systems/zalando-adaptive-paging · systems/zalando-throughput-calculator · systems/zalando-slo-reporting-tool · systems/opentracing · systems/opentelemetry
sources/2020-10-07-zalando-how-zalando-prepares-for-cyber-week — prior Zalando SRE / Cyber Week source; Adaptive Paging first named there, this post canonicalises the CBO + Symptom-Based-Alerting + Operation-based-SLO stack it sits in.