
ZALANDO 2020-10-07


Zalando — How Zalando prepares for Cyber Week

Summary

Christos Koutsiaris (Zalando, 2020-10-07) gives a three-theme retrospective on how Zalando's engineering organization evolved its preparation for Cyber Week — the Black Friday / Cyber Monday peak — across six years. The architecturally load-bearing content is a phased evolution of two disciplines, SRE organizational evolution and live load testing in production, plus a named alert-routing mechanism, adaptive paging, that uses OpenTracing causality data to page the team closest to a problem instead of the alert owner. Scale numbers that anchor the piece: 840,000 new customers in 2019 Cyber Week; GMV up 32% YoY; peak orders per minute 7,200 vs 4,200 the year before (+71% YoY); ~100 on-call teams today vs a handful six years prior; 1,122 applications in scope of Cyber Week preparations out of a 4,000+ app landscape.

The SRE evolution runs in three phases: (Phase 1) a grassroots team of 10 SRE-passionate engineers runs production readiness reviews (the practice named in the Google SRE book) across the fleet ahead of Cyber Week; (Phase 2) OpenTracing-based distributed tracing is rolled out — first to tier-1 hot-path browse services, then to tier-2 — with explicit traffic-source tagging (App / Web / push / load-test) to support capacity planning; (Phase 3) the grassroots effort becomes a dedicated SRE department running the SRE guild, observability infrastructure, and the adaptive paging alert handler — a single alerting rule that runs heuristics over tracing causality plus OpenTracing semantic conventions to page the most probable cause's team rather than the alert owner. Adaptive paging is cited as a direct reduction in alert fatigue; the SRECon EMEA 2019 talk "Are We All on the Same Page? Let's Fix That" is the canonical external reference.

The load testing evolution runs in parallel: (Phase 1) an over-provisioned on-prem Postgres + Solr fleet survived the 2015 Cyber Week by luck ("with no past knowledge about what type of traffic to expect we were amazed how much more headroom our backend systems really had"), with one live-system tuning (pausing non-essential async processing) as the escape valve; (Phase 2) in-house simulators — one placing test sales orders and skipping fulfillment, plus a user-journey simulator driving the shop's key touchpoints — run live in production, mirroring sales-event traffic patterns per country; (Phase 3) the in-house simulator is replaced with an off-the-shelf product after tuning costs prove too high — a decision that paid off by enabling simultaneous load tests of the App and Web platforms. The discipline produced confidence-level capacity-planning numbers ("the platform was scaled to sustain a certain amount of incoming traffic and sales in the peak minute") that gated the commercial team's planning.

The article closes on Cyber Week as a project-management forcing function: because preparation is dedicated (Program Managers, escalated attention, cross-team structure), each year the org can invest in one new capability (resilience engineering knowledge, load testing infrastructure, capacity planning, production-readiness reviews, collaboration patterns) that then serves the rest of the year. During the event itself, Zalando runs a Situation Room: key engineering reps, SRE team, and dedicated Incident Commanders watching dozens of screens in a control-center format (2019 had it physical; 2020 forced a rethink with remote work).

Key takeaways

  • Adaptive paging: trace-causality-driven alert routing as an alert-fatigue fix. Zalando's named alert handler runs one alerting rule that "leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner." This is a distinct primitive from both routed alert rules (static mapping from alert → team) and single-owner paging: it uses live trace data to compute the probable cause at alert time. Canonicalised as concepts/adaptive-paging. Cross-link: concepts/alert-fatigue, systems/opentracing, concepts/observability.
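The post describes the heuristics only as "a set of heuristics", so the following is a minimal sketch of the core idea under stated assumptions: walk the trace's span tree from the alerting service toward the deepest erroring dependency, and treat that span's owning service (hence team) as the probable cause. The `Span` shape, the `error` flag, and the service names are all hypothetical illustrations, not Zalando's actual model.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str            # would map to an owning team via a service catalog
    error: bool = False     # e.g. the OpenTracing `error` semantic tag
    children: list = field(default_factory=list)

def probable_cause(span: Span) -> str:
    """Heuristic sketch: the deepest erroring span reachable through
    erroring children is the most probable cause; page its service's
    team instead of the alert owner."""
    for child in span.children:
        if child.error:
            return probable_cause(child)
    return span.service

# Hypothetical trace: checkout alerts, but the failure originates deeper.
trace = Span("checkout", error=True, children=[
    Span("cart", error=False),
    Span("payments", error=True, children=[
        Span("fraud-check", error=True),
    ]),
])

print(probable_cause(trace))  # → fraud-check
```

A real implementation would have to handle multiple concurrent error paths, missing spans, and confidence thresholds — exactly the kind of detail deferred to the SRECon talk.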

  • SRE as a three-phase organizational evolution: grassroots → embedded practice → dedicated department. Zalando's path (2014–2020) canonicalised as concepts/sre-organizational-evolution: (P1) 10-person volunteer team running production-readiness reviews to educate on-call teams on reliability patterns (dependency-failure handling, overload, timeouts); (P2) observability primitive (OpenTracing) rolled out with tier-gated scope expansion; (P3) a formal SRE org owning monitoring / logging / tracing infra + the SRE guild (knowledge exchange + best-practice formulation). "What started as a grass-roots movement around SRE practices in Phase 1, has evolved to a SRE department within Zalando." Generalises to any eng org scaling from <10 → ~100 on-call teams.

  • Production Readiness Review (PRR) as a pre-peak gate. Zalando references the Google SRE book's PRR practice directly (landing.google.com/sre/sre-book/chapters/evolving-sre-engagement-model) and applies it at fleet scale: identifying "clusters of applications that required adjustments, so that the platform is stable in case of various failure types." Canonicalised as concepts/production-readiness-review; Cyber Week is the forcing function that makes the review non-optional.
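The post names the failure classes a PRR probes (dependency failures, overload, timeouts) but not a concrete checklist. A hedged sketch of how such a review could be made machine-checkable at fleet scale — all item names here are illustrative assumptions, not Zalando's actual criteria:

```python
# Hypothetical PRR items; the post only names the failure classes
# (dependency-failure handling, overload, timeouts), not a checklist.
PRR_CHECKLIST = {
    "client_timeouts_set": "every outbound call has an explicit timeout",
    "dependency_fallbacks": "degrades gracefully when a dependency fails",
    "load_shedding": "rejects excess load instead of queueing unboundedly",
    "alerting_wired": "on-call team is paged on SLO breach",
}

def review(app: dict) -> list[str]:
    """Return the checklist items an application still fails."""
    return [item for item in PRR_CHECKLIST if not app.get(item, False)]

app = {"client_timeouts_set": True, "alerting_wired": True}
print(review(app))  # → ['dependency_fallbacks', 'load_shedding']
```

At 1,122 in-scope applications, clustering apps by which items they fail ("clusters of applications that required adjustments") is what makes the review tractable.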

  • Traffic-source tagging in traces enables capacity planning. Zalando's second-phase OpenTracing rollout adopted conventions to tag each request's originating traffic class — App / Web / push notifications / load tests — in its span metadata. "This allows us to better understand traffic patterns and perform capacity planning based on the request ratios between incoming traffic and the respective parts of our platform." A named practice distinct from generic trace-tagging: canonicalised as concepts/traffic-source-tagging-in-traces. The load-test tag is especially load-bearing — traces from live production load tests have to be distinguishable from real user traffic to avoid corrupting capacity-planning dashboards.
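The post does not give the tag key or span API used; a minimal sketch of the practice, assuming a hypothetical `traffic.source` tag on each span's metadata and showing why the load-test class must be filtered out of capacity-planning ratios:

```python
from collections import Counter

# Hypothetical tag key; the post names the traffic classes but not the key.
TRAFFIC_SOURCE_TAG = "traffic.source"
TRAFFIC_CLASSES = {"app", "web", "push", "load-test"}

def tag_span(span_tags: dict, source: str) -> dict:
    """Attach the originating traffic class to a span's tag map."""
    assert source in TRAFFIC_CLASSES
    span_tags[TRAFFIC_SOURCE_TAG] = source
    return span_tags

def request_ratios(spans: list) -> dict:
    """Capacity-planning input: share of traffic per class, excluding
    load-test traffic so synthetic load never skews the real baseline."""
    real = [s[TRAFFIC_SOURCE_TAG] for s in spans
            if s.get(TRAFFIC_SOURCE_TAG) != "load-test"]
    counts = Counter(real)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

spans = [tag_span({}, s) for s in
         ["app", "app", "web", "push", "load-test", "app", "web"]]
print(request_ratios(spans))  # shares of app/web/push over 6 real requests
```

The design point is the exclusion filter: once live load tests run in production, every capacity dashboard needs this distinction or synthetic traffic inflates the observed baseline.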

  • Live load testing in production is the only viable capacity test at Zalando's scale. Explicit stated claim: "we tried many approaches and given our experience, the only way we found effective for a large-scale system like ours are live load tests in production." The implementation is two simulators: (1) a sales-order simulator that places test orders on clearly-distinguishable test products, processed through inventory + payment + stopped at fulfillment; (2) a user-journey simulator that drives the key customer touchpoints across all countries with sales-event traffic shape. "Mistakes become really costly as the customer experience is degraded and thus this approach requires the ability to quickly notice customer impact and react by aborting the test or mitigating the incident otherwise." Canonicalised as patterns/live-load-test-in-production. Complements the pre-existing patterns/load-test-at-scale (which is about pre-migration load testing) at a different altitude: this one is ongoing capacity-planning discipline.
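The sales-order simulator is described only at paragraph level; a hedged sketch of its control flow — test orders on marked products exercise inventory and payment for real but are stopped before fulfillment. The SKU prefix and function names are illustrative assumptions:

```python
from dataclasses import dataclass

# Hypothetical marker; the post only says "clearly distinguishable
# test products", not how they are distinguished.
TEST_SKU_PREFIX = "LOADTEST-"

@dataclass
class Order:
    sku: str
    fulfilled: bool = False

def is_test_order(order: Order) -> bool:
    return order.sku.startswith(TEST_SKU_PREFIX)

def process_order(order: Order) -> Order:
    reserve_inventory(order)    # exercised for real, test or not
    capture_payment(order)      # exercised for real, test or not
    if not is_test_order(order):
        order.fulfilled = True  # test orders stop short of fulfillment
    return order

def reserve_inventory(order: Order) -> None:
    pass  # stand-in for the real inventory service call

def capture_payment(order: Order) -> None:
    pass  # stand-in for the real payment service call

real = process_order(Order("SKU-123"))
test = process_order(Order("LOADTEST-123"))
print(real.fulfilled, test.fulfilled)  # → True False
```

The asymmetry is the point: everything upstream of fulfillment sees production-identical load, which is why "mistakes become really costly" and fast abort paths are mandatory.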

  • Build-vs-buy on the load-test harness: built, tuned for two years, then bought. "Having written and evolved the user journey simulator for two years we were not fully satisfied with its abilities to generate load at scale. There were too many rough edges and tuning the simulator to be able to generate the required load profiles and investing our development time was very time consuming. We decided that it's better to leverage an existing product that will do the job better. This paid off heavily as last year we were able to run the tests both on App and Web platforms simultaneously." A direct data point on reusing existing infrastructure at a different granularity — here, retiring an in-house tool in favor of a vendor product once the in-house tool's marginal scaling cost exceeded the vendor's licensing cost.

  • Cyber Week as a capability-investment forcing function. "Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before." Named year-over-year investments: resilience engineering know-how, load testing in production, capacity planning, PRR, cross-team collaboration. Canonicalised as patterns/annual-peak-event-as-capability-forcing-function. The org-design claim: a recurring, business-critical, high-attention event is a mechanism to get platform investment attention that wouldn't otherwise be prioritized.

  • Situation Room as the on-event control-center pattern. "For the key period where we expect the highest load on our systems, we organize a Situation Room to ensure rapid incident response. In the room, we gather representatives from key engineering teams, SRE team, and dedicated Incident Commanders to closely watch the operational performance of our platform. It's basically a control center with dozens of screens and graphs." Canonicalised as patterns/situation-room-for-peak-event. Complements but distinct from general on-call / incident management — the Situation Room is time-bounded (Cyber Week only), staffed by representatives from key teams (not just an on-call), and observationally biased (watching dashboards, not responding to pages).

  • Over-provisioned on-prem → auto-scaling cloud forced capacity testing. Phase-1 framing: "Our backend services [...] however, formed an over-provisioned system with a fixed number of instances in the Data Center." When the org moved to AWS-backed Kubernetes with auto-scaling, the over-provisioning buffer disappeared. Stated rule: "In a cloud-based system that relies heavily on auto-scaling for cost-optimization, proper testing and capacity planning is a must." This is the architectural driver of the live-load-testing discipline — the move to cloud created the need. Generalises: any org moving from over-provisioned on-prem to cloud auto-scaling inherits a load-testing obligation.

Operational numbers

  • 2019 Cyber Week: 840,000 new customers acquired; GMV +32% YoY; peak 7,200 orders/min (vs 4,200 in 2018; +71% YoY).
  • Baseline growth rate: 20–25% YoY (outside of sales events).
  • On-call teams: grew from "a handful" (~6 years prior, 2014) to ~100 teams today (2020).
  • SRE grassroots team: 10 engineers performing production readiness reviews ahead of Cyber Week.
  • Cyber Week scope: 1,122 applications (out of 4,000+ apps in the full Zalando landscape) formally included in Cyber Week prep.
  • Preparation lead-time: seven weeks between the post's publication and Cyber Week 2020.
  • Cloud migration trigger: Solr (Product Data + Search) fleet first to cloud "six years ago" (2014) due to multi-month physical-server lead times in the data center.
  • Stack specifics: distributed tracing via OpenTracing (standard); backend relational store PostgreSQL (heavily sharded, SSD-migrated); per-country load tests.

Caveats

  • No architecture diagrams. The post is a narrative retrospective, not a technical deep-dive. Load-testing tooling (the vendor product that replaced the in-house simulator) is not named; simulator internals are described at a paragraph level, not as pseudo-code. Adaptive paging's heuristics are cited as "a set of heuristics" without enumeration — the SRECon 2019 talk is the deeper reference, not this post.
  • 2020 remote-work footnote is not answered. "This year has an added twist of remote working, which likely will require us to rethink how to organize the Situation Room efficiently." The follow-up post doesn't exist in this ingest; the 2020 Cyber Week outcome is only teased.
  • Tier-2 mix of organizational + architectural content. Some sections are primarily org-structure ("dedicated Program Managers", "every year we tune the structure and reporting within this project") rather than systems. In-scope on balance because the SRE-evolution + load-testing-discipline content is substantive and concrete, but not a pure internals piece like the PgBouncer-on-Kubernetes sibling.
  • Post predates OpenTelemetry maturity. OpenTracing was merged into OpenTelemetry in 2019 and OpenTelemetry has since superseded it as the de-facto standard. The adaptive-paging mechanism is spec-agnostic (it uses span parent-child relationships + semantic convention tags, both carried over to OpenTelemetry), so the primitive remains current.
