Zalando — Experimentation Platform at Zalando: Part 1 - Evolution
Summary
A retrospective by the Zalando experimentation team on six years (2015–2020+) of building an in-house A/B testing platform, structured around Fabijan et al.'s Experimentation Evolution Model (crawl → walk → run). It introduces Octopus, Zalando's in-house A/B testing platform, and walks through the engineering and organisational challenges at each phase: cross-functional knowledge gaps at inception, then scalability + trustworthiness concerns as adoption grew, then advanced statistical methods and data-quality automation as the platform matured. This is Part 1 of a series; subsequent parts cover the experimentation engine, the analysis system (rebuilt on Spark), data quality, and data visualization in technical detail.
Key takeaways
- Centralized experimentation beats ad-hoc team-by-team A/B tests on both test quality and organisational visibility. Before Octopus (pre-2015), each team ran A/B tests manually; Zalando discovered it could neither ensure test quality nor even know whether teams ran tests at all before making decisions. A central platform makes randomization, analysis methods, and KPI definitions uniform. (See patterns/centralized-experimentation-platform.)
- "Open-source statistics library wrapped by production backend" was the first architectural decoupling, letting scientists (Python, statistics) and engineers (Scala, production systems) collaborate without either needing to learn the other's stack. The library became the shared contract between the two subgroups. (See patterns/open-source-wrapped-by-production-system.)
- Sample ratio mismatch (SRM) is the single most important data quality indicator for A/B test trustworthiness. Industry peers report 6–10% of A/B tests affected; Zalando's historical data showed at least 20% of A/B tests had SRM — at least double the industry rate, pointing to data-tracking inconsistency across teams. Octopus now auto-alerts the affected team on detection. (See concepts/sample-ratio-mismatch, patterns/automated-srm-alert.)
- Data tracking consistency is a distributed-systems problem at the org level. Despite a dedicated tracking team ingesting events and unifying schemas, some product teams defined their own tracking event schemas, causing corrupted / missing data downstream. Resolution required extensive cross-team communication + reorganisation — not a platform feature.
- Analysis-system rewrites take years. Octopus's initial analysis system hit scalability limits under concurrent A/B-test load because of architectural constraints; maintenance cost grew so high that improvements to analysis methods stalled. The rebuild on Spark took two years. Lesson: analysis-system scalability is on the critical path for all other experimentation work.
- A two-sided t-test at a 5% significance level is Zalando's default analysis method; non-inferiority tests were identified in peer review as an improvement area. (See concepts/non-inferiority-test.)
- A/B testing is not always the right tool. For comparing performance between two countries (where users cannot be randomised cleanly), Octopus provides quasi-experimental methods instead. Guidelines + software packages help analysts pick the right causal-inference tool. (See concepts/quasi-experimental-methods.)
- Feature toggles + traffic ramp-up are first-class platform features, not application-layer concerns. Octopus exposes controlled-rollout primitives so teams can gradually increase exposure to a variant, coordinate multi-team launches, and avoid accidentally showing buggy variants to many users. (See patterns/controlled-rollout-with-traffic-rampup, citing Martin Fowler's "Feature Toggles".)
- OEC / KPI selection is a team-specific art. Zalando's qualitative guidelines: (a) KPIs should be team-specific and drivable by product features that team controls; (b) KPIs should be proxies for long-term customer lifetime value, not short-term revenue. (See concepts/overall-evaluation-criterion.)
- Median A/B-test runtime at Zalando is ~3 weeks — longer than at peer tech companies. Future work listed: variance reduction, Bayesian analysis, and multi-armed bandits for faster trustworthy experimentation.
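The default analysis above can be sketched as a large-sample two-sided test. This is an illustrative sketch, not Octopus's actual implementation: it uses a normal approximation in place of the t-distribution (an assumption that only holds for large samples), and the data is made up.

```python
import math

def two_sided_pvalue(a, b):
    """Welch-style two-sample test; normal approximation, valid for large samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Illustrative per-user outcome values (deterministic toy data, not real metrics).
control = [float(i % 7) for i in range(2000)]
treatment = [float(i % 7) + 0.5 for i in range(2000)]
p = two_sided_pvalue(treatment, control)
print(p < 0.05)  # decision at Zalando's default 5% significance level
```

The non-inferiority tests mentioned as an improvement area would replace the two-sided alternative with a one-sided "not worse than control by more than a margin" hypothesis.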
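The controlled-rollout primitive can be illustrated with deterministic hash bucketing, a common industry technique; the post does not describe Octopus's internals, so the function and experiment names here are hypothetical:

```python
import hashlib

def in_variant(user_id: str, experiment: str, exposure_pct: float) -> bool:
    """Deterministic bucketing: the same user always gets the same decision,
    and raising exposure_pct only adds users, never reshuffles them."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < exposure_pct / 100.0

# Ramp-up property: every user exposed at 5% stays exposed at 20%.
five = {u for u in map(str, range(10_000)) if in_variant(u, "exp-42", 5)}
twenty = {u for u in map(str, range(10_000)) if in_variant(u, "exp-42", 20)}
print(five <= twenty)
```

Keying the hash on both experiment and user decorrelates bucketing across concurrent experiments, which matters once a platform runs many tests at once.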
Systems and concepts extracted
System: systems/octopus-zalando-experimentation-platform — in-house experimentation platform named after Paul the Octopus, the animal famed for predicting 2010 FIFA World Cup results. Three parts: experiment management, experiment execution, experiment analysis.
Concepts:
- concepts/experimentation-evolution-model-fabijan — the crawl / walk / run framework from Fabijan et al., 2017 (ICSE) that structures this entire post.
- concepts/sample-ratio-mismatch — the data quality indicator; Zalando cites Fabijan 2019 (KDD).
- concepts/experimentation-culture — "data-driven decisions" as an organisational commitment, not a tool.
- concepts/ab-test-design-audit — quality review of hypothesis, problem statement, outcome KPI, runtime, stopping criteria before test goes live.
- concepts/non-inferiority-test — statistical test for "not-meaningfully-worse-than-control", complement to classical two-sided t-test.
- concepts/quasi-experimental-methods — causal inference when randomization is infeasible (country comparisons, marketplace-level treatments).
- concepts/overall-evaluation-criterion — the single composite metric an experiment is decided on; team-specific and LTV-proxying.
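A non-inferiority test (concepts/non-inferiority-test) asks whether treatment is not worse than control by more than a margin. A minimal large-sample sketch, illustrative only — the margin, data, and normal approximation are assumptions, not Zalando's method:

```python
import math

def non_inferior(treatment, control, margin, alpha=0.05):
    """One-sided large-sample test of H0: mean(T) <= mean(C) - margin.
    Rejecting H0 supports "treatment is not meaningfully worse than control"."""
    nt, nc = len(treatment), len(control)
    mt, mc = sum(treatment) / nt, sum(control) / nc
    vt = sum((x - mt) ** 2 for x in treatment) / (nt - 1)
    vc = sum((x - mc) ** 2 for x in control) / (nc - 1)
    z = (mt - mc + margin) / math.sqrt(vt / nt + vc / nc)
    p = math.erfc(z / math.sqrt(2)) / 2  # one-sided upper-tail p-value
    return p < alpha

control = [float(i % 5) for i in range(2000)]
equal_treatment = [float(i % 5) for i in range(2000)]
worse_treatment = [float(i % 5) - 0.5 for i in range(2000)]
print(non_inferior(equal_treatment, control, margin=0.3))  # same mean: non-inferior
print(non_inferior(worse_treatment, control, margin=0.3))  # 0.5 worse: fails the margin
```

This is useful for launches judged on simplicity or cost rather than metric wins, where "no meaningful regression" is the decision criterion.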
Patterns:
- patterns/centralized-experimentation-platform — platformise A/B testing org-wide instead of letting teams roll their own.
- patterns/controlled-rollout-with-traffic-rampup — gradual traffic exposure + feature toggles as platform-provided primitives.
- patterns/open-source-wrapped-by-production-system — the decoupling Octopus used between Python statistics library and Scala backend.
- patterns/automated-srm-alert — automated platform-level alert to experiment owner on SRM detection.
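The automated-srm-alert pattern reduces to a cheap statistical check on assignment counts. A minimal sketch, assuming a two-variant 50/50 split and a large-sample binomial z-test; the alert threshold and function names are illustrative, not Octopus's documented implementation:

```python
import math

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided test: does the observed split match the configured split?"""
    n = n_control + n_treatment
    observed = n_control / n
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    return math.erfc(abs(z) / math.sqrt(2))

def check_srm(n_control, n_treatment, alpha=0.001):
    # A very small alpha keeps false alerts rare across many experiments.
    return srm_pvalue(n_control, n_treatment) < alpha

print(check_srm(50_000, 50_300))  # tiny imbalance: no alert
print(check_srm(50_000, 53_000))  # clear mismatch: alert the owning team
```

The point of automating this is that a 3% assignment imbalance at this scale is statistically unmissable yet invisible to the naked eye, and it invalidates the test's results regardless of how significant they look.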
Operational numbers
- 2015 — first version of Octopus released.
- Early 2020 — observable decrease in the number of running A/B tests at Zalando; two causes: (a) large-scale coordinated product initiatives that were not A/B testable, (b) pause recommendations due to abnormal user behaviour at the COVID-19 onset in Europe.
- 5% — significance level used by Octopus's default two-sided t-test.
- 6–10% — industry-reported rate of A/B tests with sample ratio mismatch (peer companies similar to Zalando).
- 20%+ — Zalando's historical rate of A/B tests with SRM before remediation — ~2× industry rate.
- ~2 years — time to rebuild Octopus's analysis system on Spark.
- ~3 weeks — median A/B-test runtime at Zalando, higher than industry peers.
- 3 — subsystems in Octopus: management, execution, analysis.
Caveats
- 2021-dated post on a 2015–2020 retrospective. The specific analysis methods (two-sided t-test, 5% significance level) reflect the state-of-practice at Zalando circa 2020; more recent experimentation industry work has moved toward Bayesian analysis, CUPED-style variance reduction, and sequential testing as defaults. Zalando lists these as future work.
- Sparse on architecture. This is the evolution post — org-level + platform-level lessons. Technical detail on the experimentation engine, analysis system, data quality, and visualization are deferred to subsequent posts in the series.
- 20% SRM rate is Zalando-specific. Root cause was data-tracking inconsistency across teams that self-defined schemas; not a universal finding. Organisations with a strongly-enforced tracking schema should expect closer to the 6–10% industry number.
- "Median runtime ~3 weeks is higher than peers" is reported without a specific peer benchmark; the direction (want faster tests) is clearer than the magnitude.
- Skipped speculative framings. The post mentions the stable unit treatment value assumption (SUTVA), with cross-device identity resolution as the challenge, as a future-work bullet; the wiki does not file a concept page for it from this source alone, since the article mentions it only at an intro level.
Source
- Original: https://engineering.zalando.com/posts/2021/01/experimentation-platform-part1.html
- Raw markdown:
raw/zalando/2021-01-11-experimentation-platform-at-zalando-part-1-evolution-60b69fd7.md
Related
- companies/zalando — Zalando Engineering on the wiki
- concepts/filter-before-ab-test — Yelp's pre-A/B filter pattern, complementary to this platform's analysis-tool selection (OEC / quasi-experimental)
- concepts/user-split-experiment — the classical shape Octopus implements by default
- patterns/ab-test-rollout — Atlassian's percentile-guardrail A/B rollout pattern, complementary to Octopus's ramp-up + feature-toggle primitives
- concepts/bayesian-optimization-over-parameter-space — adjacent advanced method Zalando lists as future work