Octopus (Zalando Experimentation Platform)
What it is
Octopus is Zalando's in-house A/B testing and experimentation platform. The first version was released in 2015 and named after Paul the Octopus, the octopus that became famous during the 2010 FIFA World Cup for picking match winners with a low error rate. The platform's architecture has three parts (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution):
- Experiment management — configure, schedule, audit A/B tests from one place.
- Experiment execution — randomization engine that assigns users to variants; latency-sensitive for applications like product-detail-page variants.
- Experiment analysis — runs statistical tests on the collected tracking events and surfaces results to experimenters.
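The execution part can be illustrated with a minimal sketch of deterministic, hash-based variant assignment. This is an assumption for illustration only: the source does not describe how Octopus's randomization engine works, and the function name `assign_variant` and its parameters are hypothetical. The sketch shows why such an assignment can be latency-friendly: it needs no lookup, only a hash of the user and experiment identifiers.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str, experiment_id: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hypothetical helper, not Octopus's actual API.
    """
    # Salt the hash with the experiment id so that a user's assignment
    # in one experiment does not determine their assignment in another.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it
    # to an index into the list of variants.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    print(assign_variant("user-1", "exp-42", ["control", "treatment"]))
```

Because the result depends only on the inputs, the same user sees the same variant on every request, and the assignment can be recomputed later during analysis.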
Why it exists
Before Octopus, each team at Zalando set up its A/B tests manually. This caused two problems: (a) test quality could not be guaranteed, and (b) the company did not even know whether teams actually ran tests before making product decisions. Octopus centralises randomization, analysis methods, and KPI definitions, turning A/B testing into a standard org-wide primitive (see patterns/centralized-experimentation-platform).
Key architectural choice: open-source stats library + production wrapper
The inaugural team combined engineers and data scientists with little overlap in domain knowledge: the scientists didn't know Scala, and the engineers didn't know statistics. The team decoupled the two workstreams by building an open-source statistics library that the Scala backend wraps as a production service (see patterns/open-source-wrapped-by-production-system). This let each subgroup iterate in its native language and tooling without blocking the other.
Default statistical method
Octopus runs a two-sided t-test at 5% significance by default. Non-inferiority tests and Bayesian methods are identified as improvement areas in peer review (see concepts/non-inferiority-test).
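As a rough illustration of a two-sided test at the 5% level, the sketch below computes a Welch-style t statistic and a two-sided p-value. Two assumptions are worth stating loudly: it is not taken from Octopus's statistics library, and it swaps the exact t distribution for a normal approximation, which is only reasonable for large samples. The function name `two_sided_test` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def two_sided_test(control, treatment, alpha=0.05):
    """Return (t_stat, p_value, significant) for a difference in means.

    Uses the Welch t statistic with a normal approximation for the
    p-value (an assumption that suits large samples only).
    """
    n_c, n_t = len(control), len(treatment)
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, from unpooled sample variances.
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    t_stat = diff / se
    # Two-sided p-value: probability mass in both tails beyond |t_stat|.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, p_value < alpha


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5] * 20
    treatment = [x + 1 for x in control]
    print(two_sided_test(control, treatment))
```

The test is two-sided because it rejects for a large difference in either direction, so a treatment that hurts the KPI is flagged as readily as one that helps.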
Platform features accumulated over time
- Traffic ramp-up for controlled rollouts — gradually increase the fraction of users exposed to a variant (see patterns/controlled-rollout-with-traffic-rampup).
- Feature toggles as first-class primitives (Octopus cites Fowler's canonical definition).
- Quasi-experimental methods — guidelines and software packages for teams whose use case cannot be cleanly A/B-tested (e.g. comparing two countries) (see concepts/quasi-experimental-methods).
- Automated sample ratio mismatch (SRM) alerts — Octopus automatically raises an alert to the affected team when SRM is detected, requiring data investigation before results are released (see patterns/automated-srm-alert, concepts/sample-ratio-mismatch).
- A/B-test design audit process + weekly consultation hours (see concepts/ab-test-design-audit).
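The sample ratio mismatch check mentioned in the list above can be sketched as a chi-square goodness-of-fit test on the observed variant counts. This is a minimal sketch under stated assumptions, not Octopus's implementation: the function name `srm_check`, the restriction to two variants, and the 0.001 alert threshold are all illustrative choices that the source does not specify.

```python
import math


def srm_check(observed_counts, expected_ratios, threshold=0.001):
    """Return (chi2, p_value, srm_detected) for a two-variant experiment.

    Hypothetical helper; the threshold is an assumed alerting level.
    """
    if len(observed_counts) != 2 or len(expected_ratios) != 2:
        raise ValueError("this sketch only handles two variants")
    total = sum(observed_counts)
    chi2 = 0.0
    for observed, ratio in zip(observed_counts, expected_ratios):
        expected = total * ratio
        chi2 += (observed - expected) ** 2 / expected
    # With two variants there is one degree of freedom, where the
    # chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    # A planned 50/50 split that arrived as 5300 vs 4700 users.
    print(srm_check([5300, 4700], [0.5, 0.5]))
```

A very small p-value means the observed split is unlikely under the configured ratio, which points to a problem in assignment or tracking rather than a treatment effect.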
The analysis-system rewrite
Octopus's initial analysis system hit architectural limits as the number of concurrent A/B tests grew. Maintenance costs rose so high that the team lost the capacity to improve analysis methods. Zalando rebuilt the analysis system on Spark, a project that took about two years (see systems/apache-spark; technical details are promised in a subsequent post in the series).
Operational numbers
- 5%: default t-test significance level
- ~3 weeks: median A/B test runtime at Zalando, longer than at industry peers (planned work to shorten it: variance reduction, Bayesian methods, and multi-armed bandits)
- 20%+: historical SRM rate at Zalando before remediation (industry peers: 6–10%); this prompted work across the org to re-unify data-tracking schemas
- ~2 years: duration of the analysis-system rebuild on Spark
Seen in
- sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution — Part 1: evolution / org lessons
Related
- companies/zalando
- systems/apache-spark — new analysis-system substrate
- patterns/centralized-experimentation-platform
- patterns/controlled-rollout-with-traffic-rampup
- patterns/open-source-wrapped-by-production-system
- patterns/automated-srm-alert
- concepts/experimentation-evolution-model-fabijan
- concepts/sample-ratio-mismatch
- concepts/experimentation-culture
- concepts/ab-test-design-audit
- concepts/overall-evaluation-criterion
- concepts/non-inferiority-test
- concepts/quasi-experimental-methods