SYSTEM

Octopus (Zalando Experimentation Platform)

What it is

Octopus is Zalando's in-house A/B testing / experimentation platform. The first version was released in 2015, named after Paul the Octopus — the octopus famous for predicting match winners with a low error rate during the 2010 FIFA World Cup. The platform's architecture has three parts (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution):

  1. Experiment management — configure, schedule, audit A/B tests from one place.
  2. Experiment execution — randomization engine that assigns users to variants; latency-sensitive for applications like product-detail-page variants.
  3. Experiment analysis — runs statistical tests on the collected tracking events and surfaces results to experimenters.
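The source doesn't describe the randomization engine's internals, but a common low-latency design for the execution step is deterministic hash-based bucketing: no per-user state to look up, and the same user always gets the same variant. A minimal sketch (all names hypothetical):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant bucket.

    Hashing user_id together with experiment_id keeps assignments stable
    within an experiment and statistically independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Same user, same experiment: always the same variant — no storage needed,
# which keeps latency low for pages like the product detail page.
v1 = assign_variant("user-42", "pdp-layout-test", ["control", "treatment"])
v2 = assign_variant("user-42", "pdp-layout-test", ["control", "treatment"])
```

Because assignment is a pure function of the IDs, any service replica can compute it locally, which suits latency-sensitive call sites.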

Why it exists

Before Octopus, A/B tests at Zalando were set up manually by each team. This had two failure modes: (a) test quality could not be guaranteed, and (b) the company could not even tell whether teams actually ran tests before making product decisions. Octopus centralises randomization, analysis methods, and KPI definitions, turning A/B testing into a standard org-wide primitive (see patterns/centralized-experimentation-platform).

Key architectural choice: open-source stats library + production wrapper

The inaugural team (engineers + data scientists, little overlap in domain knowledge: scientists didn't know Scala; engineers didn't know statistics) decoupled their workstreams by building an open-source statistics library that the Scala backend wraps as a production service (see patterns/open-source-wrapped-by-production-system). This let each subgroup iterate in its native language + tooling without blocking the other.

Default statistical method

Octopus runs a two-sided t-test at a 5% significance level by default. Non-inferiority tests and Bayesian methods were identified as improvement areas in peer review (see concepts/non-inferiority-test).
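The source doesn't expose the open-sourced library's API, so here is a minimal Python sketch of the default method: a two-sided Welch t-test, using a normal approximation to the t distribution for the p-value (adequate at typical A/B sample sizes; all names and data hypothetical):

```python
import math
from statistics import mean, variance

def two_sided_t_test(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and a two-sided p-value via the normal
    approximation (close to the exact t distribution for large n)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Two-sided tail probability of a standard normal at |t|.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p

control = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]
treatment = [10.4, 10.6, 10.2, 10.5, 10.3, 10.7, 10.4, 10.6]
t, p = two_sided_t_test(control, treatment)
decision = "reject H0" if p < 0.05 else "fail to reject H0"
```

The 5% threshold corresponds to the `p < 0.05` decision rule above; a non-inferiority test would instead shift the null hypothesis by a margin rather than testing against zero difference.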

Platform features accumulated over time

The analysis-system rewrite

Octopus's initial analysis system hit architectural ceilings when concurrent-A/B-test load grew. Maintenance cost grew so high that the team lost capacity to improve analysis methods. Zalando rebuilt the analysis system on Spark, a project that took ~2 years (see systems/apache-spark; technical details promised in a subsequent post in the series).

Operational numbers

  • 5% — default t-test significance level
  • ~3 weeks — median A/B test runtime at Zalando (higher than industry peers — future work: variance reduction + Bayesian + multi-armed bandit to speed up)
  • 20%+ — historical SRM rate at Zalando before remediation (industry peers: 6–10%); pushed data-tracking-schema re-unification work across the org
  • ~2 years — analysis-system Spark rebuild
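The source doesn't say how Zalando detects SRMs (sample ratio mismatches), but the standard check is a chi-square goodness-of-fit test on assignment counts against the configured split. A minimal two-variant sketch with one degree of freedom (names and the alpha choice are hypothetical):

```python
import math

def srm_detected(n_control: int, n_treatment: int,
                 expected_share: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag a sample ratio mismatch between two variants using a
    chi-square goodness-of-fit test with one degree of freedom."""
    total = n_control + n_treatment
    exp_c = total * expected_share
    exp_t = total * (1.0 - expected_share)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # For 1 dof, the survival function is P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p_value < alpha

srm_detected(5000, 5000)  # balanced split: no mismatch flagged
srm_detected(5200, 4800)  # 52/48 on 10k users: mismatch flagged
```

A strict alpha (here 0.001) is typical for SRM alarms, since the test runs continuously and a flagged mismatch invalidates the experiment's results regardless of the metric outcomes.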
