SYSTEM

Octopus (Zalando Experimentation Platform)

What it is

Octopus is Zalando's in-house A/B testing / experimentation platform. The first version was released in 2015, named after Paul the Octopus — the octopus famous for predicting match winners with a low error rate during the 2010 FIFA World Cup. The platform's architecture has three parts (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution):

  1. Experiment management — configure, schedule, audit A/B tests from one place.
  2. Experiment execution — randomization engine that assigns users to variants; latency-sensitive for applications like product-detail-page variants.
  3. Experiment analysis — runs statistical tests on the collected tracking events and surfaces results to experimenters.
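The source doesn't describe the randomization engine's internals, but a common low-latency design for the execution step is deterministic hash-based bucketing: no per-user state to look up, and the same user always gets the same variant. A minimal sketch (all names hypothetical):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant bucket.

    Hashing user_id together with experiment_id keeps assignments stable
    within an experiment and statistically independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Same user, same experiment: always the same variant — no storage needed,
# which keeps latency low for pages like the product detail page.
v1 = assign_variant("user-42", "pdp-layout-test", ["control", "treatment"])
v2 = assign_variant("user-42", "pdp-layout-test", ["control", "treatment"])
```

Because assignment is a pure function of the IDs, any service replica can compute it locally, which suits latency-sensitive call sites.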

Why it exists

Before Octopus, A/B tests at Zalando were set up manually by each team. This had two failure modes: (a) test quality could not be guaranteed, and (b) the company could not even tell whether teams actually ran tests before making product decisions. Octopus centralises randomization, analysis methods, and KPI definitions, turning A/B testing into a standard org-wide primitive (see patterns/centralized-experimentation-platform).

Key architectural choice: open-source stats library + production wrapper

The inaugural team (engineers + data scientists, little overlap in domain knowledge: scientists didn't know Scala; engineers didn't know statistics) decoupled their workstreams by building an open-source statistics library that the Scala backend wraps as a production service (see patterns/open-source-wrapped-by-production-system). This let each subgroup iterate in its native language + tooling without blocking the other.

Default statistical method

Octopus runs a two-sided t-test at a 5% significance level by default. Non-inferiority tests and Bayesian methods were identified as improvement areas in peer review (see concepts/non-inferiority-test).
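The source doesn't expose the open-sourced library's API, so here is a minimal Python sketch of the default method: a two-sided Welch t-test, using a normal approximation to the t distribution for the p-value (adequate at typical A/B sample sizes; all names and data hypothetical):

```python
import math
from statistics import mean, variance

def two_sided_t_test(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and a two-sided p-value via the normal
    approximation (close to the exact t distribution for large n)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Two-sided tail probability of a standard normal at |t|.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p

control = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]
treatment = [10.4, 10.6, 10.2, 10.5, 10.3, 10.7, 10.4, 10.6]
t, p = two_sided_t_test(control, treatment)
decision = "reject H0" if p < 0.05 else "fail to reject H0"
```

The 5% threshold corresponds to the `p < 0.05` decision rule above; a non-inferiority test would instead shift the null hypothesis by a margin rather than testing against zero difference.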

Platform features accumulated over time

The analysis-system rewrite

Octopus's initial analysis system hit architectural ceilings when concurrent-A/B-test load grew. Maintenance cost grew so high that the team lost capacity to improve analysis methods. Zalando rebuilt the analysis system on Spark, a project that took ~2 years (see systems/apache-spark; technical details promised in a subsequent post in the series).

Operational numbers

  • 5% — default t-test significance level
  • ~3 weeks — median A/B test runtime at Zalando (higher than industry peers — future work: variance reduction + Bayesian + multi-armed bandit to speed up)
  • 20%+ — historical SRM rate at Zalando before remediation (industry peers: 6–10%); pushed data-tracking-schema re-unification work across the org
  • ~2 years — analysis-system Spark rebuild
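The source doesn't say how Zalando detects SRMs (sample ratio mismatches), but the standard check is a chi-square goodness-of-fit test on assignment counts against the configured split. A minimal two-variant sketch with one degree of freedom (names and the alpha choice are hypothetical):

```python
import math

def srm_detected(n_control: int, n_treatment: int,
                 expected_share: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag a sample ratio mismatch between two variants using a
    chi-square goodness-of-fit test with one degree of freedom."""
    total = n_control + n_treatment
    exp_c = total * expected_share
    exp_t = total * (1.0 - expected_share)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # For 1 dof, the survival function is P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p_value < alpha

srm_detected(5000, 5000)  # balanced split: no mismatch flagged
srm_detected(5200, 4800)  # 52/48 on 10k users: mismatch flagged
```

A strict alpha (here 0.001) is typical for SRM alarms, since the test runs continuously and a flagged mismatch invalidates the experiment's results regardless of the metric outcomes.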
