PATTERN
Centralized experimentation platform¶
Problem¶
Teams run A/B tests ad hoc, each with its own randomization logic, analysis scripts, and KPI definitions. Three failure modes follow:
- Test quality is unverifiable — each team's stats pipeline has its own bugs, assumptions, power calculations, and stopping rules.
- No org-level visibility — leadership has no idea whether a given product decision was made against A/B test data or against a PM's intuition. Individual teams can claim they "A/B-tested" without producing evidence.
- KPIs drift across teams — "conversion rate" means one thing in search and a different thing in checkout; results are not comparable.
Solution¶
Centralise A/B testing as a single org-wide platform that owns:
- Randomization engine (experiment assignment, exposure, logging).
- Analysis methods (statistical tests, multiple-comparisons correction, SRM detection — see concepts/sample-ratio-mismatch).
- KPI definitions + OEC guidance (see concepts/overall-evaluation-criterion).
- Design audit + consultation (see concepts/ab-test-design-audit).
- Rollout primitives (see patterns/controlled-rollout-with-traffic-rampup).
Teams bring the hypothesis and the KPI; the platform provides everything else. The platform becomes the shared contract between teams and between the company's product decisions and its statistics.
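The assignment primitive the platform owns is typically deterministic hash-based bucketing: the same user always lands in the same variant of a given experiment, with no shared mutable state between services. A minimal sketch, assuming SHA-256 bucketing and a two-variant split — the function name, signature, and weights are illustrative, not Octopus's actual engine:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment"),
                   weights=(0.5, 0.5)) -> str:
    """Deterministically assign a user to a variant.

    Hashing (experiment_id, user_id) gives a stable, roughly uniform
    bucket in [0, 1); walking the cumulative weights maps that bucket
    to a variant. Re-calling with the same inputs always returns the
    same answer, so no assignment store is strictly required.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform-ish in [0, 1)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding
```

Salting the hash with the experiment id keeps assignments independent across concurrent experiments, which is exactly the property ad-hoc per-team randomizers tend to get wrong.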
Zalando's applied case¶
systems/octopus-zalando-experimentation-platform was released in 2015 in direct response to exactly the failure modes above: A/B tests were set up by each team individually and manually, so Zalando could "neither ensure A/B test quality, nor know whether product teams actually ran A/B tests before making decisions" (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution). The rest of the source post traces the 2015–2020 journey of making that platform both scalable and trustworthy — the Walk phase of concepts/experimentation-evolution-model-fabijan.
Key design choices that followed¶
- Open-source library wrapped by production system — the architectural primitive that let scientists and engineers collaborate despite the initial domain-knowledge gap (see patterns/open-source-wrapped-by-production-system).
- Standard two-sided t-test at a 5% significance level — a single default method, uniform across teams.
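The source doesn't show Octopus's analysis code, but the value of a single shared default can be sketched. Assuming the online-experiment regime of very large samples (where the t distribution is indistinguishable from the normal), here is a stdlib-only large-sample two-sided test on a difference in means; the function name and signature are illustrative:

```python
from statistics import NormalDist

def two_sided_test(mean_a, var_a, n_a, mean_b, var_b, n_b, alpha=0.05):
    """Large-sample two-sided test for a difference in means.

    At the sample sizes typical of online experiments, this normal
    (z) approximation coincides with the two-sided t-test. Returns
    the p-value and whether the difference is significant at alpha.
    """
    se = (var_a / n_a + var_b / n_b) ** 0.5  # standard error of the difference
    z = (mean_b - mean_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p, p < alpha
```

Making this the *only* default removes per-team variation in method choice — the whole point of the pattern: any two experiment readouts in the org are computed the same way.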
- Analysis-system rebuild on Spark — ~2 years of work when the initial system couldn't handle concurrent-A/B-test load; the critical-path infrastructure investment.
- Automated SRM alerts as the data-quality backstop (see patterns/automated-srm-alert).
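The source doesn't detail the alert implementation. A common approach — assumed here, not confirmed for Octopus — is a chi-square goodness-of-fit test on observed assignment counts against the configured split, alerting below a strict p-value threshold. The names and the 0.001 threshold are illustrative:

```python
from statistics import NormalDist

def srm_pvalue(observed_a: int, observed_b: int,
               expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 df) for a two-arm split."""
    total = observed_a + observed_b
    exp_a = total * expected_ratio
    exp_b = total - exp_a
    chi2 = (observed_a - exp_a) ** 2 / exp_a + (observed_b - exp_b) ** 2 / exp_b
    # For one degree of freedom, P(chi2_1 > x) = 2 * (1 - Phi(sqrt(x))).
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

def srm_alert(observed_a: int, observed_b: int,
              expected_ratio: float = 0.5, threshold: float = 0.001) -> bool:
    """Fire when counts deviate from the configured split beyond chance."""
    return srm_pvalue(observed_a, observed_b, expected_ratio) < threshold
```

The strict threshold reflects the standard SRM practice: an alert means the assignment or logging pipeline is broken, so any result computed from that experiment should be distrusted regardless of its headline p-value.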
When not to centralise¶
Not every experimentation problem belongs on the single A/B platform. Octopus deliberately ships concepts/quasi-experimental-methods guidance for use cases where A/B is infeasible (e.g. country comparisons, where users can't be randomized across markets). Forcing everything into user-split A/B when it isn't the right tool produces bad science wrapped in platform trust.
Seen in¶
- sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution — Octopus, released 2015, direct remediation for ad-hoc A/B
Related¶
- systems/octopus-zalando-experimentation-platform
- concepts/experimentation-evolution-model-fabijan
- concepts/experimentation-culture
- concepts/ab-test-design-audit
- concepts/sample-ratio-mismatch
- concepts/overall-evaluation-criterion
- patterns/open-source-wrapped-by-production-system
- patterns/controlled-rollout-with-traffic-rampup
- patterns/automated-srm-alert