PATTERN

Centralized experimentation platform

Problem

Teams run A/B tests ad hoc, each with its own randomization logic, analysis scripts, and KPI definitions. Three failure modes follow:

  1. Test quality is unverifiable — each team's stats pipeline has its own bugs, assumptions, power calculations, and stopping rules.
  2. No org-level visibility — leadership has no idea whether a given product decision was made against A/B test data or against a PM's intuition. Individual teams can claim they "A/B-tested" without producing evidence.
  3. KPIs drift across teams — "conversion rate" means one thing in search and a different thing in checkout; results are not comparable.

Solution

Centralise A/B testing as a single org-wide platform that owns:

  • Randomization engine (experiment assignment, exposure, logging).
  • Analysis methods (statistical tests, multiple-comparisons correction, SRM detection — see concepts/sample-ratio-mismatch).
  • KPI definitions + OEC guidance (see concepts/overall-evaluation-criterion).
  • Design audit + consultation (see concepts/ab-test-design-audit).
  • Rollout primitives (see patterns/controlled-rollout-with-traffic-rampup).

Teams bring the hypothesis and the KPI; the platform provides everything else. The platform becomes the shared contract between teams and between the company's product decisions and its statistics.
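The randomization engine the platform owns is typically a deterministic hash of (experiment, user): the same user always lands in the same bucket within an experiment, and assignments are independent across experiments. A minimal sketch of this primitive — function names and the 50/50 split are illustrative, not Octopus's actual API:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    # Hash (experiment, user) so assignment is stable for a user within
    # an experiment and uncorrelated across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Because assignment is a pure function of its inputs, any service can compute it locally; the platform's job is then to log the exposure event at assignment time, so the analysis side can verify the realized split against the design (the SRM check below).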

Zalando's applied case

systems/octopus-zalando-experimentation-platform was released in 2015 in response to exactly the failure modes above: each team set up A/B tests individually and manually, so Zalando could "neither ensure A/B test quality, nor know whether product teams actually ran A/B tests before making decisions" (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution). The rest of the source traces the 2015–2020 journey of making that platform both scalable and trustworthy — the Walk phase of concepts/experimentation-evolution-model-fabijan.

Key design choices that followed

  • Open-source library wrapped by production system — the architectural primitive that let scientists and engineers collaborate despite the initial domain-knowledge gap (see patterns/open-source-wrapped-by-production-system).
  • Standard two-sided t-test with 5% significance — single default method, uniform across teams.
  • Analysis-system rebuild on Spark — ~2 years of work when the initial system couldn't handle concurrent-A/B-test load; the critical-path infrastructure investment.
  • Automated SRM alerts as the data-quality backstop (see patterns/automated-srm-alert).

When not to centralise

Not every experimentation problem belongs on the single A/B platform. Octopus deliberately ships concepts/quasi-experimental-methods guidance for use cases where A/B is infeasible (e.g., country comparisons). Forcing everything into user-split A/B when it isn't the right tool produces bad science wrapped in platform trust.
