PATTERN

Centralized experimentation platform

Problem

Teams run A/B tests ad hoc, each with its own randomization logic, analysis scripts, and KPI definitions. Three failure modes follow:

  1. Test quality is unverifiable — each team's stats pipeline has its own bugs, assumptions, power calculations, and stopping rules.
  2. No org-level visibility — leadership has no idea whether a given product decision was made against A/B test data or against a PM's intuition. Individual teams can claim they "A/B-tested" without producing evidence.
  3. KPIs drift across teams — "conversion rate" means one thing in search and a different thing in checkout; results are not comparable.

Solution

Centralise A/B testing as a single org-wide platform that owns:

  • Randomization engine (experiment assignment, exposure, logging).
  • Analysis methods (statistical tests, multiple-comparisons correction, SRM detection — see concepts/sample-ratio-mismatch).
  • KPI definitions + OEC guidance (see concepts/overall-evaluation-criterion).
  • Design audit + consultation (see concepts/ab-test-design-audit).
  • Rollout primitives (see patterns/controlled-rollout-with-traffic-rampup).

Teams bring the hypothesis and the KPI; the platform provides everything else. The platform becomes the shared contract between teams and between the company's product decisions and its statistics.
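The randomization engine the platform owns is typically a deterministic hash of (experiment, user): the same user always lands in the same bucket within an experiment, and assignments are independent across experiments. A minimal sketch of this primitive — function names and the 50/50 split are illustrative, not Octopus's actual API:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    # Hash (experiment, user) so assignment is stable for a user within
    # an experiment and uncorrelated across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Because assignment is a pure function of its inputs, any service can compute it locally; the platform's job is then to log the exposure event at assignment time, so the analysis side can verify the realized split against the design (the SRM check below).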

Zalando's applied case

systems/octopus-zalando-experimentation-platform was released in 2015 in response to exactly the failure modes above: each team set up A/B tests individually and manually, so Zalando could "neither ensure A/B test quality, nor know whether product teams actually ran A/B tests before making decisions" (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution). The rest of the source traces the 2015–2020 journey of making that platform both scalable and trustworthy — the Walk phase of concepts/experimentation-evolution-model-fabijan.

Key design choices that followed

  • Open-source library wrapped by production system — the architectural primitive that let scientists and engineers collaborate despite the initial domain-knowledge gap (see patterns/open-source-wrapped-by-production-system).
  • Standard two-sided t-test with 5% significance — single default method, uniform across teams.
  • Analysis-system rebuild on Spark — ~2 years of work when the initial system couldn't handle concurrent-A/B-test load; the critical-path infrastructure investment.
  • Automated SRM alerts as the data-quality backstop (see patterns/automated-srm-alert).

When not to centralise

Not every experimentation problem belongs on the single A/B platform. Octopus deliberately ships concepts/quasi-experimental-methods guidance for use cases where A/B is infeasible (e.g., country comparisons). Forcing everything into user-split A/B when it isn't the right tool produces bad science wrapped in platform trust.
