
A/B test design audit

Definition

An A/B test design audit is a quality-review process applied to an A/B test before it goes live, checking that the test design meets agreed-upon trustworthiness criteria. It is the experimentation-platform analogue of code review for experiments: a gate between "I want to run this test" and "my test is collecting results against the production user base."

Why it exists

A well-instrumented platform can still produce untrustworthy results if the test itself was designed badly:

  • Hypothesis is not testable.
  • Problem statement is vague, so the KPI choice is arbitrary.
  • Outcome KPI is noise-dominated at feasible sample sizes.
  • Stopping criteria are retrofitted ("we'll stop when we like what we see").
  • Runtime is too short to detect the claimed effect size.

Zalando's Walk-phase observation (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution) was that A/B tests varied widely in quality across teams, which motivated a standardized audit.

Audit dimensions (Zalando's list)

Each of the following must be reviewed before an A/B test is approved on Octopus:

  • Testable hypothesis — a falsifiable statement, not a vague goal.
  • Clear problem statement — what is being changed, for whom, why.
  • Clear outcome KPI — pre-committed, documented, measurable (see concepts/overall-evaluation-criterion).
  • A/B test runtime — computed from power analysis and expected effect size, not picked arbitrarily.
  • Stopping criteria based on the plan — no ad-hoc peeking; no "stop when significant" without multiple-comparisons correction.
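The runtime criterion can be made concrete with a standard power calculation. Below is a minimal sketch assuming a two-sided, two-proportion z-test with a 50/50 traffic split; the baseline rate, minimum detectable effect, and daily-traffic figures are illustrative and not from the source.

```python
import math
from statistics import NormalDist

def required_runtime_days(baseline_rate, mde_rel, daily_visitors,
                          alpha=0.05, power=0.80):
    """Days needed for a two-sided two-proportion z-test to detect a
    relative lift of `mde_rel` over `baseline_rate` at the given alpha
    and power, splitting `daily_visitors` 50/50 across two variants."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_group = math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
    days = math.ceil(2 * n_per_group / daily_visitors)
    return n_per_group, days

# 5% baseline conversion, 5% relative lift, 20,000 visitors/day
n, days = required_runtime_days(0.05, 0.05, 20_000)
```

Even a modest 5% relative lift on a 5% baseline demands a six-figure sample per group (roughly two weeks at this traffic level), which is why the audit insists runtime be computed from power analysis rather than picked arbitrarily.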

Operational mechanics

Zalando supplements the audit with:

  • Weekly consultation hours where teams can bring proposals in early and iterate with the experimentation team.
  • Internal blogs sharing tips on effective A/B testing (reinforcing the same criteria at the IC level, so teams self-audit before formal review).
  • Peer review of analysis methods with applied scientists from other teams — an audit on the platform's methods, not just individual tests.

Contrast with automated data-quality checks

The audit is about design-time trustworthiness — will this test, if run perfectly, produce a meaningful result? Data-quality checks (such as concepts/sample-ratio-mismatch detection) are the runtime counterpart. Both are required; neither substitutes for the other.
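To illustrate the runtime counterpart: a sample-ratio-mismatch check is a chi-square goodness-of-fit test on assignment counts. A minimal sketch for an expected 50/50 split follows, exploiting the fact that a 1-degree-of-freedom chi-square statistic is a squared standard normal; the p-value threshold and the counts are illustrative assumptions, not values from the source.

```python
import math
from statistics import NormalDist

def srm_check(control_n, treatment_n, p_threshold=0.001):
    """Chi-square goodness-of-fit test of observed assignment counts
    against an expected 50/50 split. Returns (p_value, flagged).
    With 1 degree of freedom, chi-square is a squared standard normal,
    so the p-value follows directly from the normal CDF."""
    total = control_n + treatment_n
    expected = total / 2
    chi2 = ((control_n - expected) ** 2 + (treatment_n - expected) ** 2) / expected
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
    return p_value, p_value < p_threshold

# A 1.5% imbalance on 100k+ users is flagged; small jitter is not.
p, flagged = srm_check(50_000, 51_500)
```

The conservative threshold (0.001 rather than 0.05) is the usual choice for SRM detection, since the check runs continuously and a false alarm invalidates an otherwise healthy test.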
