
A/B test design audit

Definition

An A/B test design audit is a quality-review process applied to an A/B test before it goes live, checking that the test design meets agreed-upon trustworthiness criteria. It is the experimentation-platform analogue of code review for experiments: a gate between "I want to run this test" and "my test is collecting results against the production user base."

Why it exists

A well-instrumented platform can still produce untrustworthy results if the test itself was designed badly:

  • Hypothesis is not testable.
  • Problem statement is vague, so the KPI choice is arbitrary.
  • Outcome KPI is noise-dominated at feasible sample sizes.
  • Stopping criteria are retrofitted ("we'll stop when we like what we see").
  • Runtime is too short to detect the claimed effect size.

Zalando's Walk-phase observation (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution) was that A/B tests varied widely in quality across teams, which motivated a standardized audit.

Audit dimensions (Zalando's list)

Each of the following must be reviewed before an A/B test is approved on Octopus:

  • Testable hypothesis — a falsifiable statement, not a vague goal.
  • Clear problem statement — what is being changed, for whom, why.
  • Clear outcome KPI — pre-committed, documented, measurable (see concepts/overall-evaluation-criterion).
  • A/B test runtime — computed from power analysis and expected effect size, not picked arbitrarily.
  • Stopping criteria based on the plan — no ad-hoc peeking; no "stop when significant" without multiple-comparisons correction.
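The runtime criterion can be made concrete with a standard power calculation. Below is a minimal sketch assuming a two-sided, two-proportion z-test with a 50/50 traffic split; the baseline rate, minimum detectable effect, and daily-traffic figures are illustrative and not from the source.

```python
import math
from statistics import NormalDist

def required_runtime_days(baseline_rate, mde_rel, daily_visitors,
                          alpha=0.05, power=0.80):
    """Days needed for a two-sided two-proportion z-test to detect a
    relative lift of `mde_rel` over `baseline_rate` at the given alpha
    and power, splitting `daily_visitors` 50/50 across two variants."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_group = math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
    days = math.ceil(2 * n_per_group / daily_visitors)
    return n_per_group, days

# 5% baseline conversion, 5% relative lift, 20,000 visitors/day
n, days = required_runtime_days(0.05, 0.05, 20_000)
```

Even a modest 5% relative lift on a 5% baseline demands a six-figure sample per group (roughly two weeks at this traffic level), which is why the audit insists runtime be computed from power analysis rather than picked arbitrarily.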

Operational mechanics

Zalando supplements the audit with:

  • Weekly consultation hours where teams can bring proposals in early and iterate with the experimentation team.
  • Internal blogs sharing tips on effective A/B testing (reinforcing the same criteria at the IC level, so teams self-audit before formal review).
  • Peer review of analysis methods with applied scientists from other teams — an audit on the platform's methods, not just individual tests.

Contrast with automated data-quality checks

The audit is about design-time trustworthiness — will this test, if run perfectly, produce a meaningful result? Data-quality checks (such as concepts/sample-ratio-mismatch detection) are the runtime counterpart. Both are required; neither substitutes for the other.
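To illustrate the runtime counterpart: a sample-ratio-mismatch check is a chi-square goodness-of-fit test on assignment counts. A minimal sketch for an expected 50/50 split follows, exploiting the fact that a 1-degree-of-freedom chi-square statistic is a squared standard normal; the p-value threshold and the counts are illustrative assumptions, not values from the source.

```python
import math
from statistics import NormalDist

def srm_check(control_n, treatment_n, p_threshold=0.001):
    """Chi-square goodness-of-fit test of observed assignment counts
    against an expected 50/50 split. Returns (p_value, flagged).
    With 1 degree of freedom, chi-square is a squared standard normal,
    so the p-value follows directly from the normal CDF."""
    total = control_n + treatment_n
    expected = total / 2
    chi2 = ((control_n - expected) ** 2 + (treatment_n - expected) ** 2) / expected
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
    return p_value, p_value < p_threshold

# A 1.5% imbalance on 100k+ users is flagged; small jitter is not.
p, flagged = srm_check(50_000, 51_500)
```

The conservative threshold (0.001 rather than 0.05) is the usual choice for SRM detection, since the check runs continuously and a false alarm invalidates an otherwise healthy test.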
