A/B test design audit
Definition
An A/B test design audit is a quality-review process applied to an A/B test before it goes live, checking that the test design meets agreed-on trustworthiness criteria. It is the experimentation-platform analogue of code review: a gate between "I want to run this test" and "my test is collecting results against the production user base."
Why it exists
A well-instrumented platform can still produce untrustworthy results if the test itself was designed badly:
- The hypothesis is not testable.
- The problem statement is vague, so the KPI choice is arbitrary.
- The outcome KPI is noise-dominated at feasible sample sizes.
- Stopping criteria are retrofitted ("we'll stop when we like what we see").
- The runtime is too short to detect the claimed effect size.
Zalando's Walk-phase observation (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution) was that the quality of A/B tests varied widely across teams, which motivated a standard audit.
Audit dimensions (Zalando's list)
Each of the following must be reviewed before an A/B test is approved on Octopus:
- Testable hypothesis — a falsifiable statement, not a vague goal.
- Clear problem statement — what is being changed, for whom, why.
- Clear outcome KPI — pre-committed, documented, measurable (see concepts/overall-evaluation-criterion).
- A/B test runtime — computed from power analysis and expected effect size, not picked arbitrarily.
- Stopping criteria based on the plan — no ad-hoc peeking; no "stop when significant" without multiple-comparisons correction.
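The runtime dimension above can be made concrete with a standard power analysis. A minimal sketch of the usual normal-approximation sample-size calculation for a two-proportion test; the baseline rate, minimum detectable effect, and daily-traffic figures are hypothetical, not Zalando's:

```python
import math
from statistics import NormalDist

def required_sample_size_per_arm(p_baseline: float, mde_abs: float,
                                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Sample size per arm for a two-sided two-proportion z-test
    (normal approximation)."""
    p_variant = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Hypothetical figures: 3% baseline conversion, +0.3pp minimum detectable effect.
n_per_arm = required_sample_size_per_arm(0.03, 0.003)
# Runtime follows from traffic, not from a gut feeling:
daily_users_per_arm = 5_000
runtime_days = math.ceil(n_per_arm / daily_users_per_arm)
```

The point of the audit criterion is the direction of the arrow: runtime is derived from the pre-committed effect size and power target, never picked first and justified afterwards.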
Operational mechanics
Zalando supplements the audit with:
- Weekly consultation hours where teams can bring proposals in early and iterate with the experimentation team.
- Internal blogs sharing tips on effective A/B testing (reinforcing the same criteria at the IC level, so teams self-audit before formal review).
- Peer review of analysis methods with applied scientists from other teams — an audit of the platform's methods, not just individual tests.
Contrast with automated data-quality checks
The audit is about design-time trustworthiness — will this test, if run perfectly, produce a meaningful result? Data-quality checks (such as concepts/sample-ratio-mismatch detection) are the runtime counterpart. Both are required; neither substitutes for the other.
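As an illustration of the runtime counterpart, a sample-ratio-mismatch check is commonly implemented as a chi-square goodness-of-fit test on assignment counts. A stdlib-only sketch (the traffic counts are hypothetical):

```python
import math

def srm_p_value(control_n: int, treatment_n: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) against the
    planned traffic split. Very small p-values flag a sample ratio mismatch."""
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For 1 df, the chi-square survival function reduces to erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A 50.4/49.6 split over a million users already yields a tiny p-value:
p = srm_p_value(504_000, 496_000)
```

No amount of design-time auditing would catch this: the design can be flawless while the randomization pipeline silently drops users from one arm, which is why both gates are needed.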
Seen in
- sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution — 5-dimension audit + weekly consultation hours as Walk-phase trustworthiness move