CONCEPT (cited by 1 source)
# Filter before A/B test

## Definition
Filter before A/B test is the experimentation-workflow position in which a cheap pre-filter — typically back-testing, simulation, or offline evaluation — is run before committing a candidate to a live A/B test. The cheap filter eliminates candidates that are obvious non-starters; A/B testing validates the survivors on live traffic.
The workflow has two phases with different purposes:
- Discovery (back-test / simulation) — fast, cheap, wide exploration of the candidate space. False positives are tolerable because the A/B phase will catch them.
- Validation (A/B test) — slow, expensive, narrow confirmation on live traffic. High-rigour statistical decisions, possibly with percentile guardrails (see patterns/ab-test-rollout).
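The two-phase shape can be sketched as a simple pipeline. This is a minimal illustration, not Yelp's actual API: the candidate names, the offline scores, and `filter_then_ab` are all hypothetical.

```python
def filter_then_ab(candidates, backtest_score, shortlist_size=3):
    """Discovery phase: rank candidates by a cheap offline score and
    keep only a short shortlist for the expensive A/B validation phase."""
    ranked = sorted(candidates, key=backtest_score, reverse=True)
    # Only the survivors earn live A/B traffic.
    return ranked[:shortlist_size]

# Usage: ten hypothetical ranking tweaks with made-up offline lift scores.
candidates = [f"variant-{i}" for i in range(10)]
offline_lift = {f"variant-{i}": (i * 7) % 10 / 100 for i in range(10)}
shortlist = filter_then_ab(candidates, offline_lift.get, shortlist_size=3)
# shortlist now holds the three candidates with the highest offline scores;
# everything else is eliminated without spending any live traffic.
```

False positives in the shortlist are acceptable by design: the validation phase exists to catch them.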
## Why the split exists
A/B tests are expensive:
- Opportunity cost — traffic you assign to a treatment can't be assigned to another experiment.
- Time cost — experiments must run long enough to reach statistical significance, which can take weeks for small effects.
- Risk cost — mistakes affect real users / real money.
- Sample-size constraints — if randomisation must happen at a unit coarser than the individual user (advertiser, shop, city), the available sample sizes shrink sharply.
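The time cost is easy to quantify with the standard normal-approximation sample-size formula for a two-proportion test. The numbers below are illustrative, not from the source:

```python
from statistics import NormalDist

def samples_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to detect an absolute lift `mde`
    over a baseline conversion rate `p_base` in a two-arm A/B test
    (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = p_base + mde / 2  # average rate across the two arms
    return 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / mde ** 2

# A 0.2-percentage-point lift on a 2% baseline needs tens of thousands
# of users per arm, which is why small effects take weeks of traffic.
n = samples_per_arm(p_base=0.02, mde=0.002)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is exactly the regime where a cheap pre-filter pays for itself.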
Back-testing flips the cost profile: you can evaluate many candidates cheaply against historical data in hours, eliminate the clearly worse ones, and spend A/B capacity on the survivors.
## Yelp's instance
Yelp's Back-Testing Engine (2026-02-02) is positioned explicitly as the discovery phase upstream of A/B testing. Verbatim from the post:
"Instead of relying solely on A/B tests, we can affordably and safely simulate a wide range of changes using historical data. This allows us to quickly filter out less ideal candidates and focus A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery."
The pattern is not to replace A/B testing — it is to re-allocate A/B capacity to validated candidates.
## Properties
- The filter doesn't have to be accurate in absolute terms. It only has to rank candidates well enough that the true winners land in its shortlist: recall (not missing a winner) matters more than precision (admitting some losers), because the A/B phase disposes of the losers anyway.
- The A/B step remains the source of truth. If the filter and the A/B test disagree, the A/B test wins. This is the mitigation for concepts/overfitting-to-historical-data: even a filter that overfits to history is safe as long as its false positives lose in A/B.
- Feedback from A/B can improve the filter. If back-tests consistently predict bigger wins than A/B tests show, recalibrate the back-testing model (e.g. retrain the counterfactual-outcome predictor). Yelp doesn't describe this closed loop explicitly but the architecture supports it.
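Two of these properties are checkable with a few lines. The helper names below are hypothetical, and the calibration check is a sketch of the closed loop the architecture supports, not something Yelp describes:

```python
def shortlist_recall(backtest_rank, true_rank, k=2, m=3):
    """The property that actually matters: what fraction of the true
    top-k candidates survive a filter that keeps the back-test's top-m?"""
    shortlist = set(backtest_rank[:m])
    winners = set(true_rank[:k])
    return len(winners & shortlist) / len(winners)

def calibration_ratio(predicted_lifts, observed_lifts):
    """Compare back-test predictions against A/B outcomes for past
    candidates. A ratio well below 1 means the back-test consistently
    over-predicts and the offline model should be recalibrated."""
    return sum(observed_lifts) / sum(predicted_lifts)

# Even with imperfect ordering, the true winners can all make the shortlist.
recall = shortlist_recall(["B", "A", "C", "D", "E"],
                          ["B", "A", "E", "C", "D"], k=2, m=3)
```

A filter with perfect shortlist recall but mediocre calibration is still doing its job; a filter with poor recall silently discards winners that no amount of A/B rigour can recover.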
## Applicability beyond ad-tech
The pattern generalises to any domain with expensive live experimentation:
- Recommendation systems — offline replay via off-policy evaluation of logged bandit feedback (IPS, doubly robust) as pre-filter to A/B.
- Search ranking — offline NDCG / MRR on held-out queries as pre-filter to interleaving A/B.
- Pricing / auction design — simulation with learned demand models as pre-filter to live A/B.
- System configuration — replay traffic in a staging cluster before canarying in production.
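The recommendation-system variant is the most standardised of these. A minimal inverse propensity scoring (IPS) estimator, sketched here with a hypothetical log format of `(context, action, reward, logging_propensity)` tuples:

```python
def ips_estimate(logs, target_policy):
    """Estimate the value of `target_policy` from interaction logs
    gathered under a different (logging) policy, without a live test.
    `target_policy(context, action)` returns the probability the target
    policy would take `action` in `context`."""
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) often
        # the target policy would have taken the logged action.
        total += reward * target_policy(context, action) / propensity
    return total / len(logs)

# Usage: uniform logging policy over actions "a"/"b" (propensity 0.5),
# evaluated for a target policy that always picks "a".
always_a = lambda context, action: 1.0 if action == "a" else 0.0
logs = [(None, "a", 1.0, 0.5), (None, "b", 0.0, 0.5),
        (None, "a", 1.0, 0.5), (None, "b", 1.0, 0.5)]
value = ips_estimate(logs, always_a)  # unbiased estimate of always-a's reward
```

Like any back-test, IPS is only a filter: its variance explodes when the target policy diverges from the logging policy, which is precisely why the A/B phase stays the source of truth.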
## Relation to other patterns
- patterns/ab-test-rollout — the validation phase this concept feeds into; often paired with percentile guardrails.
- patterns/snapshot-replay-agent-evaluation — a sibling filter-before-deploy shape in agent engineering: replay historical agent inputs against new code to filter candidates before human eval.
## Seen in
- sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation — canonical wiki instance. The whole Back-Testing Engine exists to enable this workflow.