
Filter before A/B test

Definition

Filter before A/B test is the experimentation-workflow position in which a cheap pre-filter — typically back-testing, simulation, or offline evaluation — is run before committing a candidate to a live A/B test. The cheap filter eliminates candidates that are obvious non-starters; A/B testing validates the survivors on live traffic.

The workflow has two phases with different purposes:

  • Discovery (back-test / simulation) — fast, cheap, wide exploration of the candidate space. False positives are tolerable because the A/B phase will catch them.
  • Validation (A/B test) — slow, expensive, narrow confirmation on live traffic. High-rigour statistical decisions, possibly with percentile guardrails (see patterns/ab-test-rollout).
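The two-phase funnel can be sketched in a few lines; the function names, threshold, and scorer here are illustrative, not any real system's API:

```python
# Two-phase funnel: cheap back-test filter, then live A/B validation.
# All names are illustrative.

def filter_candidates(candidates, backtest_lift, threshold=0.0):
    """Discovery phase: keep candidates whose simulated lift clears the bar.

    False positives are acceptable -- the A/B phase will catch them.
    """
    return [c for c in candidates if backtest_lift(c) > threshold]

def run_funnel(candidates, backtest_lift, ab_test, threshold=0.0):
    """Validation phase: spend A/B capacity only on back-test survivors."""
    shortlist = filter_candidates(candidates, backtest_lift, threshold)
    return {c: ab_test(c) for c in shortlist}

# Toy usage: three candidates, two survive the back-test filter.
lifts = {"a": 0.04, "b": -0.02, "c": 0.01}
survivors = filter_candidates(lifts, lifts.get)
# survivors == ["a", "c"]; only these consume live A/B traffic.
```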

Why the split exists

A/B tests are expensive:

  • Opportunity cost — traffic you assign to a treatment can't be assigned to another experiment.
  • Time cost — experiments need to run until statistical significance; weeks for small effects.
  • Risk cost — mistakes affect real users / real money.
  • Sample-size constraints — if randomisation must be at a unit coarser than the user (advertiser, shop, city), the sample sizes shrink.

Back-testing flips the cost profile: you can evaluate many candidates cheaply against historical data in hours, eliminate clearly-worse ones, and spend A/B capacity on the survivors.
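The time cost above is easy to quantify with a standard two-proportion power calculation; a stdlib-only sketch (the base rate, lift, and traffic figures are made-up illustrations):

```python
from statistics import NormalDist

def required_n_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Sample size per arm for a two-proportion test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # power threshold
    p1, p2 = p_base, p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    delta = p2 - p1
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / delta ** 2
    return int(n) + 1

# Detecting a 2% relative lift on a 5% base conversion rate needs
# roughly 750k users per arm -- at, say, 50k eligible users per day
# per arm, that is weeks of runtime for a single candidate.
n = required_n_per_arm(0.05, 0.02)
```

A back-test, by contrast, scores every candidate against the same fixed historical dataset, so its cost grows with the number of candidates rather than with effect size.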

Yelp's instance

Yelp's Back-Testing Engine (2026-02-02) is positioned explicitly as the discovery phase upstream of A/B testing. Verbatim from the post:

"Instead of relying solely on A/B tests, we can affordably and safely simulate a wide range of changes using historical data. This allows us to quickly filter out less ideal candidates and focus A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery."

The pattern is not to replace A/B testing — it is to re-allocate A/B capacity to validated candidates.

Properties

  • The filter doesn't have to be accurate in absolute terms — it just has to rank-order candidates similarly enough to true performance that the genuine winners survive into the shortlist. Precision matters less than recall: passing a few losers through to A/B is cheap, but discarding a true winner is not.
  • The A/B step remains the source of truth. If the filter and the A/B disagree, the A/B wins. This is the mitigation for concepts/overfitting-to-historical-data: even a filter that overfits to history is safe as long as its false positives lose in A/B.
  • Feedback from A/B can improve the filter. If back-tests consistently predict bigger wins than A/B tests show, recalibrate the back-testing model (e.g. retrain the counterfactual-outcome predictor). Yelp doesn't describe this closed loop explicitly but the architecture supports it.
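The rank-ordering and recalibration properties above suggest two concrete checks; a minimal sketch with made-up lift numbers (the candidate names and data are illustrative):

```python
def rank_agreement(predicted, observed):
    """Fraction of candidate pairs ordered the same way by both signals.

    The filter only needs to rank candidates like the A/B test does,
    so pairwise agreement is a more relevant check than absolute error.
    """
    keys = list(predicted)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    agree = sum(
        (predicted[a] - predicted[b]) * (observed[a] - observed[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

def calibration_bias(predicted, observed):
    """Mean (predicted - observed) lift: positive => back-test over-promises."""
    return sum(predicted[k] - observed[k] for k in predicted) / len(predicted)

# Back-test predicts bigger wins than A/B delivers, but in the right order:
pred = {"a": 0.05, "b": 0.03, "c": 0.01}
obs  = {"a": 0.02, "b": 0.012, "c": 0.004}
# rank_agreement(pred, obs) == 1.0 -> the filter still ranks correctly;
# calibration_bias(pred, obs) > 0 -> recalibrate the back-test model.
```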

Applicability beyond ad-tech

The pattern generalises to any domain with expensive live experimentation:

  • Recommendation systems — offline replay / off-policy evaluation on logged bandit data (IPS, doubly robust estimators) as pre-filter to A/B.
  • Search ranking — offline NDCG / MRR on held-out queries as pre-filter to interleaving A/B.
  • Pricing / auction design — simulation with learned demand models as pre-filter to live A/B.
  • System configuration — replay traffic in a staging cluster before canarying in production.
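For the recommendation-system case, the pre-filter is typically an off-policy estimator. A minimal inverse-propensity-scoring (IPS) sketch — the log format and toy data are illustrative:

```python
def ips_value(logged, target_policy):
    """Estimate the target policy's average reward from logged bandit data.

    Each log entry: (context, action, reward, logging_propensity).
    target_policy(context, action) -> probability the new policy picks action.
    """
    total = 0.0
    for context, action, reward, prop in logged:
        weight = target_policy(context, action) / prop  # importance weight
        total += weight * reward
    return total / len(logged)

# Toy log: uniform logging policy over 2 actions (propensity 0.5 each).
logs = [
    ("u1", 0, 1.0, 0.5),
    ("u1", 1, 0.0, 0.5),
    ("u2", 0, 1.0, 0.5),
    ("u2", 1, 0.0, 0.5),
]
# Deterministic target policy that always picks action 0:
always_0 = lambda ctx, a: 1.0 if a == 0 else 0.0
v = ips_value(logs, always_0)
# v == 1.0, since action 0 always earned reward 1 in the log.
```

Candidates whose estimated value clears a bar graduate to the live A/B test; everything else is filtered out offline.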

