
Filter before A/B test

Definition

Filter before A/B test is the experimentation-workflow position in which a cheap pre-filter — typically back-testing, simulation, or offline evaluation — is run before committing a candidate to a live A/B test. The cheap filter eliminates candidates that are obvious non-starters; A/B testing validates the survivors on live traffic.

The workflow has two phases with different purposes:

  • Discovery (back-test / simulation) — fast, cheap, wide exploration of the candidate space. False positives are tolerable because the A/B phase will catch them.
  • Validation (A/B test) — slow, expensive, narrow confirmation on live traffic. High-rigour statistical decisions, possibly with percentile guardrails (see patterns/ab-test-rollout).
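The two-phase funnel can be sketched in a few lines; the function names, threshold, and scorer here are illustrative, not any real system's API:

```python
# Two-phase funnel: cheap back-test filter, then live A/B validation.
# All names are illustrative.

def filter_candidates(candidates, backtest_lift, threshold=0.0):
    """Discovery phase: keep candidates whose simulated lift clears the bar.

    False positives are acceptable -- the A/B phase will catch them.
    """
    return [c for c in candidates if backtest_lift(c) > threshold]

def run_funnel(candidates, backtest_lift, ab_test, threshold=0.0):
    """Validation phase: spend A/B capacity only on back-test survivors."""
    shortlist = filter_candidates(candidates, backtest_lift, threshold)
    return {c: ab_test(c) for c in shortlist}

# Toy usage: three candidates, two survive the back-test filter.
lifts = {"a": 0.04, "b": -0.02, "c": 0.01}
survivors = filter_candidates(lifts, lifts.get)
# survivors == ["a", "c"]; only these consume live A/B traffic.
```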

Why the split exists

A/B tests are expensive:

  • Opportunity cost — traffic you assign to a treatment can't be assigned to another experiment.
  • Time cost — experiments need to run until statistical significance; weeks for small effects.
  • Risk cost — mistakes affect real users / real money.
  • Sample-size constraints — if randomisation must be at a unit coarser than the user (advertiser, shop, city), the sample sizes shrink.

Back-testing flips the cost profile: you can evaluate many candidates cheaply against historical data in hours, eliminate clearly-worse ones, and spend A/B capacity on the survivors.
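The time cost above is easy to quantify with a standard two-proportion power calculation; a stdlib-only sketch (the base rate, lift, and traffic figures are made-up illustrations):

```python
from statistics import NormalDist

def required_n_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Sample size per arm for a two-proportion test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # power threshold
    p1, p2 = p_base, p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    delta = p2 - p1
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / delta ** 2
    return int(n) + 1

# Detecting a 2% relative lift on a 5% base conversion rate needs
# roughly 750k users per arm -- at, say, 50k eligible users per day
# per arm, that is weeks of runtime for a single candidate.
n = required_n_per_arm(0.05, 0.02)
```

A back-test, by contrast, scores every candidate against the same fixed historical dataset, so its cost grows with the number of candidates rather than with effect size.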

Yelp's instance

Yelp's Back-Testing Engine (2026-02-02) is positioned explicitly as the discovery phase upstream of A/B testing. Verbatim from the post:

"Instead of relying solely on A/B tests, we can affordably and safely simulate a wide range of changes using historical data. This allows us to quickly filter out less ideal candidates and focus A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery."

The pattern is not to replace A/B testing — it is to re-allocate A/B capacity to validated candidates.

Properties

  • The filter doesn't have to be accurate in absolute terms — it just has to rank-order candidates similarly enough to true performance that the genuine winners survive into the shortlist. Precision matters less than recall: passing a few losers through to A/B is cheap, but discarding a true winner is not.
  • The A/B step remains the source of truth. If the filter and the A/B disagree, the A/B wins. This is the mitigation for concepts/overfitting-to-historical-data: even a filter that overfits to history is safe as long as its false positives lose in A/B.
  • Feedback from A/B can improve the filter. If back-tests consistently predict bigger wins than A/B tests show, recalibrate the back-testing model (e.g. retrain the counterfactual-outcome predictor). Yelp doesn't describe this closed loop explicitly but the architecture supports it.
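The rank-ordering and recalibration properties above suggest two concrete checks; a minimal sketch with made-up lift numbers (the candidate names and data are illustrative):

```python
def rank_agreement(predicted, observed):
    """Fraction of candidate pairs ordered the same way by both signals.

    The filter only needs to rank candidates like the A/B test does,
    so pairwise agreement is a more relevant check than absolute error.
    """
    keys = list(predicted)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    agree = sum(
        (predicted[a] - predicted[b]) * (observed[a] - observed[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

def calibration_bias(predicted, observed):
    """Mean (predicted - observed) lift: positive => back-test over-promises."""
    return sum(predicted[k] - observed[k] for k in predicted) / len(predicted)

# Back-test predicts bigger wins than A/B delivers, but in the right order:
pred = {"a": 0.05, "b": 0.03, "c": 0.01}
obs  = {"a": 0.02, "b": 0.012, "c": 0.004}
# rank_agreement(pred, obs) == 1.0 -> the filter still ranks correctly;
# calibration_bias(pred, obs) > 0 -> recalibrate the back-test model.
```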

Applicability beyond ad-tech

The pattern generalises to any domain with expensive live experimentation:

  • Recommendation systems — offline replay / off-policy evaluation on logged bandit data (IPS, doubly robust estimators) as pre-filter to A/B.
  • Search ranking — offline NDCG / MRR on held-out queries as pre-filter to interleaving A/B.
  • Pricing / auction design — simulation with learned demand models as pre-filter to live A/B.
  • System configuration — replay traffic in a staging cluster before canarying in production.
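For the recommendation-system case, the pre-filter is typically an off-policy estimator. A minimal inverse-propensity-scoring (IPS) sketch — the log format and toy data are illustrative:

```python
def ips_value(logged, target_policy):
    """Estimate the target policy's average reward from logged bandit data.

    Each log entry: (context, action, reward, logging_propensity).
    target_policy(context, action) -> probability the new policy picks action.
    """
    total = 0.0
    for context, action, reward, prop in logged:
        weight = target_policy(context, action) / prop  # importance weight
        total += weight * reward
    return total / len(logged)

# Toy log: uniform logging policy over 2 actions (propensity 0.5 each).
logs = [
    ("u1", 0, 1.0, 0.5),
    ("u1", 1, 0.0, 0.5),
    ("u2", 0, 1.0, 0.5),
    ("u2", 1, 0.0, 0.5),
]
# Deterministic target policy that always picks action 0:
always_0 = lambda ctx, a: 1.0 if a == 0 else 0.0
v = ips_value(logs, always_0)
# v == 1.0, since action 0 always earned reward 1 in the log.
```

Candidates whose estimated value clears a bar graduate to the live A/B test; everything else is filtered out offline.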

