Yelp — How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation¶
Summary¶
Yelp Engineering post (2026-02-02) describing the Back-Testing Engine their Ad Budget Allocation team built to simulate proposed algorithm changes against historical campaign data before committing to full A/B tests. The system is a hybrid of pure back-testing (replay historical data through alternative code paths) and simulation (predict counterfactual outcomes via ML models, since the outcomes of a modified allocation never actually occurred). It is not a generic benchmark harness — it reuses the production Budgeting and Billing repositories via Git Submodules so simulations exercise the exact code that runs in production (just possibly on a different branch).
The motivating pain is specific to ad-tech: A/B tests at the advertiser (not user) level yield small sample sizes, monthly budget cycles force multi-week wait times, and mistakes hit real advertiser money. Back-testing collapses the discovery loop from ~1 month to hours and lets A/B tests focus on validating already-promising candidates.
The architecture is eight components, disclosed by name:
- Parameter search space — a YAML file declares the date range, experiment name, and per-parameter search space.
- Optimizer — systems/scikit-optimize for Bayesian search; grid search and listed search also supported.
- Candidate — a `{param: value}` dict the optimizer yields.
- Production repositories — Budgeting and Billing code pulled in as Git Submodules, pointing at whichever branch you want to test.
- Historical daily campaign data — pulled from Redshift for the simulation date range, at campaign × date grain.
- ML models for clicks, leads, etc. — systems/catboost regressors that predict daily impressions/clicks/leads from daily budget + campaign features. The same models are used for all candidates so comparisons are fair.
- Metrics — per-candidate rollups (`avg_cpc`, `avg_cpl`, margin, …) from the replayed per-campaign-per-day simulation.
- Logging and visualization — systems/mlflow stores inputs and metrics for each candidate; MLflow's UI provides cross-candidate comparison without extra code.
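A hypothetical sketch of what such a YAML declaration could look like. The post only says the file declares the date range, experiment name, search type, per-parameter search space, and `max_evals`; all field names and values below are illustrative guesses, not Yelp's actual schema:

```yaml
experiment_name: algorithm_x_backtest
date_range:
  start: 2025-09-01
  end: 2025-09-30
search_type: bayesian        # alternatives per the post: grid, listed
max_evals: 25
parameters:
  alpha:
    type: real
    range: [-10, 10]         # example range from the post
  strategy:
    type: categorical
    values: [conservative, balanced, aggressive]
```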
The simulation loop, per candidate, per campaign, per day, mirrors the production daily process verbatim: at the beginning of the day the Budgeting submodule computes the daily budget and the on/off-platform split; throughout the day CatBoost models predict outcomes from the budget; at the end of the day the Billing submodule computes billing from the simulated outcomes. The day-by-day dependency is preserved (each day's budget depends on previous days' outcomes), which Yelp names as a "fundamental property" because small changes cascade across the billing period.
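A minimal Python sketch of that per-candidate, per-campaign replay loop. All function bodies here are toy stand-ins: the post does not name the Budgeting/Billing submodule entry points or the model interface, so `compute_daily_budget`, `predict_outcomes`, and `compute_billing` are hypothetical placeholders that exist only to show the day-to-day state dependency:

```python
# Hypothetical stand-ins for the production submodule calls and CatBoost models.
def compute_daily_budget(candidate, campaign, day, history):
    # Toy rule: spread the remaining monthly budget evenly over remaining days.
    spent = sum(h["billing"] for h in history)
    remaining_days = campaign["period_days"] - len(history)
    return max(campaign["monthly_budget"] - spent, 0.0) / remaining_days

def predict_outcomes(campaign, day, budget):
    # Toy "model": clicks roughly proportional to budget.
    return {"clicks": budget / campaign["cpc"]}

def compute_billing(campaign, day, budget, outcomes):
    return outcomes["clicks"] * campaign["cpc"]

def simulate_campaign(candidate, campaign, days):
    """Replay one campaign day by day; each day's budget sees prior outcomes."""
    history = []  # realized outcomes so far -- the day-by-day dependency
    for day in days:
        budget = compute_daily_budget(candidate, campaign, day, history)  # Budgeting
        outcomes = predict_outcomes(campaign, day, budget)                # ML models
        billing = compute_billing(campaign, day, budget, outcomes)        # Billing
        history.append({"day": day, "budget": budget,
                        "outcomes": outcomes, "billing": billing})
    return history
```

The point of the structure is that `history` feeds back into the next day's budget decision, so an algorithmic change on day 1 compounds across the whole billing period.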
CatBoost outputs average expected values; the engine samples integer outcomes via a Poisson distribution to mimic live-system randomness. Yelp explicitly flags this as "not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing."
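The post doesn't say which library performs the sampling; in practice this is a one-liner (e.g. `numpy.random.poisson`), but a stdlib-only sketch using Knuth's algorithm makes the mechanism explicit — an expected value like 12.3 becomes a realized integer count whose long-run mean is 12.3:

```python
import math
import random

def sample_poisson(expected, rng=random):
    """Knuth's algorithm: draw an integer count with the given expected value."""
    limit = math.exp(-expected)
    k, p = 0, 1.0
    while True:
        p *= rng.random()      # multiply uniforms until the product drops below e^-lambda
        if p <= limit:
            return k
        k += 1
```

Given equal budgets and identical models, two candidates would otherwise produce identical deterministic outcomes; sampling restores the run-to-run variance a live system exhibits.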
Key takeaways¶
- Back-testing is the filter, A/B testing is the validator. The Engine exists to "quickly filter out less ideal candidates and focus A/B tests only on the most promising ideas." A/B is preserved for final validation; discovery moves to back-testing. Canonical instance of concepts/filter-before-ab-test. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- Production code as Git Submodules is the fidelity lever. Verbatim: "our Engine uses the same code as production by including key repositories (like Budgeting and Billing) as Git Submodules. This lets us simulate current logic or proposed changes by pointing to specific Git branches." The engine does not re-implement production logic; it imports it. To test Algorithm X, engineers create a branch in the Budgeting repo, configure the submodule pointer, and run the simulation. Canonical instance of patterns/production-code-as-submodule-for-simulation. Verbatim benefit: "blurs the line between prototyping and production, streamlining our workflows." (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- ML-predicted outcomes make the simulation hybrid, not pure. Replaying historical allocations against alternative algorithms means the actual outcomes (clicks, leads) never happened under the new budget — they must be predicted. Yelp uses non-parametric CatBoost regressors rather than constant-cost assumptions to "accurately capture complex effects such as diminishing returns on budget". Using the same models for all candidates preserves fair comparison. Canonical instance of concepts/counterfactual-outcome-prediction and the parent pattern patterns/historical-replay-with-ml-outcome-predictor. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- Poisson sampling converts expected values into integer outcomes. The CatBoost models output averages (e.g. expected clicks = 12.3). The engine samples the realized integer count from a Poisson distribution parameterised by that expected value. This reintroduces the randomness of live systems, without which all candidates would look deterministically the same given equal budgets. Canonical instance of concepts/poisson-sampling-for-integer-outcomes. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- Bayesian optimization over a YAML-declared search space. The YAML file lists parameters, their ranges/categories, and `max_evals` (25 in the example). Scikit-Optimize proposes an initial random candidate, the engine simulates it, and the optimizer uses the returned `minimize_metric` (e.g. `average-cpl`) to propose the next candidate — an iterative loop that "learns from previous results to propose combinations more likely to optimize the target metric." Grid search and listed search are available as alternatives, but each is "just a wrapper that yields the next candidate" — no learning. Canonical instance of concepts/bayesian-optimization-over-parameter-space and patterns/yaml-declared-experiment-config. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- Day-by-day cascade is preserved verbatim. Each day's budget decision depends on previous days' realized outcomes. The Engine replays this day by day so small algorithmic changes compound across the billing period. Yelp calls this "a fundamental property to take into account" — any cheap aggregate-data approximation would miss the compounding effect. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- MLflow is the experiment store, not the training tracker. MLflow is used for its logging and visualization surface: every candidate's input parameters and output metrics get logged, and the UI provides cross-candidate comparison without extra coding. This is MLflow outside its usual model-training context — as a generic experiment database. New Seen-in for systems/mlflow at the ad-experimentation / back-testing substrate altitude, distinct from its prior Databricks LLM-eval Seen-ins. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
- Overfitting to history is named explicitly as a limitation. Verbatim: "Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation." Yelp's mitigation is not a technical fix — it's keeping A/B tests and real-world monitoring in the loop downstream of back-testing. Canonical instance of concepts/overfitting-to-historical-data in a production experimentation context. (Source: sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation)
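The ask/simulate/tell loop described in the Bayesian-optimization takeaway can be sketched as follows. This uses a stdlib random-search stand-in rather than Scikit-Optimize (whose `Optimizer` exposes the same `ask()`/`tell()` shape but learns from prior results); all class and parameter names are illustrative, not Yelp's:

```python
import random

class RandomSearchOptimizer:
    """Stdlib stand-in for Scikit-Optimize: same yield-candidate /
    report-metric loop shape, but proposals are random, not Bayesian."""
    def __init__(self, space, seed=0):
        self.space = space            # {param: (low, high) or [choices]}
        self.rng = random.Random(seed)
        self.results = []             # (candidate, metric) pairs

    def ask(self):
        return {name: (self.rng.uniform(*spec) if isinstance(spec, tuple)
                       else self.rng.choice(spec))
                for name, spec in self.space.items()}

    def tell(self, candidate, metric):
        self.results.append((candidate, metric))

    def best(self):
        return min(self.results, key=lambda r: r[1])

def run_search(optimizer, simulate, max_evals=25):
    """The driving loop: ask -> simulate -> tell, max_evals times."""
    for _ in range(max_evals):
        candidate = optimizer.ask()
        metric = simulate(candidate)   # e.g. average CPL from the replay
        optimizer.tell(candidate, metric)
    return optimizer.best()
```

Swapping in a Bayesian optimizer changes only `ask()`: instead of sampling blindly, each proposal conditions on the `(candidate, metric)` pairs already told to it.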
Operational numbers¶
- Hundreds of thousands of campaigns per month — Yelp's ad-system scale.
- Monthly budget cycle — advertisers set monthly budgets, which is the constraint that makes A/B tests slow.
- 25 max_evals — example Scikit-Optimize budget in the YAML sample.
- 5 × 3 × 10 = 150 — illustrative grid-search combinatorial fan-out from three parameters.
- -10 to +10 — example real-valued range for the `alpha` parameter.
- Daily granularity — campaign × date level, matching production-environment grain.
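The 5 × 3 × 10 = 150 grid-search fan-out is just the Cartesian product of the per-parameter value lists. A sketch with hypothetical parameter grids sized to match the post's example:

```python
import itertools

# Hypothetical parameter grids sized to match the post's 5 x 3 x 10 example.
grid = {
    "alpha": [-10, -5, 0, 5, 10],                            # 5 values
    "strategy": ["conservative", "balanced", "aggressive"],  # 3 values
    "weight": [round(0.1 * i, 1) for i in range(1, 11)],     # 10 values
}

# Every combination becomes one candidate dict -- and each must be simulated.
candidates = [dict(zip(grid, combo))
              for combo in itertools.product(*grid.values())]
print(len(candidates))  # 150
```

This is why the post contrasts grid search with Bayesian search: the grid's cost grows multiplicatively with each parameter, while Bayesian search is capped at `max_evals` simulations.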
Systems and primitives canonicalised¶
New systems¶
- systems/yelp-back-testing-engine — the named system.
- systems/yelp-ad-budget-allocation — the parent system the Engine simulates (splits campaign spend between on-platform Yelp inventory and the off-platform Yelp Ad Network).
- systems/scikit-optimize — the Bayesian-optimization library used as the optimizer.
- systems/catboost — the gradient-boosted regressor library used for outcome prediction.
New concepts¶
- concepts/hybrid-backtesting-with-ml-counterfactual — the overall shape (replay + ML-counterfactual-outcome).
- concepts/bayesian-optimization-over-parameter-space — sequential candidate selection using prior evaluation results.
- concepts/overfitting-to-historical-data — the named risk of any historical-replay methodology.
- concepts/filter-before-ab-test — the experimentation workflow position (pre-filter the hypothesis space).
- concepts/counterfactual-outcome-prediction — the sub-concept: outcomes that didn't happen under the new treatment are predicted from campaign features + new treatment.
- concepts/poisson-sampling-for-integer-outcomes — the trick that turns ML-predicted averages into realistic integer counts.
New patterns¶
- patterns/production-code-as-submodule-for-simulation — the Git-Submodules-pointing-at-branches fidelity primitive.
- patterns/historical-replay-with-ml-outcome-predictor — the full simulation-loop shape.
- patterns/yaml-declared-experiment-config — the config shape (date range, experiment name, search space per parameter, search type, max_evals).
Caveats¶
- Ad-specific domain — the specific trade-offs (advertiser-level A/B small samples, monthly budget cycles) are ad-tech constraints. The general pattern (back-testing as A/B filter + production-code-as-submodule + ML-counterfactual) generalises broadly, but the motivating pain may not.
- No latency numbers — how long does a single candidate simulate? How many parallel candidates can run at once? The post is silent.
- No ML-model accuracy numbers — Yelp says they "monitor these models to prevent overfitting, checking that performance is consistent between training and hold-out datasets" but no MAE / MAPE / R² numbers are disclosed.
- No scale numbers for the Engine itself — total candidates evaluated per month, MLflow experiment-store size, Redshift query volume.
- No diagram of how the 8 components wire together — two figures are referenced (Figure 1 campaign journey, Figure 2 system architecture) but the post text doesn't precisely specify how the optimizer-candidate-simulation loop is parallelized (per-campaign? per-day? per-candidate?).
- The third-party ad network (the Yelp Ad Network / off-platform partner) is unnamed architecturally — the ML model accounts for "external systems we don't directly control (e.g., partner ad networks)" as a source of prediction uncertainty, but the specific integration (API? daily batch?) is opaque.
- No A/B cross-validation numbers — how often do back-test winners also win in A/B? The post asserts the workflow works but provides no backward-looking concordance metric.
Relation to other wiki material¶
- Snapshot-replay agent evaluation (Databricks, patterns/snapshot-replay-agent-evaluation) — nearest conceptual sibling at a different altitude: Databricks replays historical agent traces against new agent code to filter candidates before human evaluation. Both are "replay history against new code as the pre-A/B filter," just in different domains (agent-eval vs ad-budget).
- Spark ETL checkpoint debugging (Yelp 2025-02-19, concepts/checkpoint-intermediate-dataframe-debugging) — a sibling Yelp primitive: also about inspecting intermediate state during iteration, but for pipeline debugging rather than experimentation.
- MLflow (systems/mlflow) — prior Seen-ins are Databricks LLM-evaluation substrates; this Yelp ingest adds the first non-LLM experiment-store Seen-in, reinforcing that MLflow's core primitive (track inputs + outputs per run, UI-compare across runs) is domain-general.
- AB-test rollout with percentile guardrails (patterns/ab-test-rollout) — Yelp's pattern sits upstream of this: back-test filters the candidate space, then A/B with percentile guardrails validates the winner on live traffic.
Contradiction / tension¶
- "Scientific rigor" claim vs ML-prediction dependency. Yelp positions the Engine as more rigorous than "back-of-the-envelope calculations using aggregate data" (which is true), but its fidelity is bounded by CatBoost model accuracy. They acknowledge this ("The accuracy of this methodology depends heavily on the quality and generalizability of the underlying ML models") but provide no model-quality numbers to let readers calibrate confidence. The Engine is more rigorous than aggregate math, but its absolute accuracy is unstated.
Source¶
- Original: https://engineeringblog.yelp.com/2026/02/how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-budget-allocation.html
- Raw markdown: raw/yelp/2026-02-02-how-yelp-built-a-back-testing-engine-for-safer-smarter-ad-bu-5fd01edf.md
Related¶
- companies/yelp
- systems/yelp-back-testing-engine
- systems/yelp-ad-budget-allocation
- systems/scikit-optimize
- systems/catboost
- systems/mlflow
- systems/amazon-redshift
- concepts/hybrid-backtesting-with-ml-counterfactual
- concepts/bayesian-optimization-over-parameter-space
- concepts/overfitting-to-historical-data
- concepts/filter-before-ab-test
- concepts/counterfactual-outcome-prediction
- concepts/poisson-sampling-for-integer-outcomes
- patterns/production-code-as-submodule-for-simulation
- patterns/historical-replay-with-ml-outcome-predictor
- patterns/yaml-declared-experiment-config
- patterns/snapshot-replay-agent-evaluation
- patterns/ab-test-rollout