

Yelp Back-Testing Engine

Definition

The Back-Testing Engine is Yelp's internal simulation system for evaluating proposed changes to the Ad Budget Allocation algorithms against historical campaign data before promoting them to A/B tests. It replays the day-by-day budget-decision process of past campaigns through alternative algorithm code paths and predicts counterfactual outcomes via ML models — a hybrid of pure back-testing and simulation. Disclosed in the 2026-02-02 Yelp Engineering blog post.

Architecture — eight named components

  1. Parameter search space (YAML) — declares the simulation date_interval, experiment_name, and per-parameter search space. Example from the post:
date_interval:
  - '2025-12-01'
  - '2025-12-31'
experiment_name: 'algorithm_x_vs_status_quo'
searches:
  - search_type: 'scikit-opt'
    minimize_metric: 'average-cpl'
    max_evals: 25
    search_space:
      allocation_algo: skopt.space.Categorical(['status-quo', 'algorithm_x'])
      alpha: skopt.space.Real(-10, 10)
  2. Optimizer — systems/scikit-optimize for Bayesian search (the default), with grid search and listed search as alternatives. Bayesian is the only "true" optimizer; the others are "just a wrapper that yields the next candidate."

  3. Candidate — a {param: value} dict the optimizer yields. Example: {'allocation_algo': 'status_quo', 'alpha': 3.53}.

  4. Production repositories (Git Submodules) — the Budgeting and Billing repos are included as submodules, pointing at whichever branch is under test. Canonical instance of patterns/production-code-as-submodule-for-simulation.

  5. Historical daily campaign data (Redshift) — pulled for the simulation date range at campaign × date grain, matching the production environment.

  6. ML models for clicks, leads, etc. (systems/catboost) — predict expected impressions/clicks/leads from daily budget + campaign features. Same models for all candidates; non-parametric to capture "complex effects such as diminishing returns on budget."

  7. Metrics — per-candidate rollups (avg_cpc, avg_cpl, margin, …) from replaying each campaign for each day.

  8. Logging and visualization (systems/mlflow) — inputs + output metrics per candidate get logged; MLflow's UI provides cross-candidate comparison.
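The optimizer → candidate handshake (components 2–3) can be sketched in pure Python. This is a random-search stand-in for the scikit-optimize default; `next_candidate`, `run_search`, and the toy metric `fake_avg_cpl` are hypothetical names, not Yelp's code:

```python
import random

# Search space mirroring the YAML example above: one categorical
# parameter and one real-valued parameter.
SEARCH_SPACE = {
    "allocation_algo": ["status-quo", "algorithm_x"],  # categorical
    "alpha": (-10.0, 10.0),                            # real interval
}

def next_candidate(rng: random.Random) -> dict:
    """Yield the next {param: value} candidate dict."""
    return {
        "allocation_algo": rng.choice(SEARCH_SPACE["allocation_algo"]),
        "alpha": rng.uniform(*SEARCH_SPACE["alpha"]),
    }

def run_search(max_evals: int, evaluate, seed: int = 0):
    """Draw max_evals candidates, keep the one minimizing the metric."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_evals):
        cand = next_candidate(rng)
        metric = evaluate(cand)  # e.g. average CPL from one simulation run
        if best is None or metric < best[1]:
            best = (cand, metric)
    return best

# Invented toy metric: pretend algorithm_x with small |alpha| is cheapest.
def fake_avg_cpl(cand: dict) -> float:
    base = 5.0 if cand["allocation_algo"] == "algorithm_x" else 6.0
    return base + 0.1 * abs(cand["alpha"])

best_cand, best_cpl = run_search(max_evals=25, evaluate=fake_avg_cpl)
print(best_cand, round(best_cpl, 2))
```

The real Engine swaps the random draw for skopt's Bayesian suggestions, but the contract is the same: the search layer only yields candidate dicts, and every other component is candidate-agnostic.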

Simulation loop

For each candidate × each campaign × each day:

  • Beginning of day — Budgeting submodule (with candidate parameters) computes daily budget + on/off-platform split.
  • Throughout the day — CatBoost models predict impressions/clicks/leads from budget + campaign features. Poisson-sampled to integer counts.
  • End of day — Billing submodule computes billing from simulated outcomes.

Day-by-day dependency is preserved — each day's budget decision depends on prior days' outcomes. Yelp flags this as "a fundamental property to take into account" because small algorithmic changes compound across the billing period.
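A minimal sketch of this loop for one candidate × one campaign, assuming an even-pacing budget rule and a log-shaped lead model as stand-ins for the Budgeting submodule and the CatBoost predictors (all names and formulas here are illustrative, not Yelp's):

```python
import math
import random

def poisson_sample(rng: random.Random, lam: float) -> int:
    """Knuth's multiplication-based Poisson sampler (integer outcome counts)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def predict_leads(budget: float) -> float:
    # Toy non-linear model: diminishing returns on daily budget.
    return 2.0 * math.log1p(budget)

def simulate_campaign(days: int, monthly_budget: float, seed: int = 0):
    """Replay one campaign day by day; each day's budget decision
    depends on the prior days' simulated spend."""
    rng = random.Random(seed)
    spent, total_leads = 0.0, 0
    for day in range(days):
        # Beginning of day: even pacing of the remaining budget
        # (stand-in for the Budgeting submodule).
        budget = (monthly_budget - spent) / (days - day)
        # Throughout the day: expected leads from the model,
        # Poisson-sampled to an integer count.
        total_leads += poisson_sample(rng, predict_leads(budget))
        # End of day: the Billing stand-in charges the day's budget.
        spent += budget
    avg_cpl = spent / total_leads if total_leads else float("inf")
    return spent, total_leads, avg_cpl

spent, leads, avg_cpl = simulate_campaign(days=31, monthly_budget=310.0)
```

Because `budget` is computed from accumulated `spent`, a different allocation rule changes every subsequent day's decision, which is exactly the compounding effect the post flags.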

Position in experimentation workflow

The Back-Testing Engine is positioned upstream of A/B testing, not as a replacement:

  • Old workflow: hypothesis → A/B test → measure (wait ~1 month for monthly budget cycle).
  • New workflow: hypothesis → back-test (hours) → promising candidates → A/B test (still ~1 month, but on already-filtered candidates).

Canonical instance of concepts/filter-before-ab-test. The Engine does not eliminate A/B testing; it "quickly filter[s] out less ideal candidates and focus[es] A/B tests only on the most promising ideas, preserving A/B testing for final validation rather than discovery."

Reported benefits

Verbatim from the post:

  • Faster productionization — "blurs the line between prototyping and production, streamlining our workflows." The Git-submodule-pointing-at-branch primitive means no prototype→production translation phase.
  • Improved collaboration — "Scientists and engineers can now work side-by-side with production code, turning experiments into reusable, production-ready artifacts, rather than disconnected notebooks."
  • Increased prediction accuracy — ML-driven simulations capture non-linearities (diminishing returns on budget, variable CPC/CPL at different budget levels) that aggregate-data math misses.
  • System fidelity — day-by-day replay matches production granularity.
  • Early bug detection — "Running simulations across a broad set of real data helps us catch code bugs or edge cases that would be tricky to find with unit tests alone." The Engine doubles as a differential-testing harness against prod code.

Caveats (disclosed in the post)

  • Not a perfect predictor — historical data may not cover market/user/partner behaviour shifts.
  • Overfitting risk — "Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation." See concepts/overfitting-to-historical-data.
  • ML model dependency — accuracy of the Engine is bounded by accuracy of the CatBoost counterfactual-outcome models.

Scale

  • Hundreds of thousands of campaigns per month (Yelp ad system scale).
  • max_evals = 25 is the example Scikit-Opt budget.
  • Daily granularity per simulation tick.
  • Specific Engine-level throughput/latency numbers are not disclosed.
