
LYFT 2026-03-25


Lyft — Beyond A/B Testing: Using Surrogacy and Region-Splits to Measure Long-Term Effects in Marketplaces

Summary

Lyft's Foundational Models team (Amber Wang and Yoonji Kim) describes the methodology it uses to estimate the long-term effects of resource-allocation decisions in Lyft's two-sided marketplace — e.g. "if we increase driver incentive spending by x% in week 1, what is the cumulative impact on rides over the following N weeks?" In a multi-sided marketplace, such decisions create both direct long-term effects (driver retention) and harder-to-measure market-mediated long-term effects (more driver hours → less surge → better rider experience → future rider retention; or more driver hours → more driver idleness → lower driver retention). Classical user-split A/B testing cannot observe market-mediated effects because treatment and control share the same market. Lyft's framework combines (1) a surrogacy approach (two-step observational causal inference mapping policy → short-term negative user experience → future behaviour), and (2) region-split experiments to verify end-to-end long-term effects, with a forward-selection algorithm to pick treated/control regions that maximise pre-intervention fit and expected statistical power.

Key takeaways

  1. Market-mediated effects named as a first-class problem. In a multi-sided marketplace, a policy change on one side propagates through shared market state to the other side. Lyft calls these "market-mediated" long-term effects and distinguishes them from the easier-to-estimate "direct" long-term effects (e.g. incentive → driver retention) that standard user-split A/B testing can measure. Canonical statement of why market-mediated effects are the hard part of long-term-effect measurement in marketplaces.

  2. Surrogacy as a two-step decomposition. Lyft decomposes the mediated long-term effect into: (Step 1) policy decision → distribution of negative user experiences (wait time, surge, cancellations; driver earnings, idleness, incentives); (Step 2) negative user experiences → future outcomes (future rides, retention, driver hours). The composition, via a surrogacy index, links today's decision to long-run business impact. Canonical wiki instance of the surrogacy approach as a practical long-term-effect estimator. Key assumption flagged verbatim in the post: "market-mediated long-term effects are completely mediated by short-term negative user experiences."

  3. Step 1 estimator: residualised regression on deviation from normal. Negative user experiences are cyclical and seasonal (time-of-week, holidays, weather, supply/demand). Lyft models them as deviations from the market's own baseline, with residualised regression that controls for remaining market information (supply + demand). The policy coefficient reads as an elasticity around everyday operating conditions. Output is a calibrated response function — a distribution-level forecast, not just a mean, with uncertainty.

  4. Step 2 estimator: AIPW (doubly-robust observational causal inference). Lyft uses Augmented Inverse Probability Weighting (Chernozhukov et al., 2021), combining (i) a propensity model for exposure (how likely is a given level of negative user experience, given context) and (ii) outcome models for future metrics conditional on confounders. Doubly-robust = the estimator stays consistent if either model is correctly specified. Produces a surrogacy index that scales short-term exposure to long-term impact. Canonical wiki instance of AIPW used for surrogate-to-outcome mapping.

  5. Multi-imperfect-verification architecture for the two steps. No single experiment verifies the whole chain; Lyft verifies each step with its own purpose-built experiment:
       • Step 1 is verified by switch-back experiments that alternate policy settings across comparable time slots and compare modelled vs. observed lifts in negative user experience.
       • Step 2 is verified by user-split experiments that perturb negative experiences and check calibration (predicted vs. observed) on future outcomes.
       • End-to-end behaviour is verified by region-split experiments that apply a policy shock to a whole market and track treated-vs.-control divergence.

  6. Region-split is necessary because market-mediated effects need a whole market. User-split experiments cannot observe market-mediated effects — the treated and control groups share the same rider/driver pool, so supply/demand shifts induced by treatment also affect control. A region-level split is the only design that lets the market itself respond to treatment, so the measured delta includes both the direct and the mediated long-term effects.

  7. Forward-selection algorithm for region-split design. Region-split experiments "in general suffer from poor pre-intervention fit and low power." Lyft's design-time contribution, inspired by the forward difference-in-differences (FDiD) approach (Li, 2024): starting from a single treated region, iteratively add treated regions that best improve pre-period fit and expected statistical power. Canonical wiki instance of forward-selection experiment design.

  8. Observational inference keeps the framework cheap to update. Both steps use observational causal inference rather than rolling experiments, so the surrogacy index can be re-estimated cheaply as the market evolves. Experiments are reserved for verification and calibration of the observational estimates, not for primary estimation. Architecturally, this separates a fast-updating observational estimation pipeline from a lower-frequency experimental verification pipeline — a design stance worth isolating on its own.

  9. Direct + market-mediated composed into a single forecast. The direct long-term effect is estimated separately; Lyft combines it with the market-mediated effect from Steps 1–2 using "a transparent formula grounded in market mechanics" to yield a single policy-level forecast for long-run rides and financials. The region-split experiment validates the composed forecast. The post does not share the formula, but the architecture is clear: observational estimation → composition → region-split verification.
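The Step 1 residualised regression described above can be sketched end-to-end on simulated data. A minimal sketch, assuming a sinusoidal time-of-week baseline and linear supply/demand controls; the variable names and functional forms are illustrative assumptions, not Lyft's disclosed implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500  # hourly observations of one market

# Negative-UX metric (e.g. wait time) is cyclical: simulate it as a
# time-of-week baseline plus deviations driven by market state and policy.
hours = np.arange(n)
baseline = 10 + 2 * np.sin(2 * np.pi * hours / 168)   # weekly cycle
supply = rng.normal(0, 1, n)                          # driver-hours deviation
demand = rng.normal(0, 1, n)                          # request-volume deviation
policy = rng.normal(0, 1, n)                          # incentive-spend deviation
true_elasticity = 0.5
neg_ux = (baseline + 0.3 * supply - 0.4 * demand
          + true_elasticity * policy + rng.normal(0, 0.1, n))

# (a) Learn the market's own baseline and work with deviations from it.
season = np.column_stack([np.sin(2 * np.pi * hours / 168),
                          np.cos(2 * np.pi * hours / 168)])
resid_ux = neg_ux - LinearRegression().fit(season, neg_ux).predict(season)

# (b) Residualised regression: regress the deviation on the policy lever
# while controlling for remaining market information (supply + demand).
controls = np.column_stack([policy, supply, demand])
step1 = LinearRegression().fit(controls, resid_ux)
elasticity = step1.coef_[0]  # reads as an elasticity around normal operations
```

The policy coefficient recovers the simulated elasticity because seasonality is removed before regressing and the confounding market state is controlled for; Lyft's real estimator additionally outputs a calibrated distribution-level forecast with uncertainty, which this sketch omits.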
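The switch-back verification of Step 1 fits in a few lines. A toy sketch with hourly slots and a known true lift; the 0.8 "predicted lift" is a stand-in for the residualised-regression forecast, not a number from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n_slots = 336  # two weeks of hourly time slots in one market

# Alternate the policy setting across time slots (ABAB...); real designs
# randomise or rotate slot assignment to keep the two arms comparable.
setting = np.arange(n_slots) % 2           # 0 = baseline, 1 = perturbed policy
neg_ux = 5.0 + 0.8 * setting + rng.normal(0, 0.5, n_slots)

# Observed lift in negative user experience from the switch-back...
observed_lift = neg_ux[setting == 1].mean() - neg_ux[setting == 0].mean()

# ...compared with the Step 1 model's predicted lift for this policy delta.
predicted_lift = 0.8   # placeholder for the residualised-regression forecast
calibration_gap = observed_lift - predicted_lift
```

A well-calibrated Step 1 model keeps `calibration_gap` within the switch-back's noise floor; large gaps flag model drift or unmodelled carryover between slots.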
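Step 2's AIPW estimator has a compact canonical form. A minimal sketch on simulated data with a binary exposure; the confounder set, linear/logistic nuisance models, and the absence of cross-fitting are all simplifications relative to Chernozhukov et al. (2021):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
n = 5000

# Context / confounders: market conditions when the user was exposed.
X = rng.normal(0, 1, (n, 2))

# Exposure: did the rider hit a negative experience (e.g. a long wait)?
# Its probability depends on context, so naive comparisons are confounded.
logit = 0.5 * X[:, 0] - 0.5 * X[:, 1]
T = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Future outcome (e.g. rides over the following weeks); true effect = -2.
Y = 10 + X[:, 0] + 0.5 * X[:, 1] - 2.0 * T + rng.normal(0, 1, n)

# (i) Propensity model: how likely was this exposure, given context?
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# (ii) Outcome models: future metric conditional on confounders, per arm.
m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
m0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW combines both; it stays consistent if either (i) or (ii) is right.
# (Production use would cross-fit the nuisance models.)
aipw_effect = np.mean(m1 - m0
                      + T * (Y - m1) / e_hat
                      - (1 - T) * (Y - m0) / (1 - e_hat))
```

In the surrogacy framework, this per-mediator effect, scaled by how much a policy moves the mediator in Step 1, is what rolls up into the surrogacy index.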

Architectural primitives extracted

  • Surrogacy in causal inference — the design stance of estimating a long-term effect via a short-term mediator whose relationship to the outcome can be modelled separately. Lyft's two-step decomposition (policy → neg-UX → future behaviour) is the canonical wiki instance.
  • Market-mediated long-term effects — in a multi-sided marketplace, the indirect effects of a policy change that propagate through shared market state rather than directly to the targeted users.
  • Residualised regression — regression on deviations from a learned baseline, used so cyclical/seasonal variation doesn't confound a policy-effect coefficient.
  • AIPW — doubly-robust causal estimator combining a propensity model and an outcome model; consistent if either is correctly specified.
  • Switch-back experiment — time-based experiment that alternates policy settings across comparable time slots in the same market. Used to verify Step 1.
  • Region-split experiment — geography-based experiment applying a policy to a subset of markets and comparing to control markets. Used to verify end-to-end long-term effects including market mediation.
  • User-split experiment — classical randomised A/B on individual users; used here to verify Step 2 (exposure → outcome) because at the individual level the market-mediation confound is small. User-split cannot validate market-mediated claims.
  • Two-step surrogacy estimator for long-term effects — the composed pattern: Step 1 (policy → short-term mediator via residualised regression + switch-back verification) + Step 2 (mediator → outcome via AIPW + user-split verification) + composition into long-term-effect forecast + region-split end-to-end verification.
  • Forward-selection experiment design — greedy, pre-period-fit-aware algorithm for picking treated/control regions in a region-split experiment; inspired by forward difference-in-differences (FDiD, Li 2024).
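The forward-selection primitive can be sketched as a greedy loop. A simplified sketch that scores candidates only on pre-period fit (RMSE between treated-average and control-average series); Lyft's actual algorithm also weighs expected statistical power, and the full method follows FDiD (Li, 2024) rather than this toy:

```python
import numpy as np

def forward_select_treated(pre_period, seed_region, n_treated):
    """Greedily grow the treated set for a region-split experiment.

    pre_period maps region -> 1-D array of a pre-intervention metric
    (e.g. weekly rides). Starting from a seed treated region, repeatedly
    add whichever region most improves pre-period fit, measured here as
    RMSE between the treated-average and control-average series.
    """
    regions = list(pre_period)
    treated = [seed_region]
    while len(treated) < n_treated:
        best_region, best_rmse = None, np.inf
        for r in regions:
            if r in treated:
                continue
            candidate = treated + [r]
            control = [c for c in regions if c not in candidate]
            t_avg = np.mean([pre_period[c] for c in candidate], axis=0)
            c_avg = np.mean([pre_period[c] for c in control], axis=0)
            rmse = float(np.sqrt(np.mean((t_avg - c_avg) ** 2)))
            if rmse < best_rmse:
                best_region, best_rmse = r, rmse
        treated.append(best_region)
    return treated

# Toy usage: pick 3 treated regions out of 8 simulated markets.
rng = np.random.default_rng(3)
markets = {f"region_{i}": 100 + rng.normal(0, 5, 52) for i in range(8)}
chosen = forward_select_treated(markets, "region_0", 3)
```

Scoring fit against the control average at each step is what gives the design its pre-intervention parallelism; a power term (not shown) would additionally penalise treated sets that are too small or too noisy to detect the expected effect.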

Operational numbers from the post

  • None quantified. The post is a methodology disclosure, not a results disclosure: no concrete elasticities, no before/after forecast errors, no region counts, no dollar figures, no latency or pipeline-runtime numbers, no sample sizes, no experiment cadence. It includes one region-split illustration, but that example is built on simulated data.

Caveats

  • Methodology paper voice, not retrospective. No postmortem of a specific pricing / incentive decision; no before/after disclosure of forecast error; no proof that the surrogacy index matches long-run observed outcomes in Lyft's own data. The evidence that the framework works is structural (two steps, each independently verifiable; verification methods named; forward selection inspired by a published algorithm).
  • Key assumption is strong. "Market-mediated long-term effects are completely mediated by short-term negative user experiences" is the load-bearing identification assumption. Any long-term channel that does not route through today's negative user experience (e.g. a slow trust shift from news coverage, brand image, second-order competitive response) is invisible to Steps 1–2. The end-to-end region-split verification is what catches assumption violations — but only on the small set of policy shocks that get region-tested.
  • Verification experiments are imperfect in different ways. Switch-back cannot rule out longer-than-slot carryover effects; user-split cannot validate market-mediation; region-split has low power and poor pre-fit — Lyft's forward-selection algorithm addresses the pre-fit and power problem but doesn't eliminate the noise floor. The post is explicit that "there is no single form of experiment that can provide a perfect verification."
  • Observational inference depends on the right confounders. Step 2's AIPW is doubly-robust, but doubly robust isn't omniscient: if a confounder is missing from both model specifications (propensity and outcome), AIPW is biased. In a marketplace with weather, events, competitor pricing, and macroeconomic shocks, the unobserved-confounder surface is non-trivial.
  • "Negative user experiences" taxonomy is not enumerated exhaustively. The post cites wait time, surge, cancellations (rider) and hourly earnings, idleness, incentive earnings (driver) but does not specify the full mediator set or its dimensionality, nor whether rare / extreme experiences are modelled separately.
  • No systems-infrastructure disclosure. The post is purely methodology — no mention of the pipeline substrate, feature store, experiment platform, causal-inference library, or compute shape. The wiki ingests this on the experimentation-infrastructure axis; the implementation shape is inferable from prior Lyft sources (e.g. systems/lyft-feature-store, systems/lyftlearn) but not disclosed here.
  • Direct long-term effect estimation is elided. "The direct long term effect is estimated separately" — how, with what estimator, and under what identification strategy is not covered in this post. The composition formula "grounded in market mechanics" is also not shared.
