
Lift metric

Definition

In an interleaving test of ranking A vs ranking B, the lift metric aggregates per-search (or per-user) preference into a single scalar.

lift = (wins_A − wins_B) / (wins_A + wins_B + α · ties)
  • wins_A = number of searches (or users) where ranking A accumulated more attributed events (e.g., clicks or bookings).
  • wins_B = the mirror count: searches (or users) where ranking B accumulated more attributed events.
  • ties = searches / users with equal attribution to A and B.
  • α ∈ [0, 1] is the tie weight — different conventions exist for how to normalise for ties. Expedia's post notes that "the results do not strongly depend on the normalization method."
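The formula translates directly into code. As a minimal sketch (α = 0.5 is an illustrative default here, since the source notes the normalisation convention varies):

```python
def lift(wins_a: int, wins_b: int, ties: int, alpha: float = 0.5) -> float:
    """Directional preference for ranking A over B in an interleaving test.

    alpha is the tie weight; alpha=0.5 is one common convention, but the
    exact choice is a hyperparameter (see Caveats below).
    """
    denom = wins_a + wins_b + alpha * ties
    if denom == 0:
        raise ValueError("no attributed events to compare")
    return (wins_a - wins_b) / denom

# Example: A wins 60 searches, B wins 40, 20 ties
# lift = (60 - 40) / (60 + 40 + 0.5 * 20) = 20 / 110 ≈ 0.18
```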

Properties:

  • lift = 0 ⇒ no user preference between A and B (the null hypothesis being tested for significance).
  • lift > 0 ⇒ users prefer A.
  • lift < 0 ⇒ users prefer B.
  • The metric captures direction, not magnitude, of user preference — not to be confused with CVR uplift which measures absolute change in conversion rate.

Aggregation levels

Expedia reports at two levels:

  • Per-search: each individual search produces a winning variant; aggregate across searches.
  • Per-user (Expedia's default): bucket searches by user and let each user cast one vote — users with mixed wins or no preference count as ties. Reduces the risk that a handful of heavy-searcher users dominate the metric.
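The per-user convention above can be sketched as follows. The function names and the `(user_id, winner)` input shape are illustrative, not Expedia's; the voting rule follows the text (a user with any mixed wins, or only ties, counts as a tie):

```python
from collections import Counter

def per_user_votes(search_results: list[tuple[str, str]]) -> tuple[int, int, int]:
    """Collapse per-search winners into one vote per user.

    search_results: (user_id, winner) pairs, winner in {"A", "B", "tie"}.
    Returns (wins_a, wins_b, ties) at the user level.
    """
    by_user: dict[str, Counter] = {}
    for user, winner in search_results:
        by_user.setdefault(user, Counter())[winner] += 1

    wins_a = wins_b = ties = 0
    for counts in by_user.values():
        a, b = counts["A"], counts["B"]
        if a > 0 and b == 0:       # user consistently preferred A
            wins_a += 1
        elif b > 0 and a == 0:     # user consistently preferred B
            wins_b += 1
        else:                      # mixed wins or only ties -> tie
            ties += 1
    return wins_a, wins_b, ties
```

Because each user casts exactly one vote, a heavy searcher with 500 sessions counts the same as a one-session user, which is the point of this aggregation level.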

Per-event-type split

Expedia tracks two lift metrics independently:

  • Click lift — based on property-detail-page views (click-through). Denser, higher-frequency; detects faster.
  • Booking lift — based on completed booking transactions. Rarer; closer to revenue; detects slower.

Reporting both "improves our understanding of the impact of rankings to both conversion and click-through rates."

Significance testing

lift = 0 is the null hypothesis: an observed lift is only actionable once it has been shown to be statistically distinguishable from zero.
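The source does not specify which test is used; an exact two-sided sign test on the win counts is one common choice for this null hypothesis, sketched here under that assumption:

```python
from math import comb

def sign_test_pvalue(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test: under H0 (lift = 0), each non-tied
    search/user is equally likely to prefer A or B (p = 0.5).
    Ties are dropped, a standard convention for sign tests.

    NOTE: the choice of test is an assumption; the source only states
    that lift = 0 is the null hypothesis.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for two-sidedness
    p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * p_one_sided)

# 60 wins vs 40 wins: p ≈ 0.057 — borderline at the 0.05 level
```

This also illustrates why booking lift "detects slower": with far fewer attributed events, n is smaller and the same lift magnitude yields a much larger p-value.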

Caveats

  • Normalisation choice is a hyperparameter. Different conventions for α (the tie weight) can shift the lift's magnitude, but Expedia reports that direction and significance are robust to the choice.
  • User-level reporting requires user attribution. Logged-out traffic with unstable identifiers pollutes the user-bucketing step.
  • Lift is not comparable across experiments with different baseline ranking quality or different query mixes; it's a within-experiment directional signal.
  • Lift is not CVR uplift. A lift of +0.1 doesn't mean 10 % more CVR; launch decisions need A/B rollouts for the absolute number.
