

Behavior-discrepancy sampling

Behavior-discrepancy sampling is the pattern of prioritising evaluation effort on cases where LLM-predicted labels and observed user behaviour disagree, rather than labeling uniformly. The disagreements are the cases most likely to expose judge error, so they yield the highest return on investment in human review and prompt refinement.

Mechanics

  1. LLM judge scores the corpus. Every (query, document) pair in the candidate set has an LLM-assigned relevance score (e.g. 1–5 — see concepts/relevance-labeling).
  2. Overlay user-behaviour signals. Click-through, dwell time, and skip patterns from production traffic.
  3. Flag discrepancies. Two clear failure shapes:
       • Click on low-rated result. A user clicked a document the LLM scored 1–2. Either the LLM is wrong about relevance, or the UI presented a misleading snippet — either way, worth reviewing.
       • Skip high-rated result. Users consistently skipped a document the LLM scored 4–5. Either the LLM over-rated it, there is personalisation the judge can't see, or it's a ranking-position effect.
  4. Route to human review. Discrepant cases go to the human-labeling queue; cheap cases (LLM and behaviour agree) don't.
  5. Refine. Prompt updates, additional context, or added tools address the discrepancy-generating failure mode; re-score, re-check, iterate.
  6. Repeat until plateau. "The process is repeated iteratively until major sources of error are addressed or improvements plateau."
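The flagging step can be sketched as below. The field names (`llm_score`, `clicks`, `skip_rate`) and thresholds are illustrative assumptions, not from the post:

```python
def flag_discrepancies(candidates, low=2, high=4, skip_threshold=0.8):
    """Return (query, doc) candidates whose LLM score and user behaviour disagree."""
    flagged = []
    for c in candidates:
        # Failure shape 1: user clicked a doc the judge rated irrelevant.
        clicked_low = c["clicks"] > 0 and c["llm_score"] <= low
        # Failure shape 2: users consistently skipped a doc the judge rated relevant.
        skipped_high = c["skip_rate"] >= skip_threshold and c["llm_score"] >= high
        if clicked_low:
            flagged.append({**c, "reason": "click_on_low_rated"})
        elif skipped_high:
            flagged.append({**c, "reason": "skip_on_high_rated"})
    return flagged

queue = flag_discrepancies([
    {"query": "q1", "doc": "d1", "llm_score": 1, "clicks": 3, "skip_rate": 0.1},
    {"query": "q1", "doc": "d2", "llm_score": 5, "clicks": 0, "skip_rate": 0.9},
    {"query": "q2", "doc": "d3", "llm_score": 4, "clicks": 5, "skip_rate": 0.0},
])
# d1 and d2 are routed to human review; d3 (agreement) is not.
```

Only the flagged cases enter the human-labeling queue; everything else keeps its LLM label.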

Why uniform sampling is wasteful

Uniform sampling of candidates for human review would spend most of the budget on cases where LLM + behaviour agree — which are exactly the cases where the LLM probably wasn't wrong. Discrepancy sampling concentrates effort where the signal-to-noise on "the LLM mis-scored this" is highest.

Same logic as active learning in classical ML — pick the points near the decision boundary first — but with the behaviour signal as the cheap proxy for uncertainty rather than model entropy.
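A minimal sketch of budget-constrained selection under this logic, assuming a per-pair click-through rate as the behaviour proxy; the gap scoring and field names are hypothetical:

```python
def select_for_review(candidates, budget):
    """Spend a fixed human-review budget on the largest LLM-vs-behaviour gaps."""
    def gap(c):
        # Normalise the 1-5 LLM score to [0, 1] and compare with CTR;
        # a larger |gap| means a stronger discrepancy.
        return abs((c["llm_score"] - 1) / 4 - c["ctr"])
    return sorted(candidates, key=gap, reverse=True)[:budget]

picked = select_for_review([
    {"doc": "a", "llm_score": 5, "ctr": 0.05},  # high score, rarely clicked
    {"doc": "b", "llm_score": 1, "ctr": 0.60},  # low score, often clicked
    {"doc": "c", "llm_score": 4, "ctr": 0.70},  # rough agreement
], budget=2)
# picked -> docs "a" and "b"; the agreeing case "c" never consumes budget
```

Uniform sampling would instead spend two-thirds of a budget of 3 on pairs like "c", where the judge was probably right.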

Why behaviour signal alone isn't the label

Explicit framing in the Dropbox post:

"Signals from user behavior can still be helpful, but on their own they tend to be incomplete, influenced by existing rankings, and unevenly distributed. In practice, they work best as a supplement to labeled data rather than a replacement for it."

Specifically:

  • Ranking-position bias. A doc in position 10 gets clicked less than the same doc in position 2 regardless of relevance.
  • Sparse coverage. Most query–doc pairs never get shown → no behaviour signal.
  • Survivorship. Behaviour only reflects what the current ranker already showed; can't surface documents the ranker is missing.

So behaviour is the sampling signal, not the label itself. The label still comes from LLM judging (+ calibrated human review on discrepancies).
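One common mitigation for the ranking-position bias above (not described in the Dropbox post) is to debias clicks with estimated per-position examination propensities before comparing them with LLM scores. A sketch with made-up propensity numbers; in practice these come from a fitted click model:

```python
# Hypothetical examination probabilities: how likely a user is to even
# look at a result shown at each position.
EXAMINE_PROB = {1: 0.95, 2: 0.80, 3: 0.60, 5: 0.35, 10: 0.12}

def debiased_ctr(clicks, impressions_by_position):
    """Inverse-propensity-style CTR: weight impressions by examination odds."""
    effective_views = sum(
        n * EXAMINE_PROB.get(pos, 0.1)
        for pos, n in impressions_by_position.items()
    )
    return clicks / effective_views if effective_views else 0.0

# A doc shown 100 times at position 10 with 6 clicks looks weak on raw CTR
# but strong once position is accounted for:
raw = 6 / 100                            # 0.06
adjusted = debiased_ctr(6, {10: 100})    # 6 / (100 * 0.12) ~ 0.5
```

Without this correction, "skip on high-rated result" flags would over-fire on documents the ranker buries.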

Tradeoffs

  • Cold-start. Needs existing production behaviour signal, which requires a live ranker. Bootstrap with uniform sampling first.
  • Feedback loop. Discrepancy sampling on today's ranker surfaces today's failure modes. After refinement, the distribution of errors shifts; re-run discovery.
  • Ranking-bias confounding. A click-through disparity might reflect a UX issue (bad title or thumbnail) rather than a relevance misjudgement. Cross-check with other signals.
  • Privacy scope. Behaviour signals come from real user traffic; human review of the flagged cases must still honour the "limited, non-sensitive internal datasets" boundary — discrepancy flags don't lower the privacy bar.

When to reach for it

  • You have a production system emitting behaviour signal.
  • You have a secondary labeling source (LLM judge, model prediction) whose errors you want to find.
  • Your human-review budget is the binding constraint on label quality.
  • Uniform sampling has plateaued on judge-vs-human agreement.
