CONCEPT Cited by 4 sources

Relevance labeling

Relevance labeling is the activity of assigning a graded score to a (query, document) pair that encodes how well the document satisfies the query. The labels are the supervised training signal for a learning-to-rank model and the gold-standard evaluation set for metrics like NDCG.

Core properties

  • Per-pair, not per-document. Relevance is a function of both the query and the document; the same doc can be a 5 for one query and a 1 for another. Dropbox framing: "Relevance isn't a fixed property of a document; it depends on the specific query, the user's context, and the moment the search is made." (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search)
  • Graded, not binary. Typical scales are 0–3 or 1–5 (Dash's choice); graded labels are the pre-condition for NDCG, which rewards near-perfect ordering and top-K accuracy. Binary relevance collapses the dynamic range.
  • Context-dependent. User identity, time, and prior session signals can all shift the ideal label; the label is a snapshot of a specific evaluation frame.

Label provenance classes

Three sources named in the Dash post, each with distinct cost / scale / coverage / privacy properties:

| Source | Cost | Scale | Coverage | Privacy |
| --- | --- | --- | --- | --- |
| User behaviour (clicks, skips) | Free | Large | Sparse, biased by ranker | OK but incomplete |
| Human labelers | High | Low | Comprehensive, consistent | Can't cover customer data |
| LLM judges | Medium | Very large | Comprehensive, multilingual | Can cover customer data within compliance |

Dash explicitly demotes behaviour signal to a supplement: it drives behaviour-discrepancy sampling rather than serving as the label itself. Primary label source: calibrated LLM judge (patterns/human-calibrated-llm-labeling).

Dash's 1–5 scale

"For the purposes of this article, relevance is treated as a graded score on a 1–5 scale. A score of 5 means the result closely matches what the user is trying to find, while a score of 1 means it isn't useful enough to show."

Scoring shape:

  • 5 — closely matches intent; top-of-list candidate.
  • 4 — strong match; should appear in top results.
  • 3 — partial match; acceptable in longer lists.
  • 2 — marginal match; usually should not show.
  • 1 — not useful enough to show.
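The rubric above can be encoded directly as data. A minimal sketch follows; the `SHOW_CUTOFF` of 3 is an assumption inferred from the "usually should not show" wording for grade 2, not a value the Dash post specifies.

```python
# Dash's 1-5 rubric as data, with a hypothetical display cutoff.
RUBRIC = {
    5: "closely matches intent; top-of-list candidate",
    4: "strong match; should appear in top results",
    3: "partial match; acceptable in longer lists",
    2: "marginal match; usually should not show",
    1: "not useful enough to show",
}

SHOW_CUTOFF = 3  # assumption: minimum grade worth displaying

def is_showable(grade: int) -> bool:
    """Return True if a labeled result clears the display cutoff."""
    if grade not in RUBRIC:
        raise ValueError(f"grade must be in 1-5, got {grade}")
    return grade >= SHOW_CUTOFF
```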

Why it's the bottleneck on RAG quality

Framing from the Dash post:

"The quality of search ranking—and the labeled relevance data used to train it—[is] critical to the quality of the final answer."

The chain:

  1. Answering LLM can consume only a small subset of the corpus (context window).
  2. Which subset it sees is decided by the ranker.
  3. The ranker is trained on relevance labels.
  4. Therefore label quality caps answer quality.

This is why Dropbox invests in the human-calibrated LLM labeling pipeline: every quality improvement propagates through the ranker into every RAG answer Dash produces.

Measurement of label quality

Inter-annotator agreement (human ↔ human): baseline realism. Dash notes "even humans—multiple humans—will disagree on the relevance set." Human-human agreement sets a realistic ceiling on the judge-vs-human agreement you can expect to achieve.

Judge-vs-human agreement (LLM ↔ human seed set): MSE on the 1–5 scale, range 0–16 (0 = exact agreement, 16 = max disagreement on a single pair). Small disagreements (4 vs 5) get small penalty; large ones (1 vs 5) get quadratically larger penalty. Dash's reported disagreement-reduction arc is measured in MSE.
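The MSE metric described above is straightforward to compute. A sketch, using the quadratic penalty exactly as stated (per-pair squared error from 0 to 16 on the 1–5 scale):

```python
def judge_human_mse(judge, human):
    """Mean squared error between judge and human grades on the 1-5 scale.

    Per-pair squared error ranges from 0 (exact agreement) to 16 (1 vs 5),
    so a small disagreement (4 vs 5 -> 1) costs far less than a large
    one (1 vs 5 -> 16).
    """
    if len(judge) != len(human):
        raise ValueError("label lists must be the same length")
    return sum((j - h) ** 2 for j, h in zip(judge, human)) / len(judge)
```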

Behaviour-vs-label consistency: flag pairs where users systematically click low-labeled docs or skip high-labeled docs; route to human review (see patterns/behavior-discrepancy-sampling).
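A sketch of that consistency check, assuming per-pair click rates are available; the threshold values here are illustrative assumptions, not numbers from the Dash post:

```python
def discrepancy_flags(pairs, click_hi=0.5, click_lo=0.1):
    """Flag (label, click_rate) pairs whose behaviour contradicts the label.

    Thresholds are assumed for illustration:
    - low label (<=2) but high click rate  -> users like what we demoted
    - high label (>=4) but low click rate  -> users skip what we promoted
    Flagged pairs would be routed to human review.
    """
    flagged = []
    for label, click_rate in pairs:
        if label <= 2 and click_rate >= click_hi:
            flagged.append((label, click_rate, "clicked despite low label"))
        elif label >= 4 and click_rate <= click_lo:
            flagged.append((label, click_rate, "skipped despite high label"))
    return flagged
```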

Relationship to NDCG

NDCG consumes relevance labels. It can't measure ranking quality without them. Hence label quality → NDCG validity → ranker-quality metric validity → trustworthy iteration.
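To make the dependency concrete, here is a minimal NDCG sketch over graded labels, using the common `2^rel - 1` gain and log2 discount (one standard formulation; variants exist):

```python
import math

def dcg(grades):
    """Discounted cumulative gain with the common 2^rel - 1 gain."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades_in_ranked_order, k=None):
    """NDCG: DCG of the actual ranking divided by DCG of the ideal
    (label-sorted) ranking, optionally truncated at rank k."""
    top = grades_in_ranked_order[:k] if k else grades_in_ranked_order
    ideal = sorted(grades_in_ranked_order, reverse=True)
    ideal = ideal[:k] if k else ideal
    best = dcg(ideal)
    return dcg(top) / best if best > 0 else 0.0
```

Note that the grades fed in are exactly the relevance labels: a mislabeled pair changes both the numerator and the ideal ordering, which is why label quality bounds NDCG validity.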

Relationship to LLM-as-judge

LLM-as-judge is one source of relevance labels. The general pattern (a judge scoring any model output against a rubric) specialises here to a judge scoring (query, document) pairs on a 1–5 relevance scale.
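That specialisation mostly lives in the prompt. A hypothetical sketch of a judge prompt builder; the rubric text paraphrases Dash's 1–5 scale, but the prompt structure is an assumption, not Dropbox's actual prompt:

```python
def build_judge_prompt(query: str, document: str) -> str:
    """Assemble a hypothetical relevance-judge prompt for one
    (query, document) pair, asking for a single 1-5 grade."""
    rubric = (
        "5 = closely matches what the user is trying to find\n"
        "4 = strong match; belongs in top results\n"
        "3 = partial match; acceptable in longer lists\n"
        "2 = marginal match; usually should not be shown\n"
        "1 = not useful enough to show"
    )
    return (
        "Rate how well the document satisfies the query on a 1-5 scale.\n"
        f"{rubric}\n\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Answer with a single integer from 1 to 5."
    )
```

Calibration against the human seed set (measured in MSE, as above) is what turns a prompt like this into a trustworthy label source.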

Caveats

  • Labels are not ground truth. They're rubric applications. Rubric design matters more than labeler identity.
  • Drift. Rubrics, product expectations, and content distributions all drift. Re-calibrate the seed set periodically.
  • Privacy ceiling on humans. Humans can't review customer data → pure human labeling undersamples exactly the content distribution the production system serves. LLMs break the ceiling (within compliance).
  • Multi-modal relevance. The 1–5 scale doesn't obviously generalise to images / video / audio. Dash names this as future work.
