CONCEPT Cited by 4 sources
Relevance labeling¶
Relevance labeling is the activity of assigning a graded score to a (query, document) pair that encodes how well the document satisfies the query. The labels are the supervised training signal for a learning-to-rank model and the gold-standard evaluation set for metrics like NDCG.
Core properties¶
- Per-pair, not per-document. Relevance is a function of both the query and the document; the same doc can be a 5 for one query and a 1 for another. Dropbox framing: "Relevance isn't a fixed property of a document; it depends on the specific query, the user's context, and the moment the search is made." (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search)
- Graded, not binary. Typical scales are 0–3 or 1–5 (Dash's choice); graded labels are the pre-condition for NDCG, which rewards near-perfect ordering and top-K accuracy. Binary relevance collapses the dynamic range.
- Context-dependent. User identity, time, prior session signal can all shift the ideal label — the label is a snapshot of a specific evaluation frame.
Label provenance classes¶
Three sources named in the Dash post, each with distinct cost / scale / coverage / privacy properties:
| Source | Cost | Scale | Coverage | Privacy |
|---|---|---|---|---|
| User behaviour (clicks, skips) | Free | Large | Sparse, biased by ranker | OK but incomplete |
| Human labelers | High | Low | Comprehensive, consistent | Can't cover customer data |
| LLM judges | Medium | Very large | Comprehensive, multilingual | Can cover customer data within compliance |
Dash explicitly demotes behaviour signal to a supplement and uses it as the sampling signal for behaviour-discrepancy sampling rather than the label itself. Primary label source: calibrated LLM judge (patterns/human-calibrated-llm-labeling).
Dash's 1–5 scale¶
"For the purposes of this article, relevance is treated as a graded score on a 1–5 scale. A score of 5 means the result closely matches what the user is trying to find, while a score of 1 means it isn't useful enough to show."
Scoring shape:
- 5 — closely matches intent; top-of-list candidate.
- 4 — strong match; should appear in top results.
- 3 — partial match; acceptable in longer lists.
- 2 — marginal match; usually should not show.
- 1 — not useful enough to show.
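The scoring shape above implies a display policy. A minimal sketch, assuming a hypothetical `should_show` helper and an invented list-length threshold for the "acceptable in longer lists" band — this is an illustration of the rubric's shape, not Dash's production logic:

```python
# Hypothetical encoding of the 1-5 scoring shape described above.
RUBRIC = {
    5: "closely matches intent; top-of-list candidate",
    4: "strong match; should appear in top results",
    3: "partial match; acceptable in longer lists",
    2: "marginal match; usually should not show",
    1: "not useful enough to show",
}

def should_show(label: int, list_length: int) -> bool:
    """Illustrative display policy: 4-5 always show, 3 only in longer lists."""
    if label >= 4:
        return True
    if label == 3:
        return list_length > 10  # "longer list" threshold is an assumption
    return False

print(should_show(5, 5))   # True
print(should_show(3, 5))   # False: partial match, short list
```

The asymmetry is the point: labels 1–2 exist to train the ranker on what *not* to surface, not just to order what it does surface.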
Why it's the bottleneck on RAG quality¶
Framing from the Dash post:
"The quality of search ranking—and the labeled relevance data used to train it—[is] critical to the quality of the final answer."
The chain:
- Answering LLM can consume only a small subset of the corpus (context window).
- Which subset it sees is decided by the ranker.
- The ranker is trained on relevance labels.
- Therefore label quality caps answer quality.
This is why Dropbox invests in the human-calibrated LLM labeling pipeline: every quality improvement propagates through the ranker into every RAG answer Dash produces.
Measurement of label quality¶
Inter-annotator agreement (human ↔ human): baseline realism. Dash notes "even humans—multiple humans—will disagree on the relevance set." Human-human disagreement sets a realistic ceiling on the agreement any judge can be expected to reach.
Judge-vs-human agreement (LLM ↔ human seed set): MSE on the 1–5 scale, range 0–16 (0 = exact agreement, 16 = max disagreement on a single pair). Small disagreements (4 vs 5) get small penalty; large ones (1 vs 5) get quadratically larger penalty. Dash's reported disagreement-reduction arc is measured in MSE.
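The agreement metric can be sketched in a few lines, assuming labels are plain integers on the 1–5 scale (function name is mine):

```python
def judge_human_mse(judge_labels: list[int], human_labels: list[int]) -> float:
    """Mean squared error between judge and human labels on the 1-5 scale.

    Per-pair squared error ranges 0 (exact agreement) to 16 (1 vs 5),
    so small disagreements (4 vs 5 -> 1) cost far less than big misses.
    """
    assert len(judge_labels) == len(human_labels) and judge_labels
    return sum((j - h) ** 2 for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

# One near-miss (4 vs 5) and one maximal miss (1 vs 5):
print(judge_human_mse([5, 4, 1], [5, 5, 5]))  # (0 + 1 + 16) / 3
```

The quadratic penalty is what makes the 0–16 range meaningful: driving MSE down mostly means eliminating large disagreements, which is exactly the failure mode calibration targets.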
Behaviour-vs-label consistency: flag pairs where users systematically click low-labeled docs or skip high-labeled docs; route to human review (see patterns/behavior-discrepancy-sampling).
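The consistency check can be sketched as a filter over aggregated pair stats. Field names and click-rate thresholds here are assumptions; the real sampling logic in patterns/behavior-discrepancy-sampling is richer than this:

```python
def flag_discrepancies(pairs, click_hi=0.5, click_lo=0.05):
    """Flag (query, doc) pairs where observed behaviour contradicts the label.

    pairs: iterable of dicts with 'label' (1-5) and 'click_rate' (0-1).
    Thresholds are illustrative, not from the source.
    """
    flagged = []
    for p in pairs:
        if p["label"] <= 2 and p["click_rate"] >= click_hi:
            flagged.append((p, "users click a low-labeled doc"))
        elif p["label"] >= 4 and p["click_rate"] <= click_lo:
            flagged.append((p, "users skip a high-labeled doc"))
    return flagged  # route these pairs to human review
```

Note the division of labour: behaviour never overwrites the label directly; it only nominates pairs where a human should re-check the rubric application.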
Relationship to NDCG¶
NDCG consumes relevance labels. It can't measure ranking quality without them. Hence label quality → NDCG validity → ranker-quality metric validity → trustworthy iteration.
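To make "NDCG consumes relevance labels" concrete, a minimal sketch using the common exponential-gain formulation (2^rel − 1, log2 discount); the sources don't specify which NDCG variant Dash uses, so treat the gain choice as an assumption:

```python
import math

def dcg(labels: list[int]) -> float:
    """Discounted cumulative gain: graded gain discounted by log2 of rank."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg(ranked_labels: list[int]) -> float:
    """NDCG: DCG of the actual ranking over DCG of the ideal (sorted) one."""
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

print(ndcg([5, 4, 3]))  # perfect ordering -> 1.0
print(ndcg([3, 4, 5]))  # inverted ordering -> below 1.0
```

The inputs are nothing but relevance labels in ranked order, which is the dependency chain in one line: noisy labels make both the numerator and the "ideal" denominator wrong, and the resulting NDCG stops measuring real ranking quality.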
Relationship to LLM-as-judge¶
LLM-as-judge is one source of relevance labels. The general pattern (a judge scoring any model output against a rubric) specialises here to a judge scoring documents against queries on a 1–5 relevance scale.
Caveats¶
- Labels are not ground truth. They're rubric applications. Rubric design matters more than labeler identity.
- Drift. Rubrics, product expectations, and content distributions all drift. Re-calibrate the seed set periodically.
- Privacy ceiling on humans. Humans can't review customer data → pure human labeling undersamples exactly the content distribution the production system serves. LLMs break the ceiling (within compliance).
- Multi-modal relevance. The 1–5 scale doesn't obviously generalise to images / video / audio. Dash names this as future work.
Seen in¶
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — canonical framing. 1–5 scale, per-(query,doc)-pair, three provenance classes, MSE-0–16 agreement metric, human-calibrated-LLM pipeline as the production shape.
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — upstream framing. NDCG scored against relevance labels; judge-vs-human disagreement at ~8% baseline reduced via prompt refinement + reasoning model + RAG-as-a-judge + DSPy.
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — model-adaptation edition. Same 1–5 relevance scale, measured via NMSE rescaled to 0–100 (rather than raw MSE 0–16). Demonstrates DSPy can retarget the labeling judge across models at preserved (or improved) agreement — cutting NMSE 45% on gpt-oss-120b — while growing training-label coverage 10–100× at fixed cost.
- sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma — visual-content edition (design frames, not text docs). Figma's eval pipeline labels (query, frame) pairs as correct / incorrect via a canvas-based labeling plugin built on Figma's own public plugin API — infinite canvas + keyboard shortcuts + historical run comparison. Eval set seeded from internal-designer interviews and file-browser usage analysis. Complementary to Dash's LLM-calibrated pipeline: Figma does not describe an LLM-as-judge extension (may be future work); the article anchors the concepts/similarity-tier-retrieval framing that label sets must cover exact / near / diverse tiers because users start from close matches and expand outward.
Related¶
- concepts/llm-as-judge — the primary label-generation mechanism at scale.
- concepts/rag-as-a-judge — judge augmented with retrieval to resolve work-context vocabulary.
- concepts/ndcg — consumes relevance labels as input.
- patterns/human-calibrated-llm-labeling — production shape for generating labels at scale.
- patterns/behavior-discrepancy-sampling — using user behaviour to target human review.
- patterns/judge-query-context-tooling — giving the judge tools so it can label organisation-specific queries correctly.
- systems/dash-relevance-ranker — the model trained on these labels.
- systems/dash-search-index — the retrieval surface the ranker orders.