Using LLMs to amplify human labeling and improve Dash search relevance¶
Summary¶
Dropbox Tech post on how Dash trains the search relevance model that sits under its retrieval tool — specifically, where the labeled relevance judgements come from. Framing: the ranker is XGBoost-class learning-to-rank (not rules), relevance is a graded 1–5 score per (query, document) pair, and the training-data pipeline is a two-stage, human-calibrated LLM-labeling loop. A small internal team hand-labels a seed dataset; that seed calibrates an LLM judge against human judgments; the tuned LLM then generates hundreds of thousands to millions of labels to train the production ranker — "orders of magnitude smaller [human labels] than what would be required for full training," and the LLM is "a force multiplier for human effort." The evaluation metric for judge-vs-human agreement is mean squared error (MSE) on the 1–5 scale (range 0 to 16). The post positions this labeling pipeline as upstream of the disagreement-reduction judge arc covered in the [[sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash|2026-01-28 Clemm transcript]] and introduces two new mechanisms for improving LLM-label quality: behaviour-discrepancy sampling (prioritise the click-but-low-score / skip-but-high-score cases for review and prompt refinement) and query-context tooling for the judge (give the judge retrieval tools so it can research Dropbox-internal acronyms like "diet sprite" before scoring). Complements Clemm's arc: Clemm described the LLM-as-judge evaluation loop; this post describes the LLM-as-labeler training-data loop that funds it.
Key takeaways¶
- Relevance is the binding constraint on RAG answer quality. Because only a small subset of the retrieved corpus can be passed to the answering LLM, "the quality of search ranking—and the labeled relevance data used to train it—[is] critical to the quality of the final answer." (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search)
- XGBoost over learned features, not hand-tuned rules. Dash's ranker is trained on (query, document, relevance-label) examples; signal weighting is learned via gradient boosting. Extends systems/dash-search-index — the ranker is the layer that consumes hybrid BM25 + dense-vector candidate sets and orders them for context-window insertion.
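The post names the algorithm class but neither the feature set nor the API, so as a minimal sketch: each training example is a (query, document) feature row with a graded label, grouped by query, and serving is just ordering one query's candidates by model score. The feature names and stub weights below are hypothetical, not from the post.

```python
# Hypothetical learning-to-rank data shape: (query, document) pairs with
# features from the hybrid candidate set and a graded 1-5 relevance label.
train_rows = [
    # qid groups rows belonging to the same query, as LTR libraries require
    {"qid": 0, "bm25": 12.3, "dense_sim": 0.81, "label": 5},
    {"qid": 0, "bm25": 9.1,  "dense_sim": 0.44, "label": 2},
    {"qid": 1, "bm25": 4.2,  "dense_sim": 0.90, "label": 4},
]

def stub_ranker(row):
    """Stand-in for the trained gradient-boosted model: any function that
    maps a feature row to a score. Weights here are illustrative only."""
    return 0.05 * row["bm25"] + 1.0 * row["dense_sim"]

def rank_candidates(rows, qid):
    """Order one query's candidate set by model score, best first --
    the ordering that decides which documents reach the context window."""
    candidates = [r for r in rows if r["qid"] == qid]
    return sorted(candidates, key=stub_ranker, reverse=True)

best_first = rank_candidates(train_rows, qid=0)  # label-5 doc ranks first
```

In production this scoring function would be an XGBoost-class ranker trained on the LLM-generated labels; the point of the sketch is only the data shape and the serve-time role.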
- Graded 1–5 relevance scale, context-dependent. "Relevance isn't a fixed property of a document; it depends on the specific query, the user's context, and the moment the search is made." 5 = closely matches intent; 1 = not useful enough to show.
- Three label provenance classes are named and compared — user-behaviour-inferred (click / skip; cheap, biased, sparse); human-labeled (comprehensive + consistent but expensive, slow, and blocked on customer-data privacy); and LLM-judged (cheap, consistent, multilingual, can analyse customer content within compliance boundaries). Behavioural signals are explicitly demoted to supplement, not replacement.
- Human labeling is kept — but only as the calibration seed. A small team of human evaluators labels "a dataset that is orders of magnitude smaller than what would be required for full training"; those labels tune LLM prompts + model parameters; once judge quality meets threshold, the LLM generates the production training set. Formalised as patterns/human-calibrated-llm-labeling — a ~100× force-multiplier in the Dash diagram. Human review is done on internal, non-sensitive datasets only; no customer data is human-reviewed.
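The two-stage loop can be sketched as follows; `llm_judge`, the prompt-selection step, and the acceptance threshold are hypothetical stand-ins, since the post gives neither the judge interface nor the quality bar.

```python
def mse(pairs):
    """Mean squared error over (human_score, llm_score) pairs on the 1-5 scale."""
    return sum((h, l) and (h - l) ** 2 for h, l in pairs) / len(pairs)

def llm_judge(query, doc, prompt):
    """Hypothetical stand-in for the LLM judge; a real implementation
    would call a model with the current prompt."""
    return 3  # placeholder score

def calibrate_then_scale(human_seed, unlabeled, prompts, threshold=0.5):
    """Stage 1: pick the prompt whose judge best agrees with the small
    human-labeled seed set (lowest MSE). Stage 2: once agreement meets
    the (hypothetical) threshold, generate the bulk training set."""
    best_prompt = min(
        prompts,
        key=lambda p: mse([(label, llm_judge(q, d, p)) for q, d, label in human_seed]),
    )
    seed_mse = mse([(label, llm_judge(q, d, best_prompt)) for q, d, label in human_seed])
    if seed_mse > threshold:
        return None  # keep refining prompts before scaling up
    return [(q, d, llm_judge(q, d, best_prompt)) for q, d in unlabeled]
```

The human seed stays small (the ~100× multiplier); only the final list, generated by the calibrated judge, is large enough to train the ranker.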
- Latency + context-window rule out LLMs at query time. "Using LLMs directly at query time to replace traditional ranking models is not currently feasible due to context window limitations and latency constraints. Instead, Dash uses LLMs offline to generate high-quality training data." The LLM is a teacher for a small efficient production ranker — separation of training-time vs serving-time intelligence. Cognate with concepts/training-serving-boundary style reasoning.
- MSE on the 1–5 scale is the judge-vs-human agreement metric. "Mean squared error (MSE), where the error ranges from 0 for exact agreement to 16 for the maximum possible disagreement" — small disagreements (4 vs 5) get small penalty, large disagreements (1 vs 5) get quadratically larger penalty. The full disagreement-reduction arc tracked against MSE plots prompt-refinement + reasoning-model + query-context-tooling + DSPy as four successive drops. Complements the named-step arc from sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash (which reported ~8% starting disagreement count, no MSE numbers).
- Behaviour-discrepancy sampling focuses evaluation effort. "Users clicking on documents the LLM rated as low relevance, or consistently skipping documents the LLM rated as highly relevant" are flagged and routed to human review + prompt refinement. Bias the training set toward cases most likely to surface errors rather than uniform sampling. Formalised as patterns/behavior-discrepancy-sampling.
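As a sketch (field names and thresholds are illustrative, not from the post), the two discrepancy classes can be flagged like this:

```python
def flag_discrepancies(events, low=2, high=4):
    """Route (query, document) pairs to human review when user behaviour
    contradicts the LLM judge: clicked-but-scored-low, or shown-and-
    skipped-but-scored-high. Thresholds here are hypothetical."""
    flagged = []
    for e in events:  # e: {"query", "doc", "clicked", "llm_score"}
        click_low = e["clicked"] and e["llm_score"] <= low
        skip_high = not e["clicked"] and e["llm_score"] >= high
        if click_low or skip_high:
            flagged.append(e)
    return flagged

events = [
    {"query": "q", "doc": "a", "clicked": True,  "llm_score": 1},  # click/low
    {"query": "q", "doc": "b", "clicked": False, "llm_score": 5},  # skip/high
    {"query": "q", "doc": "c", "clicked": True,  "llm_score": 5},  # agreement
]
to_review = flag_discrepancies(events)  # flags docs "a" and "b" only
```

Sampling review effort from `to_review` rather than uniformly is the whole point: these are the pairs most likely to expose judge errors.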
- Judge query-context tooling solves the acronym problem programmatically. Example named in post: at Dropbox "diet sprite" is an internal performance-management tool, not a beverage. "Dash provides LLMs with tools that allow them to research query context before assigning relevance labels." The judge runs additional searches to disambiguate internal terminology before scoring — makes it behave like a human evaluator who'd consult internal tools. Extends concepts/rag-as-a-judge: the 2026-01-28 post framed judge retrieval as a step; this post adds that the judge is given tools (search queries, internal lookups), not just a static knowledge base. Formalised as patterns/judge-query-context-tooling.
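A minimal sketch of the tool loop, with a stub internal-search tool standing in for whatever retrieval surface Dash actually exposes (the post does not enumerate the tools):

```python
# Hypothetical internal glossary the tool would consult; "diet sprite" is
# the post's example of a Dropbox-internal term a naive judge would misread.
INTERNAL_INDEX = {"diet sprite": "internal performance-management tool"}

def search_tool(term):
    """Stub for the judge's query-research tool; the real judge would run
    additional Dash searches."""
    return INTERNAL_INDEX.get(term.lower())

def judge_with_context(query, doc, score_fn):
    """Before scoring, let the judge research unfamiliar query terms and
    fold what it finds into the scoring context."""
    context = search_tool(query) or ""
    return score_fn(query, doc, context)

def stub_score(query, doc, context):
    """Illustrative scorer: with the expansion in hand, a doc about the
    internal tool rates highly; without it, the beverage reading wins."""
    return 5 if context and context in doc else 1

score = judge_with_context(
    "diet sprite",
    "runbook for the internal performance-management tool",
    stub_score,
)  # -> 5: the researched context disambiguates the acronym
```

The mechanism mirrors what a human evaluator would do: consult internal lookups first, score second.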
- DSPy named explicitly in the article as the prompt-optimisation framework used to automate prompt refinement against human-labeled examples. "DSPy can automatically refine prompts to better match human judgments." MSE is the optimisation objective. Consistent with patterns/prompt-optimizer-flywheel but here the driver is reducing MSE rather than minimising a disagreement-bullet set — complementary framings of the same loop.
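DSPy optimizers take a metric callable that scores (example, prediction) pairs, higher-is-better, so an MSE objective goes in negated. A sketch of just the metric — the trailing `trace` argument follows DSPy's metric convention, while the dict fields are illustrative:

```python
def neg_mse_metric(example, pred, trace=None):
    """Higher-is-better agreement score for a prompt optimizer: negate the
    squared gap between the human label and the judge's 1-5 prediction.
    The (example, pred, trace) shape follows DSPy's metric convention;
    the "score" field is a hypothetical stand-in for the real outputs."""
    return -((example["score"] - pred["score"]) ** 2)

best = neg_mse_metric({"score": 4}, {"score": 4})   # exact agreement -> 0
worst = neg_mse_metric({"score": 1}, {"score": 5})  # maximal gap -> -16
```

Passed to a DSPy optimizer over the human-labeled seed set, maximising this metric is equivalent to the MSE minimisation the post describes.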
- Evaluation-function framing is load-bearing. Post draws two analogies: a chess engine pruning search branches via position-evaluation quality, and gradient-descent's dependence on gradient-signal accuracy. Both frame MSE-on-LLM-labels as the bottleneck: "Progress depends entirely on whether the evaluation signal accurately reflects improvement."
- Grounding is structural, not one-shot. "Because LLM-generated labels are grounded in human-reviewed reference data, they can be continuously monitored, stress-tested, and re-calibrated as models, prompts, and product requirements change." The human seed set is the stable anchor against which prompt drift, model updates, and product requirement shifts are measured. Closed-loop mechanism for resisting judge drift over time.
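One way to operationalise the anchor, as a sketch (the threshold is hypothetical): re-score the frozen human seed set after every prompt or model change and alarm when MSE regresses.

```python
def seed_mse(judge, seed):
    """Judge-vs-human MSE over the frozen human-labeled anchor set."""
    return sum((label - judge(q, d)) ** 2 for q, d, label in seed) / len(seed)

def recalibration_check(judge, seed, threshold=1.0):
    """Run after any prompt, model, or requirement change; True means the
    new judge still tracks the stable human anchor. The threshold is a
    hypothetical operating point, not a number from the post."""
    return seed_mse(judge, seed) <= threshold

seed = [("q1", "d1", 5), ("q2", "d2", 2)]
drifted_judge = lambda q, d: 3                 # scores everything mid-scale
ok = recalibration_check(drifted_judge, seed)  # -> False: MSE = (4+1)/2 = 2.5
```

Because the seed is human-reviewed and fixed, a failing check isolates drift to the judge side rather than the labels.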
- Cross-modality extensibility. Labeling pipeline explicitly positioned as the shared mechanism that scales across documents / images / video / messages / chat as Dash expands — each domain has different relevance semantics but the human-calibration + LLM-scale-up shape is reusable.
Numbers / concrete details¶
- Label scale multiplier: ~100× human → LLM labels (post diagram).
- Relevance scale: 1–5 graded, context-dependent.
- MSE range: 0 (exact agreement) to 16 (maximum disagreement) on the 1–5 scale.
- Ranker algorithm: XGBoost-class learning-to-rank (gradient-boosted trees).
- Training set size (LLM-generated): "hundreds of thousands—or even millions" of labels.
- Human-review scope: "limited, non-sensitive internal datasets" only; no customer data.
- Rubric dimensions: single scalar relevance score (1–5). No multi-dimensional rubric like Datadog's trajectory eval.
- Prompt-optimisation framework named: DSPy (dspy.ai), described as "a library for programmatically optimizing LLM prompts against defined evaluation targets."
- MSE improvement chart described (not reproduced): successive drops from (1) prompt refinement, (2) reasoning-optimised model, (3) query-context tooling, (4) DSPy automation.
Caveats¶
- No pre/post MSE numbers disclosed. Chart is described but axes values not given in the article text. Companion [[sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash|2026-01-28 transcript]] gave ~8% starting-point disagreement count; this post adds the MSE-measurement shape but not the numbers.
- Ranker architecture not detailed. XGBoost class named; no details on feature engineering, feature store integration (see systems/dash-feature-store), update cadence, or online vs offline re-ranking.
- Human-labeler size + workflow not given. "Small group" is the only quantifier.
- ACL + privacy mechanics glossed. Repeated statement that no customer data is human-reviewed; mechanics of how the LLM handles customer content within compliance boundaries not detailed.
- Judge-tool inventory not enumerated. "Tools to research query context" is asserted but the tool list (which indices, which retrieval surface, which identity resolver) is not specified.
- Cross-modal expansion is forward-looking. Images / video / messages / chat named as expansion targets but no architectural description of how relevance labeling differs for each.
Relationship to existing wiki pages¶
- Upstream of sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash in the narrative: this post covers the training-data labeling loop that funds the judge; the 2026-01-28 transcript covers the judge-against-candidate evaluation loop built on top. Same components (DSPy, RAG as a judge, LLM judges), different end of the pipeline.
- Extends systems/dash-search-index — this is the ranker layer downstream of the hybrid index; the labeling pipeline is how that ranker gets trained.
- Promotes systems/dropbox-dash's learning-to-rank substrate from an implicit component into a named systems/dash-relevance-ranker page.
- Extends concepts/llm-as-judge with the MSE-on-graded-scale scoring shape and the explicit judge-as-labeler framing (judge generates training data, not just scores candidates).
- Extends concepts/rag-as-a-judge — Clemm's 2026-01-28 framing had the judge retrieve work-context; this post says the judge is given tools (active research surface), a stronger version.
- Sibling to patterns/prompt-optimizer-flywheel — same DSPy loop, MSE-on-labels as the objective instead of disagreement-bullet-set minimisation.
Source¶
- Original: https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash
- Raw markdown:
raw/dropbox/2026-02-26-using-llms-to-amplify-human-labeling-and-improve-dash-search-7d84109a.md
Related¶
- systems/dropbox-dash
- systems/dash-search-index
- systems/dash-relevance-ranker
- systems/dspy
- concepts/llm-as-judge
- concepts/rag-as-a-judge
- concepts/relevance-labeling
- concepts/ndcg
- patterns/human-calibrated-llm-labeling
- patterns/behavior-discrepancy-sampling
- patterns/judge-query-context-tooling
- patterns/prompt-optimizer-flywheel
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash
- sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai