Dash Relevance Ranker¶
Definition¶
Dash Relevance Ranker is the learning-to-rank model that scores each candidate document against a query inside Dash's unified search index and produces the top-K ordering passed to the answering LLM. The 2026-02-26 Dropbox Tech post names it as XGBoost-class (gradient-boosted trees) — "trained using machine learning techniques such as XGBoost rather than manually tuned rules" (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search).
Previously described indirectly on systems/dash-search-index as "multiple ranking passes; per-query + per-user relevance combining lexical match, vector similarity, and graph-derived signals. Personalized and ACL'd to you." (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash). This page is the primary landing page for the ranker itself.
Where it sits¶
Pipeline (Dash retrieval, annotated):
query
│
▼
Dash Search Index (hybrid BM25 + dense vectors + knowledge-bundles)
│ returns candidate set
▼
Dash Relevance Ranker (XGBoost; per-(query,doc) features) ◄── features from Dash Feature Store
│ scores + orders
▼
Top-K → answering LLM's context window
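The pipeline above can be sketched minimally: retrieve candidates, score each (query, doc) feature vector, keep the top-K for the LLM's context. All names below (`Candidate`, `rank_candidates`, `toy_score`) are illustrative stand-ins, not Dropbox's code; in production the scorer would be the trained gradient-boosted-tree model.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    features: list[float]  # per-(query, doc) features from the feature store

def rank_candidates(candidates, score_fn, k=10):
    """Score each candidate, order by score, keep the top-K for the LLM context."""
    scored = [(score_fn(c.features), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Stand-in scorer; in production this is the trained GBDT model's predict().
def toy_score(features):
    return sum(features)

candidates = [
    Candidate("doc-a", [0.9, 0.2]),
    Candidate("doc-b", [0.1, 0.8]),
    Candidate("doc-c", [0.7, 0.7]),
]
top = rank_candidates(candidates, toy_score, k=2)
```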
Training signal¶
The ranker is trained on graded 1–5 relevance labels (concepts/relevance-labeling) over (query, document) pairs. The labels are not hand-produced at scale — they come from the human-calibrated LLM labeling pipeline:
- Small human-labeled seed set (internal, non-sensitive data only).
- LLM judge calibrated against the seed set via MSE on the 1–5 scale (each squared error lies in 0–16).
- Calibrated judge produces hundreds of thousands to millions of labels.
- Labels train XGBoost on (query, doc) features.
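The calibration step above reduces to a simple computation: MSE between the judge's scores and the human seed labels on the 1–5 scale, where each squared error lies in [0, 16]. A minimal sketch, with hypothetical seed data and judge configurations:

```python
def calibration_mse(judge_scores, human_labels):
    """Mean squared error of an LLM judge against the human seed set.
    Labels are on a 1-5 scale, so each squared error is in [0, 16]."""
    assert len(judge_scores) == len(human_labels)
    return sum((j - h) ** 2 for j, h in zip(judge_scores, human_labels)) / len(human_labels)

# Hypothetical seed set: two candidate judge configurations scored against humans.
human = [5, 4, 2, 1, 3]
judge_a = [5, 3, 2, 2, 3]   # close to the human grading
judge_b = [1, 5, 5, 4, 1]   # systematically off

mse_a = calibration_mse(judge_a, human)
mse_b = calibration_mse(judge_b, human)
# The better-calibrated judge (lower MSE) is promoted to label at scale.
```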
Production NDCG is the model-quality metric; the ranker is iterated by measuring NDCG on held-out judge-labeled slices.
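NDCG on a held-out judge-labeled slice can be computed directly from the graded 1–5 labels. A minimal sketch using the standard exponential-gain formulation (not necessarily Dropbox's exact variant):

```python
import math

def dcg(labels):
    """Discounted cumulative gain with exponential gains (2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(labels))

def ndcg_at_k(ranked_labels, k):
    """NDCG@k for one query: graded 1-5 labels of the docs, in model order."""
    ideal = sorted(ranked_labels, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_labels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Judge-assigned 1-5 labels, in the order the ranker returned the docs.
model_order = [3, 5, 2, 4, 1]
score = ndcg_at_k(model_order, k=5)  # 1.0 only if the ranker's order is ideal
```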
Features (inferred)¶
Not exhaustively enumerated in any single source. Named signals across the Dash posts:
- Lexical match via BM25 scores.
- Dense-vector similarity via embedding scores (concepts/hybrid-retrieval-bm25-vectors).
- Knowledge-graph-derived signals — people / activity / content edges re-projected through the hybrid index as "knowledge bundles" (patterns/canonical-entity-id, patterns/precomputed-relevance-graph).
- User personalisation features — identity-conditioned re-ranking; specific features loaded via the Dash Feature Store in sub-100ms (p95 ~25–35ms) per query across thousands of parallel lookups.
- ACL-conditioned filtering — hard constraint applied at ranking time; not a soft feature.
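The signal families above form the per-(query, doc) feature vector, while ACL enforcement sits outside the model as a hard pre-filter. A sketch of that split, with purely illustrative names and values (Dropbox has not published its feature schema):

```python
def assemble_features(bm25, embedding_sim, graph_signal, personalization):
    """Concatenate the named signal families into one feature vector."""
    return [bm25, embedding_sim, graph_signal, personalization]

def rank_with_acl(user, candidates, score_fn):
    """ACL is a hard constraint: documents the user cannot see are dropped
    before scoring, never merely down-weighted as a soft feature."""
    visible = [c for c in candidates if user in c["acl"]]
    return sorted(visible, key=lambda c: score_fn(c["features"]), reverse=True)

candidates = [
    {"doc": "roadmap", "acl": {"alice"}, "features": assemble_features(2.1, 0.83, 0.4, 0.9)},
    {"doc": "payroll", "acl": {"bob"},   "features": assemble_features(3.0, 0.91, 0.7, 0.2)},
]
ranked = rank_with_acl("alice", candidates, score_fn=sum)
# "payroll" never appears for alice, regardless of its higher raw scores.
```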
Why XGBoost, not an LLM, at query time¶
Explicit framing in the labeling post:
"Using LLMs directly at query time to replace traditional ranking models is not currently feasible due to context window limitations and latency constraints. Instead, Dash uses LLMs offline to generate high-quality training data."
The LLM is the teacher; XGBoost is the student that runs at serving time: the classical split between training-time and serving-time models in ML systems.
Relationship to the labeling pipeline¶
The ranker's quality is bounded by the quality of its relevance labels. Three levers in the labeling pipeline feed ranker quality directly:
- patterns/human-calibrated-llm-labeling — the overall shape (seed → calibrate → scale).
- patterns/behavior-discrepancy-sampling — which labels get human review.
- patterns/judge-query-context-tooling — judge armed with retrieval tools so organisation-specific queries are labeled correctly.
All three are described in the 2026-02-26 labeling post; the 2026-01-28 transcript adds the DSPy flywheel that closes the loop with automated prompt tuning.
Cross-modal expansion¶
Current ranker: text-centric (documents, messages, snippets). Dash's forward plans (from the post): extend to images, video, messages, chat. Each modality encodes relevance differently and may need its own feature extractors + sub-model. The labeling pipeline is explicitly positioned as the shared mechanism that scales across modalities:
"Human-calibrated LLM evaluation provides a shared mechanism for adapting relevance judgments across modalities without rebuilding labeling pipelines or redefining evaluation criteria from scratch."
Caveats¶
- Feature list incomplete. Dropbox has not published a fully-enumerated feature spec.
- Training cadence not disclosed. No information on retraining frequency, online vs offline, or A/B rollout methodology for new ranker versions.
- No latency numbers for the ranker itself. The feature-store budget is p95 25–35ms for feature fetches; how much of the remaining sub-100ms per-query budget the XGBoost model consumes is not stated.
- Multi-pass architecture hinted, not described. "Multiple ranking passes" appears in the 2026-01-28 transcript — likely coarse + fine re-ranker stages — but the stage boundary isn't specified.
- Not confirmed as literal XGBoost. The 2026-02-26 post says "such as XGBoost" — the actual production model may be a related gradient-boosted-tree implementation (LightGBM, CatBoost, custom), not stock XGBoost.
Seen in¶
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — primary source. Names XGBoost; states learning-from-examples framing; states the LLM-at-query-time-infeasible boundary; positions the labeling pipeline as the quality bottleneck on the ranker and therefore on Dash answers.
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — "multiple ranking passes, personalized and ACL'd"; NDCG named as scoring metric; knowledge-graph-derived signals lifted NDCG.
- sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — ranker as the mechanism that filters context for the agent; the pre-filtering discipline underneath concepts/context-engineering.
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — feature-serving latency budget (sub-100ms, p95 25–35ms) inside which the ranker operates.
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — upstream judge-adaptation angle on the ranker's training pipeline. The relevance judge whose labels train this ranker was adapted across o3 / gpt-oss-120b / gemma-3-12b via DSPy GEPA / MIPROv2. Concrete deltas: NMSE cut 45% on gpt-oss-120b; 10–100× more training labels at the same cost. The ranker's training set is therefore both larger and more human-aligned than the prior-generation judge could produce at the same budget. NDCG impact on the ranker itself is not reported in this post: the chain NMSE-on-judge → label quality → ranker NDCG has one reported step (NMSE) and one unreported step (NDCG).
Related¶
- systems/dash-search-index — the retrieval surface whose candidates this ranker orders.
- systems/dash-feature-store — the online feature store supplying per-query features inside the sub-100ms budget.
- systems/dropbox-dash — the product the ranker serves.
- concepts/relevance-labeling — the training signal.
- concepts/ndcg — the quality metric.
- concepts/llm-as-judge — source of the training labels.
- patterns/human-calibrated-llm-labeling — the labeling pipeline feeding training.