NDCG (Normalized Discounted Cumulative Gain)¶
Normalized Discounted Cumulative Gain (NDCG) is a standard information-retrieval metric for ranked list quality. It measures how well a ranked retrieval result matches the ideal ordering, weighted so that swaps near the top of the list cost more than swaps deep in the list.
Mechanical summary¶
- Relevance labels — each retrieved item gets a relevance score (0, 1, 2, 3, … or continuous). Labels usually come from human judgments or — increasingly — LLM judges.
- DCG (Discounted Cumulative Gain) — sum the relevance of each position, discounted by rank: DCG@k = Σ_{i=1..k} rel_i / log2(i + 1). Closer to the top = higher weight.
- Ideal DCG (IDCG) — what the DCG would be if the list were perfectly sorted by relevance.
- NDCG = DCG / IDCG. Normalized to [0, 1]; 1.0 is a perfect ranking.
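The mechanics above fit in a few lines of Python (a minimal sketch; function names are illustrative, not from any source):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions.
    i is 0-based, so rank i+1 maps to a discount of log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG = DCG / IDCG; 1.0 means the ranking is already ideal."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded labels (3 = highly relevant) in retrieved order:
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # ≈ 0.972: near-ideal, but rank 2/3 are swapped
```

Sorting the same labels descending gives NDCG = 1.0 by construction, which is why the metric is comparable across queries with different numbers of relevant documents.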
Why top-of-list weighting matters¶
For agent retrieval, the top-K results enter the concepts/agent-context-window. A ranking error near the top of the list — say, dropping a relevant doc from position 3 to position 50 — costs far more than pushing one from position 50 to 500: the agent likely never sees either deep position because both are outside the context budget anyway. NDCG's logarithmic discount captures this asymmetry directly; flat metrics like precision@K or recall@K don't.
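The asymmetry is easy to verify numerically (standalone sketch; the 500-item setup is a made-up illustration): with a single relevant document, the NDCG lost by pushing it from the top to position 50 dwarfs the additional loss from position 50 to 500.

```python
import math

def ndcg(rels, k):
    """NDCG@k for a list of graded relevance labels in retrieved order."""
    dcg = lambda rs: sum(r / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    return dcg(rels) / dcg(sorted(rels, reverse=True))

# One relevant doc (rel=1) among 500 total results; `at` builds the
# ranking with the relevant doc placed at the given 1-based position.
n = 500
at = lambda pos: [0] * (pos - 1) + [1] + [0] * (n - pos)

print(ndcg(at(1), n))    # relevant doc on top -> 1.0
print(ndcg(at(50), n))   # buried at 50  -> ≈ 0.176
print(ndcg(at(500), n))  # buried at 500 -> ≈ 0.111
```

The drop from rank 1 to rank 50 (~0.82) is more than ten times the further drop from 50 to 500 (~0.065), mirroring the argument above.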
Usage at Dash¶
Dash uses NDCG to score retrieval results during iteration:
"We use normalized discounted cumulative gain (NDCG) a lot to score the results to retrieve. But just by doing this people-based result we saw some really nice wins."
— Josh Clemm, Dropbox Dash (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
The "people-based result" point is that adding knowledge-graph edges (people + their activities + their docs) to the ranker lifted NDCG — concrete evidence that graph-derived signals pay off on a standard IR metric, not just on subjective quality bars.
When to use NDCG vs alternatives¶
- NDCG — graded relevance (0/1/2/3/…), want to reward near-perfect ordering, care about top-K disproportionately.
- MRR (Mean Reciprocal Rank) — binary "is there one correct answer"; good for single-answer queries like "which doc is the spec?".
- Recall@K / Precision@K — binary relevance, don't care about within-top-K order; weaker signal for agent retrieval.
- MAP (Mean Average Precision) — binary relevance with order mattering; NDCG generalizes this to graded relevance.
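The contrast between NDCG and the flat metrics can be shown with a toy comparison (a sketch with illustrative helper names): two rankings that retrieve the same documents score identically on recall@3 and MRR, while NDCG penalizes the one that demotes the best document.

```python
import math

def ndcg_at_k(rels, k):
    dcg = lambda rs: sum(r / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    return dcg(rels) / dcg(sorted(rels, reverse=True))

def recall_at_k(rels, k):
    # Binary view: any graded label > 0 counts as relevant.
    total = sum(1 for r in rels if r > 0)
    return sum(1 for r in rels[:k] if r > 0) / total

def mrr(rels):
    # Reciprocal rank of the first relevant item (binary view again).
    return next((1 / (i + 1) for i, r in enumerate(rels) if r > 0), 0.0)

good = [3, 2, 1, 0, 0]   # best doc ranked first
worse = [1, 2, 3, 0, 0]  # same docs, best one demoted to rank 3

print(recall_at_k(good, 3), recall_at_k(worse, 3))  # both 1.0 — can't tell them apart
print(mrr(good), mrr(worse))                        # both 1.0 — ditto
print(ndcg_at_k(good, 3), ndcg_at_k(worse, 3))      # 1.0 vs ≈ 0.79 — NDCG sees the demotion
```

This is the practical reason graded, order-sensitive metrics win for agent retrieval: within-top-K ordering determines what actually lands in the context window.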
Tradeoffs¶
- Requires graded relevance labels. Producing them is expensive; hence concepts/llm-as-judge automation.
- Sensitive to label noise. If judges disagree (human or LLM), NDCG's nuance becomes false precision. Pair with alignment studies.
- Not a product metric. NDCG correlates with user satisfaction but isn't identical. Complement with click-through / task-success / retention signals.
- Doesn't capture diversity. Two near-duplicate top hits versus one top hit plus one diverse follow-up can score similarly on NDCG but feel very different to a user or agent.
Seen in¶
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — named metric for Dash's retrieval iteration; specifically cited as the scoring basis for the "people-based" knowledge-graph wins.
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — NDCG positioned as the production model-quality metric for Dash's XGBoost ranker. The graded concepts/relevance-labeling|1–5 relevance labels NDCG consumes come from the human-calibrated LLM labeling pipeline, so NDCG validity is downstream of label quality: label pipeline → trusted labels → trusted NDCG → trusted ranker iteration.
Related¶
- concepts/llm-as-judge — the typical source of graded labels in modern agentic retrieval systems.
- concepts/knowledge-graph — additional ranking signal whose impact is measured against NDCG.
- patterns/precomputed-relevance-graph — the production realization; NDCG is the eval metric that determines whether graph signals actually improved ranking.
- systems/dash-search-index — where the ranker lives that is measured against NDCG.