NDCG (Normalized Discounted Cumulative Gain)¶
Normalized Discounted Cumulative Gain (NDCG) is a standard information-retrieval metric for ranked list quality. It measures how well a ranked retrieval result matches the ideal ordering, weighted so that swaps near the top of the list cost more than swaps deep in the list.
Mechanical summary¶
- Relevance labels — each retrieved item gets a relevance score (0, 1, 2, 3, … or continuous). Labels usually come from human judgments or — increasingly — LLM judges.
- DCG (Discounted Cumulative Gain) — sum the relevance of each position, discounted by rank: DCG@k = Σ_{i=1..k} rel_i / log2(i + 1). Closer to the top = higher weight.
- Ideal DCG (IDCG) — what the DCG would be if the list were perfectly sorted by relevance.
- NDCG = DCG / IDCG. Normalized to [0, 1]; 1.0 is a perfect ranking.
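The mechanics above fit in a few lines of Python (a minimal sketch; function names are illustrative, not from any source):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions.
    i is 0-based, so rank i+1 maps to a discount of log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG = DCG / IDCG; 1.0 means the ranking is already ideal."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded labels (3 = highly relevant) in retrieved order:
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # ≈ 0.972: near-ideal, but rank 2/3 are swapped
```

Sorting the same labels descending gives NDCG = 1.0 by construction, which is why the metric is comparable across queries with different numbers of relevant documents.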
Why top-of-list weighting matters¶
For agent retrieval, the top-K results enter the concepts/agent-context-window. A ranking error near the top of the list — say, dropping a relevant doc from position 3 to position 50 — costs far more than pushing one from position 50 to 500: the agent likely never sees either deep position because both are outside the context budget anyway. NDCG's logarithmic discount captures this asymmetry directly; flat metrics like precision@K or recall@K don't.
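The asymmetry is easy to verify numerically (standalone sketch; the 500-item setup is a made-up illustration): with a single relevant document, the NDCG lost by pushing it from the top to position 50 dwarfs the additional loss from position 50 to 500.

```python
import math

def ndcg(rels, k):
    """NDCG@k for a list of graded relevance labels in retrieved order."""
    dcg = lambda rs: sum(r / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    return dcg(rels) / dcg(sorted(rels, reverse=True))

# One relevant doc (rel=1) among 500 total results; `at` builds the
# ranking with the relevant doc placed at the given 1-based position.
n = 500
at = lambda pos: [0] * (pos - 1) + [1] + [0] * (n - pos)

print(ndcg(at(1), n))    # relevant doc on top -> 1.0
print(ndcg(at(50), n))   # buried at 50  -> ≈ 0.176
print(ndcg(at(500), n))  # buried at 500 -> ≈ 0.111
```

The drop from rank 1 to rank 50 (~0.82) is more than ten times the further drop from 50 to 500 (~0.065), mirroring the argument above.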
Usage at Dash¶
Dash uses NDCG to score retrieval results during iteration:
"We use normalized discounted cumulative gain (NDCG) a lot to score the results to retrieve. But just by doing this people-based result we saw some really nice wins."
— Josh Clemm, Dropbox Dash (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
The "people-based result" point is that adding knowledge-graph edges (people + their activities + their docs) to the ranker lifted NDCG — concrete evidence that graph-derived signals pay off on a standard IR metric, not just on subjective quality bars.
When to use NDCG vs alternatives¶
- NDCG — graded relevance (0/1/2/3/…), want to reward near-perfect ordering, care about top-K disproportionately.
- MRR (Mean Reciprocal Rank) — binary "is there one correct answer"; good for single-answer queries like "which doc is the spec?".
- Recall@K / Precision@K — binary relevance, don't care about within-top-K order; weaker signal for agent retrieval.
- MAP (Mean Average Precision) — binary relevance with order mattering; NDCG generalizes this to graded relevance.
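The contrast between NDCG and the flat metrics can be shown with a toy comparison (a sketch with illustrative helper names): two rankings that retrieve the same documents score identically on recall@3 and MRR, while NDCG penalizes the one that demotes the best document.

```python
import math

def ndcg_at_k(rels, k):
    dcg = lambda rs: sum(r / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    return dcg(rels) / dcg(sorted(rels, reverse=True))

def recall_at_k(rels, k):
    # Binary view: any graded label > 0 counts as relevant.
    total = sum(1 for r in rels if r > 0)
    return sum(1 for r in rels[:k] if r > 0) / total

def mrr(rels):
    # Reciprocal rank of the first relevant item (binary view again).
    return next((1 / (i + 1) for i, r in enumerate(rels) if r > 0), 0.0)

good = [3, 2, 1, 0, 0]   # best doc ranked first
worse = [1, 2, 3, 0, 0]  # same docs, best one demoted to rank 3

print(recall_at_k(good, 3), recall_at_k(worse, 3))  # both 1.0 — can't tell them apart
print(mrr(good), mrr(worse))                        # both 1.0 — ditto
print(ndcg_at_k(good, 3), ndcg_at_k(worse, 3))      # 1.0 vs ≈ 0.79 — NDCG sees the demotion
```

This is the practical reason graded, order-sensitive metrics win for agent retrieval: within-top-K ordering determines what actually lands in the context window.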
Tradeoffs¶
- Requires graded relevance labels. Producing them is expensive; hence concepts/llm-as-judge automation.
- Sensitive to label noise. If judges disagree (human or LLM), NDCG's nuance becomes false precision. Pair with alignment studies.
- Not a product metric. NDCG correlates with user satisfaction but isn't identical. Complement with click-through / task-success / retention signals.
- Doesn't capture diversity. Two near-duplicate top hits versus one top hit plus one diverse follow-up can score similarly on NDCG but feel very different to a user or agent.
Seen in¶
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — named metric for Dash's retrieval iteration; specifically cited as the scoring basis for the "people-based" knowledge-graph wins.
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — NDCG positioned as the production model-quality metric for Dash's XGBoost ranker. The graded concepts/relevance-labeling|1–5 relevance labels NDCG consumes come from the human-calibrated LLM labeling pipeline, so NDCG validity is downstream of label quality: label pipeline → trusted labels → trusted NDCG → trusted ranker iteration.
Related¶
- concepts/llm-as-judge — the typical source of graded labels in modern agentic retrieval systems.
- concepts/knowledge-graph — additional ranking signal whose impact is measured against NDCG.
- patterns/precomputed-relevance-graph — the production realization; NDCG is the eval metric that determines whether graph signals actually improved ranking.
- systems/dash-search-index — where the ranker lives that is measured against NDCG.