

NDCG (Normalized Discounted Cumulative Gain)

Normalized Discounted Cumulative Gain (NDCG) is a standard information-retrieval metric for ranked list quality. It measures how well a ranked retrieval result matches the ideal ordering, weighted so that swaps near the top of the list cost more than swaps deep in the list.

Mechanical summary

  1. Relevance labels — each retrieved item gets a relevance score (0, 1, 2, 3, … or continuous). Labels usually come from human judgments or — increasingly — LLM judges.
  2. DCG (Discounted Cumulative Gain). Sum the relevance of each position, discounted by rank: DCG_k = sum_{i=1..k} rel_i / log_2(i + 1). Closer-to-top = higher weight.
  3. Ideal DCG (IDCG). What the DCG would be if the list were perfectly sorted by relevance.
  4. NDCG = DCG / IDCG. Normalized to [0, 1]; 1.0 is a perfect ranking.
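The four steps above can be sketched directly from the formula (a minimal reference implementation, not a library API; real evaluations usually use an existing implementation such as scikit-learn's `ndcg_score`):

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) for 1-indexed rank i."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """NDCG = DCG / IDCG, where IDCG re-sorts the same labels best-first."""
    idcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / idcg if idcg > 0 else 0.0

# Graded labels for a retrieved list, in retrieval order:
print(ndcg([3, 2, 1, 0]))  # 1.0 -- already perfectly sorted
print(ndcg([2, 3, 0, 1]))  # < 1.0 -- best doc (rel=3) demoted to rank 2
```

Note that NDCG only consumes the label sequence; the documents themselves never enter the computation.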

Why top-of-list weighting matters

For agent retrieval, only the top-K results enter the concepts/agent-context-window. An error that demotes a relevant doc from position 1 to position 5 costs far more than one that demotes a doc from position 50 to position 500 — the agent likely never sees either of the latter pair, because both fall outside the context budget anyway. NDCG captures this asymmetry directly; flat metrics like precision@K or recall@K don't.
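A toy calculation makes the asymmetry concrete. With a single relevant document (rel=1, everything else rel=0), IDCG is 1/log2(2) = 1, so NDCG reduces to 1/log2(pos + 1):

```python
import math

def ndcg_single_relevant(pos):
    """NDCG when the only relevant doc (rel=1) sits at 1-indexed rank `pos`."""
    # IDCG = 1 / log2(2) = 1.0 (the ideal list puts the doc at rank 1).
    return (1.0 / math.log2(pos + 1)) / 1.0

# Demoting the doc near the top costs an order of magnitude more NDCG
# than shuffling it around deep in the list:
print(ndcg_single_relevant(1) - ndcg_single_relevant(5))    # ~0.61
print(ndcg_single_relevant(50) - ndcg_single_relevant(500)) # ~0.06
```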

Usage at Dash

Dash uses NDCG to score retrieval results during iteration:

"We use normalized discounted cumulative gain (NDCG) a lot to score the results to retrieve. But just by doing this people-based result we saw some really nice wins."

— Josh Clemm, Dropbox Dash (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)

The "people-based result" refers to adding knowledge-graph edges (people, their activities, their docs) to the ranker, which lifted NDCG — concrete evidence that graph-derived signals pay off on a standard IR metric, not just on subjective quality bars.

When to use NDCG vs alternatives

  • NDCG — graded relevance (0/1/2/3/…), want to reward near-perfect ordering, care about top-K disproportionately.
  • MRR (Mean Reciprocal Rank) — binary "is there one correct answer"; good for single-answer queries like "which doc is the spec?".
  • Recall@K / Precision@K — binary relevance, don't care about within-top-K order; weaker signal for agent retrieval.
  • MAP (Mean Average Precision) — binary relevance with order mattering; NDCG generalizes this to graded relevance.
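The practical difference shows up when two rankings contain the same relevant docs in the top K but in different orders: precision@K is identical, while NDCG separates them. A sketch (toy labels, same `ndcg` definition as above):

```python
import math

def ndcg(rels):
    """NDCG over a full list of graded relevance labels in retrieval order."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, 1))
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(sorted(rels, reverse=True), 1))
    return dcg / idcg if idcg else 0.0

def precision_at_k(rels, k):
    """Fraction of the top-k items with any positive relevance (binary view)."""
    return sum(1 for r in rels[:k] if r > 0) / k

a = [3, 1, 1, 0]  # best doc (rel=3) first
b = [1, 1, 3, 0]  # best doc buried at rank 3
print(precision_at_k(a, 3), precision_at_k(b, 3))  # identical: 1.0 1.0
print(ndcg(a), ndcg(b))                            # NDCG tells them apart
```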

Tradeoffs

  • Requires graded relevance labels. Producing them is expensive; hence concepts/llm-as-judge automation.
  • Sensitive to label noise. If judges disagree (human or LLM), NDCG's nuance becomes false precision. Pair with alignment studies.
  • Not a product metric. NDCG correlates with user satisfaction but isn't identical. Complement with click-through / task-success / retention signals.
  • Doesn't capture diversity. Two near-duplicate top hits vs one top hit + one diverse follow-up can NDCG-score similarly but feel very different to a user / agent.
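The diversity blind spot follows directly from the definition: NDCG consumes only per-item labels, so any two rankings whose label sequences match score identically, whatever the documents are. A sketch with hypothetical filenames (the judge scores each doc in isolation, so a near-duplicate earns the same label as a complementary doc):

```python
import math

def ndcg(rels):
    """NDCG sees only the per-item relevance labels, never the items themselves."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, 1))
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(sorted(rels, reverse=True), 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical judgments; per-doc grading cannot encode redundancy.
redundant = [("spec-v2.pdf", 3), ("spec-v2-copy.pdf", 3), ("unrelated.txt", 0)]
diverse   = [("spec-v2.pdf", 3), ("design-rationale.md", 3), ("unrelated.txt", 0)]
print(ndcg([r for _, r in redundant]) == ndcg([r for _, r in diverse]))  # True
```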
