CONCEPT Cited by 2 sources
RAG as a judge¶
RAG as a judge is the pattern of letting an LLM judge fetch its own context — via retrieval — before scoring a candidate response, rather than relying only on its pre-trained knowledge. The judge becomes a mini-RAG agent whose job is relevance scoring.
Why it matters¶
Classical LLM-as-judge assumes the judge already knows what it needs to evaluate. That assumption breaks in work domains:
"A big problem with using an LLM as a judge in a work context is that it doesn't know things like acronyms. If I were to say, 'What is RAG?'—and hopefully it knows what RAG is—what if it hasn't been trained on that? Sometimes, the judge needs to go get that context. And so, this is a little tongue-in-cheek, but we call this RAG as a judge. It can't just be using pre-computed information. Sometimes it has to go fetch some context itself."
— Josh Clemm, Dropbox Dash (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
Concrete failure without it¶
At Dropbox, an acronym like "RAG" could mean:
- "Retrieval-augmented generation" (external world)
- Some Dropbox-internal status code or project name (unknown to the base model)
- Both, depending on the user and the context
A judge relying only on its pre-training defaults to the public meaning and scores wrong when the correct answer is the internal meaning. The judge needs to resolve the term against the same context surface the candidate response was drawn from.
Mechanism¶
- Judge receives candidate. Question + retrieved docs + model response + evaluation rubric.
- Judge detects unknowns. Acronyms, project names, internal terms, people the judge hasn't been trained on.
- Judge retrieves context. Issues retrieval queries against the same knowledge base (or a judge-specific subset) to resolve the unknowns.
- Judge scores with resolved context. Now has enough domain grounding to apply the rubric correctly.
The key asymmetry: the candidate's RAG retrieves to answer the user's question; the judge's RAG retrieves to resolve the evaluation context. Same retrieval substrate, different query shape.
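The four mechanism steps can be sketched as a minimal loop. Everything here is illustrative: the glossary dict stands in for the shared retrieval substrate, the all-caps heuristic stands in for the judge's unknown-term detection, and the scoring rule stands in for a real rubric-driven LLM call.

```python
from dataclasses import dataclass

# Stand-in for the shared knowledge base; a real judge would query the
# same index the candidate's RAG used (hypothetical entry, not Dash data).
KNOWLEDGE_BASE = {
    "RAG": "Internal: Red/Amber/Green weekly project-status report.",
}

@dataclass
class Candidate:
    question: str
    response: str

def detect_unknowns(candidate: Candidate) -> set[str]:
    """Step 2: flag all-caps tokens as terms the judge may not know."""
    tokens = (candidate.question + " " + candidate.response).split()
    return {t.strip("?.,'\"") for t in tokens if t.strip("?.,'\"").isupper()}

def resolve(terms: set[str]) -> dict[str, str]:
    """Step 3: judge-issued retrieval to ground the unknowns."""
    return {t: KNOWLEDGE_BASE[t] for t in terms if t in KNOWLEDGE_BASE}

def judge(candidate: Candidate) -> tuple[float, dict[str, str]]:
    """Steps 1 and 4: receive the candidate, score with resolved context.
    Stand-in rubric: response must match the *internal* meaning when the
    retrieved grounding says the term is internal."""
    context = resolve(detect_unknowns(candidate))
    term_is_internal = any("Internal" in d for d in context.values())
    response_is_internal = "status" in candidate.response.lower()
    score = 1.0 if term_is_internal == response_is_internal else 0.0
    return score, context

cand = Candidate(question="What is RAG?", response="A weekly status report.")
score, context = judge(cand)
```

Note the query-shape asymmetry in miniature: the candidate would retrieve to answer "What is RAG?", while `judge` retrieves only to resolve the term "RAG" before applying the rubric.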
Where it fits in Dash's disagreement-reduction arc¶
Dash reports a baseline and four cumulative improvements, each lowering judge-vs-human disagreement:
- Baseline judge prompt. ~8% disagreement.
- Prompt refinement ("provide explanations for what you're doing"). Lower.
- Stronger model — upgraded to OpenAI o3 (reasoning model). Lower.
- RAG as a judge. Lower still.
- DSPy on top. Lower still.
RAG-as-a-judge is specifically the jump that addresses work-context vocabulary — unknowns the base reasoning model simply didn't have. (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
Tradeoffs¶
- Judge cost grows. Every evaluation now carries a retrieval budget plus a judge-LLM-call budget; batch evals run 2–3× more expensive.
- Judge can retrieve wrong context. If the judge's retrieval is weak, it may pull unrelated docs and score against them; then the judge becomes its own bottleneck.
- Judge drift. Retrieval model + index version both affect scores; snapshot both alongside judge model version.
- Safety blind spot unchanged. "Can the judge retrieve the right grounding" is not the same as "is the recommended action safe." Action safety still needs separate guardrails.
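The drift tradeoff implies every judge score should be pinned to the versions that produced it. A minimal sketch of such a record, with illustrative field names (not Dash's actual schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class JudgeScoreRecord:
    """One eval-log entry pinning everything that can move a judge score."""
    candidate_id: str
    score: float
    judge_model: str      # judge LLM version, e.g. a dated model snapshot
    retrieval_model: str  # embedding / retrieval model version
    index_snapshot: str   # knowledge-base index version
    rubric_version: str   # evaluation rubric version

rec = JudgeScoreRecord(
    candidate_id="q-123",
    score=0.9,
    judge_model="o3",
    retrieval_model="embed-v2",
    index_snapshot="idx-2026-01-28",
    rubric_version="rubric-v4",
)
line = json.dumps(asdict(rec))  # one append-only eval-log line
```

Diffing score distributions across records with different `index_snapshot` or `retrieval_model` values then separates genuine quality changes from judge drift.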
Relationship to classical LLM-as-judge¶
RAG-as-a-judge is a strict extension: classical judge plus a retrieval step before scoring. For domains where the candidate's response depends on internal-only knowledge, it closes a gap the classical shape leaves open.
Seen in¶
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — named by Josh Clemm as a pragmatic Dash addition to the judge loop; the "acronym problem" is the canonical example.
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — extends the pattern to the tool-using agentic form, patterns/judge-query-context-tooling. The 2026-01-28 transcript had the judge retrieve context with a single RAG call; the 2026-02-26 post states that "Dash provides LLMs with tools that allow them to research query context before assigning relevance labels" — a judge armed with tools (search queries, entity resolution) rather than a single RAG call. Canonical example: "diet sprite" is an internal Dropbox performance-management tool, not a soft drink; the judge has to issue additional searches to disambiguate before scoring. The post positions this as letting the judge "go deeper than human evaluators would in practice."
Related¶
- concepts/llm-as-judge — parent pattern.
- patterns/judge-query-context-tooling — tool-using generalisation; the judge is given retrieval + resolution tools rather than a single RAG call.
- patterns/human-calibrated-llm-labeling — the outer labeling loop at Dash where RAG-as-a-judge sits.
- patterns/prompt-optimizer-flywheel — Dash's full quality loop, with DSPy-driven bullet-point disagreement reduction on top of RAG-as-judge.
- systems/dash-search-index — the retrieval substrate the judge re-uses.
- concepts/knowledge-graph — the graph's canonical-ID resolution is particularly helpful to the judge when the ambiguous term is a person / project / doc.