PATTERN Cited by 2 sources

Judge query-context tooling

Judge query-context tooling is the pattern of giving an LLM judge active retrieval tools (search queries, lookups, knowledge-base access) and letting it research the evaluation context before scoring — mimicking what a human evaluator does when they hit an unfamiliar acronym. It is a tool-using extension of concepts/rag-as-a-judge.

The problem

When an LLM judge scores relevance in a work domain, the query vocabulary is often organisation-specific:

  • Internal project names. "What's the Darwin status?" — Darwin could be a framework, a task force, a datacenter.
  • Product-internal acronyms. Many acronyms have distinct public and internal meanings.
  • Tribal knowledge. Mailing-list nicknames, team shorthand.

The canonical Dropbox example:

"Within Dropbox, the term 'diet sprite' refers to an internal performance management tool rather than a soft drink, a distinction that can be difficult for LLMs to infer without additional context." (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search)

A judge applying pre-training defaults scores against the public meaning → systematic mis-rating for a whole class of queries.

The mechanism

Instead of handing the judge a static retrieved-context block, expose tools:

  1. Query-the-index tools. Same search interface the ranker uses, but callable by the judge on its own initiative.
  2. Acronym / entity resolution tools. Lookup against a knowledge-graph (e.g. people / projects / docs), resolution of internal terms to canonical entities.
  3. Follow-up-search tools. Given an ambiguous result, run refinement queries.
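The three tool types above might be exposed to the judge as a function-calling schema. A minimal sketch — the tool names (`search_index`, `resolve_entity`, `refine_search`) and parameter shapes are illustrative assumptions, not details from the Dash post:

```python
# Hypothetical tool inventory exposed to the judge. Names and schemas are
# illustrative (not from the Dash post); the shape follows the common
# JSON function-calling convention.
JUDGE_TOOLS = [
    {
        # 1. Query-the-index: same search interface the ranker uses.
        "name": "search_index",
        "description": "Run a query against the production search index.",
        "parameters": {"query": "string", "top_k": "integer"},
    },
    {
        # 2. Acronym / entity resolution against the knowledge graph.
        "name": "resolve_entity",
        "description": "Resolve an internal term to a canonical entity "
                       "(person, project, doc).",
        "parameters": {"term": "string"},
    },
    {
        # 3. Follow-up search: refine an ambiguous result.
        "name": "refine_search",
        "description": "Re-run a query with added disambiguating terms.",
        "parameters": {"query": "string", "added_terms": "array"},
    },
]
```

Keeping this inventory to retrieval and resolution only (no write or action tools) also serves the tool-surface-minimisation point raised under tradeoffs below.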

Framing from the post:

"Dash provides LLMs with tools that allow them to research query context before assigning relevance labels. Once the LLM understands the user's intent, it can apply consistent, context-aware relevance labeling across large candidate result sets, often going deeper than human evaluators would in practice."

The judge becomes a mini-agent: disambiguate the query, resolve internal terms, then apply the rubric.
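That mini-agent loop can be sketched as follows, with the model and tools stubbed out — `llm_step`, the tool dictionary, and the step budget are all assumptions for illustration, not Dash's implementation:

```python
def judge_relevance(query, candidate, tools, llm_step, max_steps=5):
    """Mini-agent judge: disambiguate, resolve internal terms, then score.

    `llm_step` stands in for one LLM round. It returns either
    ("call", tool_name, args) to request a tool call, or
    ("score", label) to emit the final relevance label.
    """
    context = []  # evidence gathered by the judge's own tool calls
    for _ in range(max_steps):
        action = llm_step(query, candidate, context)
        if action[0] == "score":
            return action[1], context
        _, tool_name, args = action
        context.append((tool_name, tools[tool_name](**args)))
    return None, context  # step budget exhausted without a verdict


# Toy run: the stub "LLM" resolves 'diet sprite' once, then applies the rubric.
def stub_llm(query, candidate, context):
    if not context:
        return ("call", "resolve_entity", {"term": "diet sprite"})
    return ("score", "relevant")

tools = {"resolve_entity":
         lambda term: f"{term} = internal performance management tool"}
label, trace = judge_relevance(
    "diet sprite status", "perf dashboard doc", tools, stub_llm)
```

The key contrast with static RAG-as-a-judge is visible in the loop: retrieval happens inside the judge's control flow, on its own initiative, not as a fixed pre-scoring step.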

Relationship to concepts/rag-as-a-judge

Both patterns give the judge retrieval. The distinction:

| Aspect               | RAG-as-a-judge                         | Judge query-context tooling                               |
| -------------------- | -------------------------------------- | --------------------------------------------------------- |
| Retrieval shape      | Single RAG query before scoring        | Multi-step tool-using agent loop                          |
| Scope of retrieval   | Fixed: "fetch context for this query"  | Flexible: judge decides what to look up                   |
| Failure it addresses | Judge doesn't know the term            | Judge doesn't know how to disambiguate until it explores  |
| Implementation       | Static: one retrieval call             | Dynamic: tool inventory exposed to judge                  |

The 2026-01-28 Dash transcript introduced RAG-as-a-judge; the 2026-02-26 labeling post extends it — the judge is now given tools (plural, agentic) and goes deeper than a single RAG call.

Why it can out-perform human evaluators on some dimensions

Post-stated claim:

"often going deeper than human evaluators would in practice."

Human evaluators stop researching when it gets costly. An LLM with cheap tool calls can exhaustively disambiguate every ambiguous token in the query and candidate docs, producing more uniformly grounded scores than a human evaluator who eyeballs it.

Tradeoffs

  • Judge cost grows substantially. Every eval becomes an agent trace: multiple tool calls + LLM-reasoning rounds. Batch evals can be 3–5× more expensive than single-call judging.
  • Tool-selection failures propagate. If the judge picks a bad retrieval query, it may pull unrelated context and score against it — now the judge is its own retrieval bottleneck. Cf. concepts/tool-selection-accuracy — same failure mode as agentic retrieval, inherited by agentic judging.
  • Tool-inventory bloat. Tool surface minimisation applies: too many tools → judge picks wrong one. Keep the judge tool set narrow (retrieval + resolution, not arbitrary actions).
  • Latency. Not usually a concern because judge loops run offline, but trace length matters for throughput of training label generation.
  • Still doesn't solve safety. Grounded judgment ≠ safe action. Action safety remains a separate concern.

When to reach for it

  • You're applying concepts/llm-as-judge over work-context content with significant internal vocabulary.
  • Static RAG-as-a-judge still shows systematic under-performance on organisation-specific queries (acronyms, project names, people).
  • You already have a retrieval surface the judge can share with candidates (Dash re-uses its main search index).
  • Offline-only evaluation — serving-path latency is not the constraint.
