PATTERN Cited by 2 sources

Judge query-context tooling

Judge query-context tooling is the pattern of giving an LLM judge active retrieval tools (search queries, lookups, knowledge-base access) and letting it research the evaluation context before scoring — mimicking what a human evaluator does when they hit an unfamiliar acronym. It is a tool-using extension of concepts/rag-as-a-judge.

The problem

When an LLM judge scores relevance in a work domain, the query vocabulary is often organisation-specific:

  • Internal project names. "What's the Darwin status?" — Darwin could be a framework, a task force, a datacenter.
  • Product-internal acronyms. Many acronyms have distinct public and internal meanings.
  • Tribal knowledge. Mailing-list nicknames, team shorthand.

The canonical Dropbox example:

"Within Dropbox, the term 'diet sprite' refers to an internal performance management tool rather than a soft drink, a distinction that can be difficult for LLMs to infer without additional context." (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search)

A judge applying pre-training defaults scores against the public meaning → systematic mis-rating for a whole class of queries.

The mechanism

Instead of handing the judge a static retrieved-context block, expose tools:

  1. Query-the-index tools. Same search interface the ranker uses, but callable by the judge on its own initiative.
  2. Acronym / entity resolution tools. Lookup against a knowledge-graph (e.g. people / projects / docs), resolution of internal terms to canonical entities.
  3. Follow-up-search tools. Given an ambiguous result, run refinement queries.
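The three tool types above might be exposed to the judge as a function-calling schema. A minimal sketch — the tool names (`search_index`, `resolve_entity`, `refine_search`) and parameter shapes are illustrative assumptions, not details from the Dash post:

```python
# Hypothetical tool inventory exposed to the judge. Names and schemas are
# illustrative (not from the Dash post); the shape follows the common
# JSON function-calling convention.
JUDGE_TOOLS = [
    {
        # 1. Query-the-index: same search interface the ranker uses.
        "name": "search_index",
        "description": "Run a query against the production search index.",
        "parameters": {"query": "string", "top_k": "integer"},
    },
    {
        # 2. Acronym / entity resolution against the knowledge graph.
        "name": "resolve_entity",
        "description": "Resolve an internal term to a canonical entity "
                       "(person, project, doc).",
        "parameters": {"term": "string"},
    },
    {
        # 3. Follow-up search: refine an ambiguous result.
        "name": "refine_search",
        "description": "Re-run a query with added disambiguating terms.",
        "parameters": {"query": "string", "added_terms": "array"},
    },
]
```

Keeping this inventory to retrieval and resolution only (no write or action tools) also serves the tool-surface-minimisation point raised under tradeoffs below.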

Framing from the post:

"Dash provides LLMs with tools that allow them to research query context before assigning relevance labels. Once the LLM understands the user's intent, it can apply consistent, context-aware relevance labeling across large candidate result sets, often going deeper than human evaluators would in practice."

The judge becomes a mini-agent: disambiguate the query, resolve internal terms, then apply the rubric.
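That mini-agent loop can be sketched as follows, with the model and tools stubbed out — `llm_step`, the tool dictionary, and the step budget are all assumptions for illustration, not Dash's implementation:

```python
def judge_relevance(query, candidate, tools, llm_step, max_steps=5):
    """Mini-agent judge: disambiguate, resolve internal terms, then score.

    `llm_step` stands in for one LLM round. It returns either
    ("call", tool_name, args) to request a tool call, or
    ("score", label) to emit the final relevance label.
    """
    context = []  # evidence gathered by the judge's own tool calls
    for _ in range(max_steps):
        action = llm_step(query, candidate, context)
        if action[0] == "score":
            return action[1], context
        _, tool_name, args = action
        context.append((tool_name, tools[tool_name](**args)))
    return None, context  # step budget exhausted without a verdict


# Toy run: the stub "LLM" resolves 'diet sprite' once, then applies the rubric.
def stub_llm(query, candidate, context):
    if not context:
        return ("call", "resolve_entity", {"term": "diet sprite"})
    return ("score", "relevant")

tools = {"resolve_entity":
         lambda term: f"{term} = internal performance management tool"}
label, trace = judge_relevance(
    "diet sprite status", "perf dashboard doc", tools, stub_llm)
```

The key contrast with static RAG-as-a-judge is visible in the loop: retrieval happens inside the judge's control flow, on its own initiative, not as a fixed pre-scoring step.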

Relationship to concepts/rag-as-a-judge

Both patterns give the judge retrieval. The distinction:

| Aspect               | RAG-as-a-judge                         | Judge query-context tooling                               |
| -------------------- | -------------------------------------- | --------------------------------------------------------- |
| Retrieval shape      | Single RAG query before scoring        | Multi-step tool-using agent loop                          |
| Scope of retrieval   | Fixed: "fetch context for this query"  | Flexible: judge decides what to look up                   |
| Failure it addresses | Judge doesn't know the term            | Judge doesn't know how to disambiguate until it explores  |
| Implementation       | Static: one retrieval call             | Dynamic: tool inventory exposed to judge                  |

The 2026-01-28 Dash transcript introduced RAG-as-a-judge; the 2026-02-26 labeling post extends it — the judge is now given tools (plural, agentic) and goes deeper than a single RAG call.

Why it can out-perform human evaluators on some dimensions

Post-stated claim:

"often going deeper than human evaluators would in practice."

Human evaluators stop researching when it gets costly. An LLM with cheap tool calls can exhaustively disambiguate every ambiguous token in the query and candidate docs, producing more uniformly grounded scores than a human evaluator who eyeballs it.

Tradeoffs

  • Judge cost grows substantially. Every eval becomes an agent trace: multiple tool calls + LLM-reasoning rounds. Batch evals can be 3–5× more expensive than single-call judging.
  • Tool-selection failures propagate. If the judge picks a bad retrieval query, it may pull unrelated context and score against it — now the judge is its own retrieval bottleneck. Cf. concepts/tool-selection-accuracy — same failure mode as agentic retrieval, inherited by agentic judging.
  • Tool-inventory bloat. Tool surface minimisation applies: too many tools → judge picks wrong one. Keep the judge tool set narrow (retrieval + resolution, not arbitrary actions).
  • Latency. Not usually a concern because judge loops run offline, but trace length matters for throughput of training label generation.
  • Still doesn't solve safety. Grounded judgment ≠ safe action. Action safety remains a separate concern.

When to reach for it

  • You're applying concepts/llm-as-judge over work-context content with significant internal vocabulary.
  • Static RAG-as-a-judge still shows systematic under-performance on organisation-specific queries (acronyms, project names, people).
  • You already have a retrieval surface the judge can share with candidates (Dash re-uses its main search index).
  • Offline-only evaluation — serving-path latency is not the constraint.
