
CONCEPT Cited by 6 sources

LLM as Judge

LLM-as-judge is the evaluation pattern in which one LLM scores another model's (or agent's) output against a rubric — accuracy, helpfulness, policy adherence, structural correctness — replacing or augmenting human evaluation during iteration on non-deterministic components.

Why it matters

LLM outputs and agent trajectories are non-deterministic: the same prompt can produce different answers across runs. Classical unit/integration tests with exact-match assertions become flaky or over-constrained. A judge LLM gives a structural score ("did this response correctly identify the slow query? did it recommend a safe action?") that tolerates wording variation while catching true regressions.
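A minimal sketch of the pattern, with a stubbed model call (the rubric, score scale, and `call_llm` function are illustrative assumptions, not any particular vendor's API):

```python
import json

# Illustrative rubric; a production rubric would be far more specific.
RUBRIC = (
    "Score the response 1-5 on accuracy, helpfulness, and policy adherence. "
    'Reply with JSON: {"accuracy": n, "helpfulness": n, "policy": n, "reason": "..."}'
)

def call_llm(system: str, user: str) -> str:
    # Stand-in for a real judge-model call; returns a canned verdict here.
    return '{"accuracy": 5, "helpfulness": 4, "policy": 5, "reason": "ok"}'

def judge(prompt: str, response: str, threshold: float = 4.0) -> dict:
    """Score one candidate response against the rubric with a judge LLM."""
    raw = call_llm(RUBRIC, f"Prompt:\n{prompt}\n\nResponse:\n{response}")
    scores = json.loads(raw)
    numeric = [v for v in scores.values() if isinstance(v, (int, float))]
    scores["pass"] = sum(numeric) / len(numeric) >= threshold
    return scores

verdict = judge("Which query is slow?", "The unindexed scan on orders.")
```

The point is the shape, not the stub: the judge returns a structured verdict that tolerates wording variation in the candidate while still yielding a pass/fail signal.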

Typical usage

  • Regression harness. Snapshot production-state inputs → replay through candidate agent configs → judge scores new outputs against reference.
  • Leaderboard evaluation. Score many model/prompt/tool combinations against a shared rubric to pick the next rollout candidate.
  • Per-response observability. Post-hoc judge scoring on production traffic to flag degrading responses.
  • Trajectory scoring. Score the full agent trace (tool calls + intermediate reasoning + final answer) against a ground-truth RCA with a depth / completeness / telemetry-surfaced rubric. Surfaces partial credit and reasoning-shape regressions that terminal-answer scoring misses.
  • Up-funnel label-pipeline validation. Judge-style scoring applied before evaluation — confidence scores on agent-generated candidate labels, with sub-threshold labels filtered to human review. See patterns/agent-assisted-label-validation. Datadog shipped this on top of Bits AI SRE as the mechanism that cut label-validation time >95%.
  • Judge-as-labeler for training data. LLM judge scores (query, document) pairs on a graded relevance scale (concepts/relevance-labeling); the scored pairs become the supervised training set for a downstream learning-to-rank model (patterns/human-calibrated-llm-labeling). Humans don't label the training set — humans label a small seed set that calibrates the judge. Dropbox Dash's production pattern (see systems/dash-relevance-ranker); force multiplier ~100×.
  • Plan-sufficiency judgment inside the generation loop. Judge scores whether the current plan is adequate at each step of an agent's inner loop, gating add-or-fix refinement — not scoring a completed trajectory or output post-hoc. Google Research's DS-STAR is the canonical wiki instance: the Verifier agent "is an LLM-based judge prompted to determine if the current plan is adequate"; on reject, a Router agent chooses add-vs-fix on the plan and the loop repeats (up to 10 rounds). See patterns/planner-coder-verifier-router-loop and concepts/iterative-plan-refinement. This is load-bearing specifically for open-ended problems "that lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct."
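The regression-harness usage above can be sketched as follows; `run_agent` and `judge_score` are stubs standing in for the candidate config and the judge model:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    case_id: str
    inputs: str      # frozen production-state input
    reference: str   # known-good output for the judge to compare against

def run_agent(config: dict, inputs: str) -> str:
    # Stub: a real harness would replay inputs through the candidate config.
    return f"[{config['name']}] answer for {inputs}"

def judge_score(candidate: str, reference: str) -> float:
    # Stub: a real judge LLM would score candidate vs reference + rubric.
    return 1.0 if "answer" in candidate else 0.0

def replay(snapshots, config, pass_threshold=0.8):
    """Replay frozen inputs through a candidate config and judge each output."""
    scores = {s.case_id: judge_score(run_agent(config, s.inputs), s.reference)
              for s in snapshots}
    regressions = [cid for cid, sc in scores.items() if sc < pass_threshold]
    return scores, regressions

snaps = [Snapshot("slow-query-1", "db state A", "index scan on orders")]
scores, regressions = replay(snaps, {"name": "candidate-v2"})
```

The same loop serves the leaderboard usage by iterating `replay` over many candidate configs and ranking by aggregate score.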

Tradeoffs / gotchas

  • The judge has its own biases. A judge LLM can reward verbose or confident-sounding wrong answers. Rubrics must be specific enough to resist these modes.
  • Judge drift. When the judge model is updated, past scores are not directly comparable. Snapshot the judge version alongside the run.
  • Cost. Every replay requires both the candidate and the judge; a batch eval costs roughly 2× the inference of running the candidate alone.
  • Not a safety proof. Judging "did the response look good" is not the same as "was the recommended action safe". Mutating actions still need guardrails outside the judge loop.
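One mitigation for the judge-drift gotcha, sketched under assumed field names: persist the judge model version and a rubric hash alongside every score, and only compare scores recorded under the same pair.

```python
import hashlib
import time

RUBRIC = "accuracy, helpfulness, policy adherence (1-5 each)"

def record_score(case_id: str, score: float, judge_model: str) -> dict:
    """Attach the exact judge version and rubric hash to every stored score."""
    return {
        "case_id": case_id,
        "score": score,
        "judge_model": judge_model,  # pin the exact model version
        "rubric_sha": hashlib.sha256(RUBRIC.encode()).hexdigest()[:12],
        "ts": time.time(),
    }

def comparable(a: dict, b: dict) -> bool:
    # Scores are only directly comparable under the same judge + rubric.
    return a["judge_model"] == b["judge_model"] and a["rubric_sha"] == b["rubric_sha"]

r1 = record_score("case-1", 4.5, "judge-2026-01")
r2 = record_score("case-1", 3.9, "judge-2026-04")  # judge updated since r1
```

A score delta between `r1` and `r2` is then visibly uninterpretable rather than silently misleading.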

Seen in

  • sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks' systems/storex uses a snapshot-replay harness scored by a judge LLM as its primary regression signal during prompt / tool iteration, referencing MLflow 3's judges documentation.
  • sources/2026-04-07-datadog-bits-ai-sre-eval-platform — Datadog's Bits AI SRE evaluation platform uses judge-style scoring in two places: (a) as the rubric-based scorer over agent trajectories (trajectory evaluation, pass@k, final-answer correctness across tens of thousands of scenarios); (b) as an up-funnel confidence scorer on generated labels themselves — the patterns/agent-assisted-label-validation pipeline passes candidate ground-truth RCAs through confidence scoring across thoroughness / specificity / accuracy dimensions, flagging sub-threshold labels for human review. Alignment studies with human judges gated the trust transition; label-validation time dropped >95% once this was operational.

  • sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dropbox Dash's LLM-as-judge iteration arc is a canonical "retrieval-relevance judge" case. A named four-step disagreement-reduction sequence, each step cumulatively lowering judge-vs-human disagreement: (1) baseline judge prompt — ~8% disagreement. (2) prompt refinement — "provide explanations for what you're doing" (classic chain-of-thought framing) — disagreement lower. (3) stronger model — upgraded to OpenAI's o3 reasoning model — lower still. (4) RAG as a judge — let the judge fetch work context for domain-specific acronyms / terminology it wasn't trained on ("What is RAG?" might mean something internal to Dropbox that the model hasn't seen) — lower still. DSPy is then layered on top via the patterns/prompt-optimizer-flywheel (bullet-pointed disagreements → optimizer → reduced-disagreement prompt) — "quite pleased with some of the results." LLM-as-judge is explicitly called out as a prime DSPy target because judges have "crystal clear rubrics and evals… you know exactly what the outcome should be." Also: Dash uses NDCG scored against judge / human labels as the retrieval quality metric.

  • sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — companion to the Clemm transcript, covering the LLM-as-labeler side: LLM judges generate hundreds of thousands to millions of relevance labels used to train Dash's XGBoost ranker (systems/dash-relevance-ranker). Adds (a) Mean Squared Error on the 1–5 graded scale (per-pair squared error range 0–16) as the named judge-vs-human agreement metric — small disagreements get small penalty, wide disagreements get quadratically larger penalty; (b) the human-calibrated LLM labeling pipeline (patterns/human-calibrated-llm-labeling) — humans label a seed set, calibrate the judge, the judge amplifies ~100×; (c) patterns/behavior-discrepancy-sampling — sample for human review the cases where user behaviour disagrees with the LLM's judgment (clicks on low-rated results / skips on high-rated results); (d) patterns/judge-query-context-tooling — the judge is given retrieval tools to research query context before scoring (canonical example: "diet sprite" = internal performance-management tool, not a drink), generalising concepts/rag-as-a-judge from single-retrieval to tool-using agent. DSPy named as prompt-optimisation framework minimising MSE. Explicit framing: "LLMs make it possible to apply human judgment consistently and at scale, rather than replacing it."
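The agreement metric named above is plain mean squared error on the 1–5 scale; a worked sketch showing the quadratic penalty:

```python
def mse(judge_labels, human_labels):
    """Mean squared error between judge and human labels on the 1-5 scale."""
    errs = [(j - h) ** 2 for j, h in zip(judge_labels, human_labels)]
    return sum(errs) / len(errs)

# Per-pair squared error spans 0 (full agreement) to 16 (1 vs 5):
# a one-point disagreement costs 1, a four-point disagreement costs 16.
close = mse([4, 3, 5], [5, 3, 5])  # one off-by-one disagreement
wide = mse([1, 3, 5], [5, 3, 5])   # one off-by-four disagreement
```

The quadratic shape is what makes the metric tolerant of near-misses while hammering the confidently wrong labels that would poison a training set.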

  • sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — model-adaptation edition: the same Dash relevance judge retargeted across o3 / gpt-oss-120b / gemma-3-12b via DSPy. Introduces two quality axes as first-class peers for an LLM judge: (a) alignment — NMSE on a 1–5 scale rescaled to 0–100 (cut 8.83 → 4.86, −45% on gpt-oss-120b); (b) operational reliability — JSON validity rate (malformed-JSON rate cut 42% → <1.1% on gemma-3-12b, a 97%+ reduction, with malformed output counted as fully incorrect). DSPy's GEPA optimiser drives alignment via feedback strings; DSPy's MIPROv2 optimiser drives both axes simultaneously on small models. Adds three usage regimes by risk tolerance: patterns/cross-model-prompt-adaptation (full rewrite, cheap targets), patterns/instruction-library-prompt-composition (constrained bullet-selection, production o3), and the fixed-model patterns/prompt-optimizer-flywheel. Formalises overfitting guardrails (invariants live in the feedback string). Cycle time: 1–2 weeks manual → 1–2 days with DSPy. Label coverage: 10–100× at same cost on cheaper targets.

  • sources/2025-11-06-google-ds-star-versatile-data-science-agent — plan-sufficiency edition: the judge sits inside the agent's generation loop, not in an eval harness. Google Research's DS-STAR uses a Verifier LLM to score whether the current plan is adequate after each Coder execution; the loop extends or repairs the plan (via a separate Router agent) on reject and exits on approval or at 10 rounds. The primitive is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" — the judge is the only feasible structural check. Empirical round-count distribution on DABStep: 3.0 avg rounds on easy tasks (>50% single-round) vs 5.6 on hard tasks; state-of-the-art accuracy (45.2% on DABStep vs 41.0% prior best). This extends the wiki's LLM-as-judge framing along a new axis — in the loop that generates the thing being judged, co-evolving with the plan rather than scoring a fixed artefact. Canonical realisation is patterns/planner-coder-verifier-router-loop; loop-level concept is concepts/iterative-plan-refinement; termination discipline is concepts/refinement-round-budget.
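The in-loop shape DS-STAR describes can be sketched as below; the Verifier, Router, and plan steps are stubs, and only the loop structure (judge-gated refinement with a round budget) follows the source:

```python
MAX_ROUNDS = 10  # DS-STAR's refinement round budget

def verifier(plan) -> bool:
    # Stub for the LLM judge: is the current plan adequate?
    return len(plan) >= 3

def router(plan) -> str:
    # Stub for the Router agent: extend the plan ("add") or repair it ("fix")?
    return "add"

def refine(task: str):
    """Judge-gated inner loop: refine the plan until approved or budget hit."""
    plan = [f"load data for {task}"]
    for round_no in range(1, MAX_ROUNDS + 1):
        if verifier(plan):            # judge approves -> exit the loop
            return plan, round_no
        if router(plan) == "add":     # judge rejects -> extend or repair
            plan.append(f"step {len(plan) + 1}")
        else:
            plan[-1] += " (fixed)"
    return plan, MAX_ROUNDS          # budget exhausted

plan, rounds = refine("fee analysis")
```

Note the contrast with the harness sketches earlier: here the judge gates each iteration of generation rather than scoring a finished artefact.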

  • sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — coordinator judge-pass edition: the judge is the coordinator agent itself consolidating output from N specialised sub-reviewers. After the seven sub-reviewers return structured XML findings, the coordinator (running on the top model tier — Opus 4.7 / GPT-5.4) performs a judge pass that: (1) deduplicates findings surfaced by multiple reviewers; (2) re-categorises — a perf bug flagged by code-quality moves to the perf section; (3) applies a reasonableness filter dropping speculative issues, nitpicks, false positives, and convention-contradicted findings; (4) when uncertain, uses its tools to read the source code and verify before emitting. Distinguished from prior LLM-as-judge instances because the judge is (a) wrapped inside a production pipeline with GitLab actions as its downstream effects, (b) scored in production by how often humans invoke break glass (0.6% of MRs), and (c) responsible for writing a natural-language verdict that ships as the single GitLab review comment. Canonical wiki instance of the coordinator-pattern variant of LLM-as-judge (see patterns/coordinator-sub-reviewer-orchestration).
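A sketch of such a coordinator judge pass (field names, the confidence threshold, and the recategorisation rule are illustrative, not Cloudflare's schema):

```python
def judge_pass(findings, min_confidence=0.7):
    """Consolidate sub-reviewer findings: dedupe, recategorise, filter."""
    # 1. Deduplicate findings surfaced by multiple reviewers
    #    (same file + line + issue text).
    seen, unique = set(), []
    for f in findings:
        key = (f["file"], f["line"], f["issue"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    # 2. Re-categorise: e.g. a perf issue flagged by code-quality
    #    moves to the performance section.
    for f in unique:
        if "O(n^2)" in f["issue"] or "slow" in f["issue"]:
            f["category"] = "performance"
    # 3. Reasonableness filter: drop speculative / low-confidence findings.
    return [f for f in unique if f.get("confidence", 0) >= min_confidence]

findings = [
    {"file": "a.py", "line": 10, "issue": "O(n^2) loop",
     "category": "code-quality", "confidence": 0.9},
    {"file": "a.py", "line": 10, "issue": "O(n^2) loop",
     "category": "performance", "confidence": 0.8},   # duplicate report
    {"file": "b.py", "line": 3, "issue": "maybe rename var",
     "category": "style", "confidence": 0.3},          # speculative nitpick
]
kept = judge_pass(findings)
```

Step (4) from the source — reading the code to verify uncertain findings — is omitted here; in production it is what separates a filtering pass from a judging one.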
