CONCEPT Cited by 13 sources
LLM as Judge
LLM-as-judge is the evaluation pattern in which one LLM scores another model's (or agent's) output against a rubric — accuracy, helpfulness, policy adherence, structural correctness — replacing or augmenting human evaluation during iteration on non-deterministic components.
Why it matters
LLM outputs and agent trajectories are non-deterministic: the same prompt can produce different answers across runs. Classical unit/integration tests with exact-match assertions become flaky or over-constrained. A judge LLM gives a structural score ("did this response correctly identify the slow query? did it recommend a safe action?") that tolerates wording variation while catching true regressions.
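A minimal sketch of the difference between an exact-match assertion and a rubric-based structural score. The rubric keys, the example responses, and `keyword_stub_judge` are all hypothetical; the stub stands in for a real judge-LLM call so the example runs without a model.

```python
from typing import Callable

# Hypothetical rubric: each criterion is a yes/no question put to the judge.
RUBRIC = {
    "identified_slow_query": "Did the response correctly identify the slow query?",
    "safe_action": "Did the response recommend a safe action?",
}

# A judge is any callable (question, response) -> bool. In practice it would
# wrap an LLM call; it is injected here so the sketch is runnable as-is.
Judge = Callable[[str, str], bool]


def score_response(response: str, judge: Judge) -> dict[str, bool]:
    """Ask the judge each rubric question about one response."""
    return {name: judge(question, response) for name, question in RUBRIC.items()}


def keyword_stub_judge(question: str, response: str) -> bool:
    """Stand-in for an LLM judge: crude keyword checks, for illustration only."""
    text = response.lower()
    if "slow query" in question:
        return "orders_by_user" in text
    return "drop table" not in text


# Two runs of the same prompt, worded differently (made-up outputs).
a = "The slow query is orders_by_user; add an index on user_id."
b = "An index on user_id would fix orders_by_user, which is the slow one."

assert a != b  # an exact-match assertion would flag this as a regression
scores_a = score_response(a, keyword_stub_judge)
scores_b = score_response(b, keyword_stub_judge)
assert scores_a == scores_b  # the rubric scores agree despite the wording change
```

Keeping the judge behind a plain callable also makes it straightforward to swap judge models or replay recorded verdicts in tests.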
Typical usage
- Regression harness. Snapshot production-state inputs → replay through candidate agent configs → judge scores new outputs against reference.
- Leaderboard evaluation. Score many model/prompt/tool combinations against a shared rubric to pick the next rollout candidate.
- Per-response observability. Post-hoc judge scoring on production traffic to flag degrading responses.
- Trajectory scoring. Score the full agent trace (tool calls + intermediate reasoning + final answer) against a ground-truth RCA with a depth / completeness / telemetry-surfaced rubric. Surfaces partial credit and reasoning-shape regressions that terminal-answer scoring misses.
- Up-funnel label-pipeline validation. Judge-style scoring applied before evaluation — confidence scores on agent-assisted-generated labels, filtering sub-threshold labels to human review. See patterns/agent-assisted-label-validation. Datadog shipped this on top of Bits AI SRE as the mechanism that cut label-validation time >95%.
- Judge-as-labeler for training data. LLM judge scores (query, document) pairs on a graded relevance scale (concepts/relevance-labeling); the scored pairs become the supervised training set for a downstream learning-to-rank model (patterns/human-calibrated-llm-labeling). Humans don't label the training set — humans label a small seed set that calibrates the judge. Dropbox Dash's production pattern (see systems/dash-relevance-ranker); force multiplier ~100×.
- Plan-sufficiency judgment inside the generation loop. Judge scores whether the current plan is adequate at each step of an agent's inner loop, gating add-or-fix refinement — not scoring a completed trajectory or output post-hoc. Google Research's DS-STAR is the canonical wiki instance: the Verifier agent "is an LLM-based judge prompted to determine if the current plan is adequate"; on reject, a Router agent chooses add-vs-fix on the plan and the loop repeats (up to 10 rounds). See patterns/planner-coder-verifier-router-loop and concepts/iterative-plan-refinement. This is load-bearing specifically for open-ended problems "that lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct."
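The plan-sufficiency usage in the last bullet can be sketched as a judge-gated loop. This is an illustration under stated assumptions, not DS-STAR's implementation: the function and parameter names are hypothetical, the four agents are injected as plain callables, and "fix" is simplified to replacing the last plan step. Only the 10-round budget and the add-vs-fix routing come from the description above.

```python
from typing import Callable

MAX_ROUNDS = 10  # round budget from the DS-STAR description above

Plan = list[str]


def refine_plan(
    plan: Plan,
    execute: Callable[[Plan], str],            # Coder: runs the plan, returns a result
    is_adequate: Callable[[Plan, str], bool],  # Verifier: the LLM judge
    route: Callable[[Plan, str], str],         # Router: returns "add" or "fix"
    propose_step: Callable[[Plan, str], str],  # Planner: proposes a new step
) -> tuple[Plan, int, bool]:
    """Return (final plan, rounds used, whether the judge approved)."""
    for round_no in range(1, MAX_ROUNDS + 1):
        result = execute(plan)
        if is_adequate(plan, result):
            return plan, round_no, True
        step = propose_step(plan, result)
        if route(plan, result) == "add":
            plan = plan + [step]       # extend the plan
        else:
            plan = plan[:-1] + [step]  # "fix": replace the last step (simplification)
    return plan, MAX_ROUNDS, False     # budget exhausted without approval


# Usage with stub agents: the stub judge approves once the plan has three steps.
final, rounds, approved = refine_plan(
    plan=["load data"],
    execute=lambda p: f"{len(p)} steps run",
    is_adequate=lambda p, r: len(p) >= 3,
    route=lambda p, r: "add",
    propose_step=lambda p, r: f"step {len(p) + 1}",
)
```

The explicit round budget matters because the judge is the loop's only exit condition besides exhaustion; without it, a judge that never approves would spin indefinitely.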
Tradeoffs / gotchas
- The judge has its own biases. A judge LLM can reward verbose or confident-sounding wrong answers. Rubrics must be specific enough to resist these failure modes.
- Judge drift. When the judge model is updated, past scores are not directly comparable. Snapshot the judge version alongside the run.
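- Cost. Every replayed example needs a candidate call and at least one judge call, so a batch eval roughly doubles the number of inference calls; the cost multiple depends on the judge model's size relative to the candidate's.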
- Not a safety proof. Judging "did the response look good" is not the same as "was the recommended action safe". Mutating actions still need guardrails outside the judge loop.
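One way to act on the judge-drift point is to store the judge identity with every scored run and refuse to compare runs scored by different judges. The sketch below is an assumption about how such a record could look; `JudgedRun`, its fields, and the model identifiers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgedRun:
    run_id: str
    judge_model: str       # pinned judge identifier (hypothetical values below)
    rubric_version: str    # the rubric prompt is versioned alongside the model
    scores: tuple[float, ...]


def score_delta(baseline: JudgedRun, candidate: JudgedRun) -> float:
    """Mean score difference; raises if the two runs used different judges."""
    same_judge = (baseline.judge_model, baseline.rubric_version) == (
        candidate.judge_model,
        candidate.rubric_version,
    )
    if not same_judge:
        raise ValueError("runs were scored by different judge versions; re-score first")
    return mean(candidate.scores) - mean(baseline.scores)


base = JudgedRun("run-1", "judge-v1", "rubric-3", (1.0, 0.0, 1.0, 1.0))
cand = JudgedRun("run-2", "judge-v1", "rubric-3", (1.0, 1.0, 1.0, 1.0))
delta = score_delta(base, cand)  # comparable: same judge and rubric
```

When the judge is upgraded, re-scoring the stored baseline outputs with the new judge restores comparability at the cost of one extra judge pass.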
Seen in
-
sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Production-monitoring-against-concept-drift face with conservative pass/fail/unknown ternary, in an Entity Resolution pipeline at 17M+ asset scale. Third major LLM-as-judge face on the wiki: not eval-harness for agent iteration (systems/storex) and not judge-as-labeler for ranking-data (systems/dash-relevance-ranker) but the continuous-evaluation substrate against concept drift in production, with the judge model deliberately conservative. Cited at two altitudes in the source: (a) MLflow-anchored: "We implemented a comprehensive evaluation strategy using 'LLM as a Judge' alongside manual labeling sessions. MLflow capabilities allowed us to constantly evaluate model performance to prevent concept drift."; (b) inline-the-CSAF-ETL: "To keep this pipeline reliable at scale, we use an LLM as a Judge approach to continuously score the quality of our own LLM outputs. Instead of relying only on fully labeled ground truth — which is often missing or ambiguous in real-world CPS data — we let a dedicated judge model review another model's response and decide whether it looks acceptable. The judge's job is simple and conservative: mark each result as pass, looks correct, fail, looks wrong, or unknown, not enough information." Three structural moves: (1) the pass/fail/unknown ternary is explicit — not a continuous score and not a binary — so ambiguous cases route to "unknown" rather than being forced into pass-or-fail (especially load-bearing when ground-truth is missing or ambiguous); (2) judge outputs persist in Delta tables so historical judge calls are queryable for drift detection and version comparison; (3) custom MLflow GenAI judges run structured evaluations giving "a consistent way to monitor quality, compare versions, and catch regressions across many LLM use cases — without building a bespoke evaluation stack for every new workflow." 
Composes with patterns/llm-judge-as-inline-pipeline-stage (synchronous inline-the-pipeline shape) and patterns/hybrid-classical-er-plus-genai (Entity Resolution at scale). Canonical wiki instance: systems/claroty-cps-library.
-
sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks' systems/storex uses a snapshot-replay harness scored by a judge LLM as its primary regression signal during prompt / tool iteration, referencing MLflow 3's judges documentation.
-
sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — Instacart's LACE (LLM-Assisted Chatbot Evaluation) canonicalises three production judge-design moves: (1) binary True/False scoring beats graded 1-10 on alignment + prompt-engineering cost (session-level scores aggregate from per-criterion binaries); (2) three evaluation engines compared — direct prompting / agentic reflection / agentic debate — debate wins for context-dependent + simple criteria (Customer + Support + Judge sub-agents, parallel critics with no cross-talk); (3) decouple reasoning from structured-output formatting — strong reasoner (o1-preview at writing time) produces free-form rationale, separate step emits JSON; escapes restricted-decoding quality loss. Five-dimension rubric (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) with three-tier complexity grouping (simple / context-dependent / subjective); "near-perfect accuracy" on simple criteria, >90% accuracy on context-dependent (business-model knowledge embedded in static prompt template; RAG-retrieval is future work); subjective criteria de-prioritised ("low-ROI" to refine, keep as directional check only, fix the chatbot rather than the judge). Bootstrap + regression via human-LACE alignment loop — same structural shape as Dropbox Dash human-calibrated labelling, different object (judge-criteria-prompt refinement vs. judge-as-labeller). Production deployment: stratified topic sampling → dashboards → direct experimentation-platform feedback loop. Parallel-play sibling to Lyft AI localisation and Zalando search AI-as-judge — three Tier-2/3 companies shipping the same LLM-as-judge → dashboard → experimentation architecture to different customer-facing surfaces at the same time.
-
sources/2026-04-07-datadog-bits-ai-sre-eval-platform — Datadog's Bits AI SRE evaluation platform uses judge-style scoring in two places: (a) as the rubric-based scorer over agent trajectories (trajectory evaluation, pass@k, final-answer correctness across tens of thousands of scenarios); (b) as an up-funnel confidence scorer on generated labels themselves — the patterns/agent-assisted-label-validation pipeline passes candidate ground-truth RCAs through confidence scoring across thoroughness / specificity / accuracy dimensions, flagging sub-threshold labels for human review. Alignment studies with human judges gated the trust transition; label-validation time dropped >95% once this was operational.
-
sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dropbox Dash's LLM-as-judge iteration arc is a canonical "retrieval-relevance judge" case. Named four-step disagreement-reduction sequence, each step cumulatively lowering judge-vs-human disagreement: (1) baseline judge prompt — ~8% disagreement. (2) prompt refinement — "provide explanations for what you're doing" (classic chain-of-thought framing) — disagreement lower. (3) stronger model — upgraded to OpenAI's o3 reasoning model — lower still. (4) RAG as a judge — let the judge fetch work-context for domain-specific acronyms / terminology it wasn't trained on ("What is RAG?" might mean something internal to Dropbox that the model hasn't seen) — lower still. DSPy was then layered on top of these four steps via the patterns/prompt-optimizer-flywheel (bullet-pointed disagreements → optimizer → reduced-disagreement prompt) — "quite pleased with some of the results." LLM-as-judge explicitly called out as a prime DSPy target because judges have "crystal clear rubrics and evals… you know exactly what the outcome should be." Also: Dash uses NDCG scored against judge / human labels as the retrieval quality metric.
-
sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — companion to the Clemm transcript, covering the LLM-as-labeler side: LLM judges generate hundreds of thousands to millions of relevance labels used to train Dash's XGBoost ranker (systems/dash-relevance-ranker). Adds (a) Mean Squared Error on the 1–5 graded scale (range 0–16) as the named judge-vs-human agreement metric — small disagreements get small penalty, wide disagreements get quadratically larger penalty; (b) the human-calibrated LLM labeling pipeline (patterns/human-calibrated-llm-labeling) — humans label a seed set, calibrate the judge, the judge amplifies ~100×; (c) patterns/behavior-discrepancy-sampling — sample for human review the cases where user behaviour disagrees with the LLM's judgment (clicks on low-rated results / skips on high-rated results); (d) patterns/judge-query-context-tooling — the judge is given retrieval tools to research query context before scoring (canonical example: "diet sprite" = internal performance-management tool, not a drink), generalising concepts/rag-as-a-judge from single-retrieval to tool-using agent. DSPy named as prompt-optimisation framework minimising MSE. Explicit framing: "LLMs make it possible to apply human judgment consistently and at scale, rather than replacing it."
-
sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — model-adaptation edition: same Dash relevance judge retargeted across o3 / gpt-oss-120b / gemma-3-12b via DSPy. Introduces two quality axes as first-class peers for an LLM judge: (a) alignment — NMSE on a 1–5 scale rescaled to 0–100 (cut 8.83 → 4.86, −45% on gpt-oss-120b); (b) operational reliability — JSON validity rate (malformed-JSON cut 42% → <1.1% on gemma-3-12b, a 97%+ reduction, with malformed counted as fully incorrect). DSPy's GEPA optimiser drives alignment via feedback strings; DSPy's MIPROv2 optimiser drives both axes simultaneously on small models. Adds three usage regimes by risk tolerance: patterns/cross-model-prompt-adaptation (full rewrite, cheap targets), patterns/instruction-library-prompt-composition (constrained bullet-selection, production o3), and the fixed-model patterns/prompt-optimizer-flywheel. Formalises overfitting guardrails (invariants live in the feedback string). Cycle time: 1–2 weeks manual → 1–2 days with DSPy. Label coverage: 10–100× at same cost on cheaper targets.
-
sources/2025-11-06-google-ds-star-versatile-data-science-agent — plan-sufficiency edition: the judge sits inside the agent's generation loop, not in an eval harness. Google Research's DS-STAR uses a Verifier LLM to score whether the current plan is adequate after each Coder execution; the loop extends or repairs the plan (via a separate Router agent) on reject and exits on approval or at 10 rounds. The primitive is load-bearing because open-ended data-science problems "lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct" — the judge is the only feasible structural check. Empirical round-count distribution on DABStep: 3.0 avg rounds on easy tasks (>50 % single-round) vs 5.6 on hard tasks; state-of-the-art accuracy (45.2 % on DABStep vs 41.0 % prior best). This extends the wiki's LLM-as-judge framing along a new axis — in the loop that generates the thing being judged, co-evolving with the plan rather than scoring a fixed artefact. Canonical realisation is patterns/planner-coder-verifier-router-loop; loop-level concept is concepts/iterative-plan-refinement; termination discipline is concepts/refinement-round-budget.
-
sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — coordinator judge-pass edition: the judge is the coordinator agent itself consolidating output from N specialised sub-reviewers. After the seven sub-reviewers return structured XML findings, the coordinator (running on the top model tier — Opus 4.7 / GPT-5.4) performs a judge pass that: (1) deduplicates findings surfaced by multiple reviewers; (2) re-categorises — a perf bug flagged by code-quality moves to the perf section; (3) applies a reasonableness filter dropping speculative issues, nitpicks, false positives, convention-contradicted findings; (4) when uncertain, uses its tools to read the source code and verify before emitting. Distinguishes from prior LLM-as-judge instances because the judge is (a) wrapped inside a production pipeline with GitLab actions as its downstream effects, (b) scored in production by how often humans invoke break glass (0.6% of MRs), and (c) responsible for writing a natural-language verdict that ships as the single GitLab review comment. Canonical wiki instance of the coordinator-pattern variant of LLM-as-judge (see patterns/coordinator-sub-reviewer-orchestration).
-
sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — structured-extraction production instance. Instacart PARSE runs LLM-as-judge in two places: (a) development-mode auto-eval on a small sample alongside human labelers to let prompt authors iterate between human-eval cycles without blocking on them; (b) production-mode drift detection on periodic random samples of live extractions (paired with human auditors) to catch quality regressions on newly onboarded products. Orthogonal to the per-extraction self-verification confidence score — the two cover different failure modes.
-
sources/2026-02-19-lyft-scaling-localization-with-ai — machine-translation in-loop judge instance. Lyft's AI localization pipeline uses the Evaluator agent as an in-loop judge gating each translated string: grades 3 Drafter-produced candidates on a 4-dim rubric (accuracy/clarity, fluency/adaptation, brand alignment, technical correctness), picks best on any-pass or feeds per-candidate critique back to the Drafter on all-fail (up to 3 attempts). Canonical on-wiki articulation of the four architecture-determining reasons to separate generator from judge instance: easier eval, context preservation on the generator side, [[concepts/self-approval-bias|self-approval bias]] avoidance, tier-differentiated model choice. Canonical text-translation sibling of the image (PIXEL), agent-plan (DS-STAR), and structured-extraction (PARSE) in-loop-judge instances.
-
sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — search-relevance pre-launch-validation instance. Zalando's Search & Browse team ships Search Quality Framework: an offline patterns/llm-as-judge-for-search-quality pipeline that scores (query, product) pairs on a graded 0–4 relevance rubric using GPT-4o with product data + images as visual-text context (concepts/visual-text-relevance-judgment). Applied to pre-launch market validation for Zalando's 2025 Luxembourg / Portugal / Greece launches — a setting where click-based quality signals are by definition unavailable. The test set is constructed by NER-clustered sampling from existing-market production traffic with LLM-translation to the target language; segment-level aggregates (patterns/segment-level-relevance-dashboard) surface three named failure classes (incorrect product attributes; unrecognised NER terms; undiscoverable brand categories). Cost: ~$250 per full run — 1,500 segments × 25 results × 3 markets; 3–5 hours. Judge's generalised reasoning (no per-attribute prompts) contrasts with Netflix Synopsis Judge's per-criterion judges — search relevance is a one-dimensional axis where Netflix's creative-quality axes are four-dimensional. Canonical wiki instance of LLM-as-judge in the search-relevance domain with visual-text context.
-
sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — multi-level-rubric variant, extending the per-trajectory multi-dimensional rubric into a three-level hierarchy matching the artefact's own structure. Instacart's generative recommendations platform (Shopping Hub rebuild) runs LLM-as-judge at page / placement / product levels, each with its own rubric (see patterns/llm-as-judge-multi-level-rubric). Key architectural insight from this instance: "LLM-as-a-judge evaluators are a powerful tool… it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale." That cost-at-scale limitation motivates the companion [[patterns/fine-tuned-cross-encoder-as-filter|fine-tuned DeBERTa cross-encoder]] at >99% cheaper than LLM inference, running on every candidate rather than a sample — making evaluation become action. Canonical wiki instance of the action-vs-measurement tension in LLM-as-judge at full-catalog scale.
-
sources/2026-05-11-databricks-unlocking-the-archives — inline-pipeline-stage edition. The Databricks-for-Good groundwater-archive pipeline embeds the judge as a first-class pipeline stage gating every document classification produced by the upstream multimodal classifier. Same primitive (ai_query) for both producer and judge — different models, same SQL surface. Rubric: accuracy / completeness / consistency "checking whether the classifications are supported by what the model actually observed." Output: categorical rating (excellent / good / fair / poor) plus a written justification — the justification column doubles as audit trail. Sub-threshold documents route to manual review; "in the first full run, only a small fraction of classifications required human attention." First-run result: 95% rated excellent/good across 5,570 pages / 654 documents. Distinguishes from prior wiki instances by being a production runtime gate, not an offline / CI eval — every output of every run flows through the judge before reaching downstream consumers (MapAid's WellMapr groundwater prediction models).
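The runtime-gate shape in the inline-pipeline-stage entry above (categorical rating plus justification, with sub-threshold outputs routed to manual review) can be sketched as follows. This is a plain-Python illustration, not the SQL pipeline from the source: `Judgement`, `route_by_rating`, and the sample records are hypothetical, and the "good" threshold is an assumption, since the source does not state where the cut-off sits.

```python
from dataclasses import dataclass

RATING_ORDER = ["poor", "fair", "good", "excellent"]  # worst to best


@dataclass
class Judgement:
    doc_id: str
    rating: str         # expected to be one of RATING_ORDER
    justification: str  # kept with the verdict as an audit trail


def route_by_rating(judgements, min_rating="good"):
    """Split judged documents into (accepted, manual_review).

    min_rating="good" is an assumed threshold, not one stated in the source.
    """
    threshold = RATING_ORDER.index(min_rating)
    accepted, review = [], []
    for j in judgements:
        # Unrecognised ratings go to review rather than being silently accepted.
        ok = j.rating in RATING_ORDER and RATING_ORDER.index(j.rating) >= threshold
        (accepted if ok else review).append(j)
    return accepted, review


# Made-up judge outputs for four documents.
judged = [
    Judgement("doc-1", "excellent", "labels match the observed table headers"),
    Judgement("doc-2", "fair", "classification not supported by page content"),
    Judgement("doc-3", "good", "minor inconsistency in one field"),
    Judgement("doc-4", "unclear", "judge returned an off-rubric rating"),
]
accepted, review = route_by_rating(judged)
```

Sending off-rubric ratings to review keeps a malformed judge response from passing the gate by default.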
Related
- patterns/snapshot-replay-agent-evaluation — the replay harness that judges plug into.
- systems/mlflow — hosts the judges primitive at Databricks.
- systems/storex — production consumer of the pattern.
- systems/bits-ai-sre-eval-platform — Datadog's productionised eval platform applying judge-style scoring at both ends of the pipeline.
- patterns/agent-assisted-label-validation — up-funnel judge application: scoring labels, not just outputs.
- concepts/trajectory-evaluation — the scoring shape for agent traces, not just final answers.
- concepts/pass-at-k — complementary metric for quantifying non-determinism.
- concepts/nmse-normalized-mean-squared-error — specific alignment metric used at Dash for DSPy optimisation.
- concepts/structured-output-reliability — co-equal operational axis to alignment when the judge is consumed programmatically.
- patterns/cross-model-prompt-adaptation — retarget a judge across different models via DSPy.
- patterns/instruction-library-prompt-composition — constrained DSPy optimisation for high-stakes production judges.
- patterns/tool-decoupled-agent-framework — the iteration loop that makes judge scoring valuable.
- patterns/planner-coder-verifier-router-loop — the plan-sufficiency variant: judge sits inside the generation loop, gating add-or-fix plan refinement.
- concepts/iterative-plan-refinement — loop-level framing the plan-sufficiency judge plugs into.
- concepts/refinement-round-budget — termination discipline for judge-gated loops.
- systems/ds-star — canonical in-loop-judgment instance.
- concepts/vlm-as-image-judge — multimodal sibling — same "one model scores another model's output against a rubric" structure applied to images. Instacart PIXEL's 20% → 85% human-judge approval rate uplift is the canonical production instance.
- patterns/vlm-evaluator-quality-gate — the image-output sibling of judge-gated refinement.
- systems/instacart-pixel — canonical image-generation instance of the judge-in-loop pattern.
- systems/instacart-parse — canonical structured-extraction instance: LLM-as-judge runs in both development mode (auto-eval to unblock prompt iteration) and production mode (periodic random-sample drift detection) alongside human auditors. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
- concepts/llm-self-verification — the self-judge special case: the extracting model scores its own output via entailment-prompt + yes-token logit.
- patterns/human-in-the-loop-quality-sampling — the periodic-sample drift-detection loop inside which LLM-as-judge multiplies reviewer coverage.
- patterns/drafter-evaluator-refinement-loop — the text-translation in-loop-judge sibling pattern (Lyft's AI localization pipeline).
- patterns/multi-candidate-generation — the N-candidates subroutine the Evaluator selects over.
- concepts/self-approval-bias — the failure mode the generator-vs-judge separation mitigates.
- concepts/machine-translation-with-llms — the task context in which Lyft's in-loop judge operates.
- systems/lyft-ai-localization-pipeline — canonical text-translation production instance.
- systems/lace-instacart — chatbot-evaluation production instance: multi-agent debate + binary rubric + reasoning/output decouple + human-alignment loop.
- patterns/multi-agent-debate-evaluation — Customer + Support + Judge three-agent evaluation engine (Instacart LACE).
- patterns/self-reflection-llm-evaluation — the two-pass single-agent variant Instacart benchmarked before picking debate.
- patterns/human-aligned-criteria-refinement-loop — bootstrap + regression-test pattern for LLM-as-judge rubrics.
- concepts/binary-vs-graded-llm-scoring — the rubric-shape choice Instacart measured in favour of binary.
- concepts/decouple-reasoning-from-structured-output — the free-form-reasoning → formatter two-pass design.
- concepts/llm-evaluation-dimensions — rubric-dimension design (5 dimensions + 3 complexity tiers at LACE).
- concepts/human-llm-evaluation-alignment — the calibration concept behind the refinement loop.
- concepts/stratified-topic-sampling — the production-monitoring sampling strategy paired with judges in long-tailed traffic.
- sources/2026-04-21-meta-modernizing-facebook-groups-search — Meta Groups Scoped Search — Llama 3 multimodal judge in the BVT pipeline. Canonical wiki instance of the LLM-judge-in-BVT pattern: Meta integrates a Llama 3 multimodal judge directly into the build-verification-test CI stage, grading search results on a three-level "exact-match / somewhat-relevant / irrelevant" rubric before any build advances to production. First Meta-authored LLM-judge-in-CI instance on the wiki; stronger operational stance than offline-eval-leaderboard variants (Zalando / Instacart) — judge verdict is a build-gate, not only a training signal. Motivation: "validate quality at scale without the bottleneck of human labeling."
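The judge-vs-human agreement metrics cited in the Dropbox entries (MSE on the 1–5 graded scale, and NMSE rescaled to 0–100) can be sketched as below. The MSE range of 0–16 follows from the scale itself, since the largest possible disagreement is 4. The NMSE normalisation shown (MSE divided by the maximum squared error, times 100) is an assumption; the sources say only that NMSE is rescaled to 0–100. The sample labels are made up.

```python
def mse(judge_scores, human_scores):
    """Mean squared error between judge and human labels on the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    squared = [(j - h) ** 2 for j, h in zip(judge_scores, human_scores)]
    return sum(squared) / len(squared)


def nmse(judge_scores, human_scores, lo=1, hi=5):
    """MSE rescaled to 0-100.

    Assumed normalisation: divide by the largest possible squared error,
    (hi - lo) ** 2, which is 16 on a 1-5 scale.
    """
    return 100.0 * mse(judge_scores, human_scores) / (hi - lo) ** 2


judge = [5, 3, 1, 4]
human = [5, 4, 5, 4]
# Squared errors are 0, 1, 16, 0: the single wide miss dominates the total,
# which is the quadratic-penalty behaviour described for this metric.
agreement_mse = mse(judge, human)
agreement_nmse = nmse(judge, human)
```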