
LLM judge in build-verification test (BVT)

Pattern

Integrate an LLM-as-judge directly into the build-verification test (BVT) / CI pipeline so that every candidate build of a search, ranking, or agent system is graded by an LLM against a benchmark set of queries and expected answers before it is allowed to advance toward production rollout.

The judge runs per-build, not only at training time or on an evaluation leaderboard. Its verdict, a pass/fail signal plus diversity and conceptual-match metrics, gates the build.
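A minimal sketch of such a gate in Python. Everything here is hypothetical: `search` and `judge` stand in for the system under test and the LLM judge, the benchmark entry is invented, and the 0.9 threshold is illustrative rather than Meta's number.

```python
"""Hypothetical BVT gate: grade every benchmark query's results with an LLM
judge and block the build when the aggregate score drops below a threshold."""
import sys
from typing import Callable

# Graded (non-binary) rubric weights; the labels mirror the Meta quote below.
GRADE_WEIGHTS = {"exact_match": 1.0, "somewhat_relevant": 0.5, "irrelevant": 0.0}

def run_bvt_gate(
    benchmark: list[dict],                       # [{"query": ..., "expected": ...}]
    search: Callable[[str], list[str]],          # system under test
    judge: Callable[[dict, str], str],           # LLM judge: (case, result) -> label
    pass_threshold: float = 0.9,                 # illustrative, not Meta's number
) -> bool:
    """Return True if the build may advance toward production."""
    per_query_scores = []
    for case in benchmark:
        grades = [judge(case, result) for result in search(case["query"])]
        # Credit each benchmark query with its best-graded result.
        per_query_scores.append(
            max((GRADE_WEIGHTS.get(g, 0.0) for g in grades), default=0.0)
        )
    pass_rate = sum(per_query_scores) / max(len(per_query_scores), 1)
    print(f"judge pass rate: {pass_rate:.3f} (threshold {pass_threshold})")
    return pass_rate >= pass_threshold

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; CI wires in real calls.
    bench = [{"query": "nba playoff highlights", "expected": "basketball clips"}]
    ok = run_bvt_gate(
        bench,
        search=lambda q: [f"stub result for {q!r}"],
        judge=lambda case, result: "somewhat_relevant",
    )
    sys.exit(0 if ok else 1)   # nonzero exit blocks build promotion
```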

Why this stance (vs offline leaderboard eval)

Traditional LLM-as-judge deployments run as offline eval harnesses adjacent to training — a model-of-the-week leaderboard. That misses:

  • Regressions from non-model changes — retrieval index rebuild, config rollout, query-preprocessing change — that a training-time eval never touches.
  • Fast shipping pressure — quality grading should cover every merged PR to the search stack, not only candidate models.

Putting the judge in the BVT extends offline eval into CI and makes quality-regression detection a first-class property of the build artifact.

Canonical instance — Meta Groups Scoped Search (2026-04-21)

From the 2026-04-21 Meta Engineering post:

"To validate quality at scale without the bottleneck of human labeling, we integrated an automated evaluation framework into our build verification test (BVT) process. We utilize Llama 3 with multimodal capabilities as an automated judge to grade search results against queries. Unlike binary 'good/bad' labels, our evaluation prompts are designed to detect nuance. We explicitly programmed the system to recognize a 'somewhat relevant' category, defined as cases where the query and result share a common domain or theme (e.g., different sports are still relevant in a general sports context). This allows us to measure improvements in result diversity and conceptual matching."

Key elements of this instance:

  1. Judge model: Llama 3 with multimodal capabilities.
  2. Rubric: graded (exact-match / somewhat-relevant / irrelevant), not binary.
  3. Pipeline position: inside the BVT — gates build-to-production promotion.
  4. Scale intent: "without the bottleneck of human labeling" — human grading does not scale to per-build validation at Meta's shipping cadence; the judge replaces that bottleneck.
  5. Additional metrics: diversity + conceptual-match, enabled by the graded rubric (see the sketch after this list).
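The graded rubric can be made concrete as a judge prompt plus the metrics it enables. A sketch under the same assumptions as above; Meta has not published its prompt, so the wording and both metric definitions below are illustrative reconstructions.

```python
# Illustrative judge prompt encoding the three-label rubric from the quote.
JUDGE_PROMPT = """You are grading a search result against a user query.

Query: {query}
Result: {result}

Reply with exactly one label:
- exact_match: the result directly satisfies the query.
- somewhat_relevant: the result shares a common domain or theme with the
  query (e.g. a different sport still counts in a general sports context).
- irrelevant: no meaningful topical connection.

Label:"""

def conceptual_match_rate(best_grade_per_query: list[str]) -> float:
    """Share of benchmark queries whose best result at least shares the
    query's domain or theme; only expressible with a graded rubric."""
    hits = sum(g in ("exact_match", "somewhat_relevant") for g in best_grade_per_query)
    return hits / len(best_grade_per_query)

def diversity_rate(grades_for_one_query: list[str]) -> float:
    """One plausible diversity measure: the fraction of a result page that is
    at least somewhat relevant, so topically-adjacent results are credited
    rather than discarded as 'bad'."""
    ok = sum(g != "irrelevant" for g in grades_for_one_query)
    return ok / len(grades_for_one_query)
```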

Sibling patterns

Meta's position is the CI-integrated variant — judge as build-gate, not (only) as training signal.

Caveats

  • Judge prompt design is load-bearing; a poorly specified rubric produces noise that either blocks valid builds or lets regressions through.
  • The judge is itself an LLM and subject to version drift; the judge model and prompt should be pinned and versioned alongside the code that gates on them (see the pinning sketch after this list).
  • Human calibration is still valuable — a periodic comparison of judge scores against human labels keeps the judge's distribution anchored (cf. patterns/human-calibrated-llm-labeling).
  • BVT runtime matters — the judge's latency and cost are now in the developer-feedback loop, not just the training cycle.
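A minimal pinning sketch for the drift caveat, assuming the judge is addressed by an explicit model identifier and its prompt lives in the repo; the identifier and digest below are placeholders.

```python
import hashlib

# Pinned next to the gating code so any judge change is a reviewed diff,
# not silent drift. Both values are placeholders.
PINNED_JUDGE_MODEL = "llama-3-multimodal-build-judge-v1"   # hypothetical ID
PINNED_PROMPT_SHA256 = "<sha256 recorded when the prompt was last pinned>"

def assert_judge_pinned(prompt: str, model_id: str) -> None:
    """Fail the BVT fast if the judge model or prompt drifted from the pin."""
    if model_id != PINNED_JUDGE_MODEL:
        raise RuntimeError(f"judge model drifted: {model_id!r}")
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if digest != PINNED_PROMPT_SHA256:
        raise RuntimeError("judge prompt changed without updating the pin")
```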
