
PATTERN

LLM-as-Judge for Search Quality

Intent

Evaluate a search stack's relevance quality without relying on user click signals: have a multi-modal LLM judge score every (query, returned product) pair in a representative test set against a graded rubric, then aggregate the scores into a segment- / market- / stack-level quality report.

The pattern replaces (or augments) click-based bucket tests and human-annotator panels as the short-loop signal for relevance regressions and pre-launch quality.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structure

   test queries (sampled, clustered, translated)
   ┌───────────────────────┐
   │  Query → Search API   │──────► result set (top-K products)
   └───────────────────────┘
   for each (query, product):
       fetch product data + image  (cached)
       ask LLM judge:
          "Score relevance on 0–4 scale"
       record score                (cached)
   aggregate by segment (NER-tag set) → per-segment avg
   aggregate by market               → per-market avg
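
A minimal sketch of that loop in Python, assuming hypothetical stand-ins search_api, fetch_product, and llm_judge_score for the search endpoint, catalog lookup, and judge call (none of these names are from the post):

    from collections import defaultdict
    from statistics import mean

    def evaluate(test_queries, search_api, fetch_product, llm_judge_score,
                 top_k=10, cache=None):
        """Judge each (query, product) pair once; return per-segment and
        per-market mean scores. cache maps (query_text, product_id) -> 0-4."""
        cache = {} if cache is None else cache
        rows = []
        for q in test_queries:                    # q carries .text, .segment, .market
            for product_id in search_api(q.text)[:top_k]:
                key = (q.text, product_id)
                if key not in cache:              # dedup cache: judge each pair once
                    product = fetch_product(product_id)   # product data + image
                    cache[key] = llm_judge_score(q.text, product)
                rows.append((q.segment, q.market, cache[key]))
        by_segment, by_market = defaultdict(list), defaultdict(list)
        for segment, market, score in rows:
            by_segment[segment].append(score)
            by_market[market].append(score)
        return ({s: mean(v) for s, v in by_segment.items()},
                {m: mean(v) for m, v in by_market.items()})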

Core design choices

  • Multi-modal judge. Product image + product data as evaluation context. See concepts/visual-text-relevance-judgment.
  • Graded rubric output (0–4), not binary. 4 = perfect match / 0 = completely wrong; intermediate grades capture partial relevance. Graded output lets segment aggregates be continuous rather than pass-rate-only.
  • Generalised prompt, not per-attribute rubrics. "The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images." See the prompt sketch after this list.
  • (query, product) dedup cache. See patterns/query-product-evaluation-cache.
  • Segment-level aggregation. See patterns/segment-level-relevance-dashboard.
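
A sketch of what such a generalised prompt could look like; the rubric wording below is illustrative, not Zalando's actual prompt, and it assumes an OpenAI-style multi-modal chat completions API with the judge model name as a placeholder:

    import json
    from openai import OpenAI

    RUBRIC = (
        "You are judging e-commerce search relevance.\n"
        "Query: {query}\n"
        "Product data: {product_json}\n"
        "Using the attached product image and the product data, score the\n"
        "product's relevance to the query on a 0-4 scale:\n"
        "4 = perfect match, 3 = good match with a minor mismatch,\n"
        "2 = partially relevant, 1 = marginally related, 0 = completely wrong.\n"
        'Reply with JSON: {{"score": <0-4>, "reason": "<one sentence>"}}'
    )

    client = OpenAI()

    def llm_judge_score(query: str, product: dict) -> int:
        resp = client.chat.completions.create(
            model="gpt-4o",                                # reference judge tier
            response_format={"type": "json_object"},       # force parseable output
            messages=[{"role": "user", "content": [
                {"type": "text", "text": RUBRIC.format(
                    query=query, product_json=json.dumps(product["data"]))},
                {"type": "image_url", "image_url": {"url": product["image_url"]}},
            ]}],
        )
        return int(json.loads(resp.choices[0].message.content)["score"])

Note that the rubric names no specific attributes or image regions, in line with the quoted design choice.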

Zalando's concrete realisation

As summarised above: a multi-modal judge (GPT-4o as the reference model) scores each (query, product) pair, with product data plus product image as context, against a single generalised 0–4 rubric; a (query, product) dedup cache avoids re-judging repeated pairs, and scores roll up to NER-tag-set segments and markets.

When to reach for this pattern

  • Pre-launch new-market / new-locale evaluation. No click signal available; LLM-judge is the structural substitute. See concepts/pre-launch-market-validation.
  • Regression detection on low-traffic segments. Click statistics too noisy; offline LLM-judge run cheap enough to re-run daily.
  • Counterfactual ranker evaluation. Before rolling out a new ranker, run the judge against both rankers' result sets for the same test queries (sketched below).
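
A usage sketch of that comparison, reusing the evaluate loop and hypothetical names from the Structure section; sharing one cache means pairs returned by both rankers are judged only once:

    shared_cache = {}   # (query_text, product_id) -> score, shared across runs
    seg_a, _ = evaluate(test_queries, ranker_a_search, fetch_product,
                        llm_judge_score, cache=shared_cache)
    seg_b, _ = evaluate(test_queries, ranker_b_search, fetch_product,
                        llm_judge_score, cache=shared_cache)
    # Positive delta: ranker B scores higher on that segment.
    delta = {seg: seg_b[seg] - seg_a[seg] for seg in seg_a}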

When not to

  • High-frequency, behaviourally specific optimisation. If you have click / dwell / CTR at statistical significance, those are stronger signals than any judge.
  • Subjective or multi-dimensional quality axes that don't collapse to one relevance scale — use per-criterion judges (à la Netflix Synopsis Judge) instead.
  • Latency-critical online scoring. The pattern is offline; a re-ranker-at-request-time needs different economics.

Cost / quality knobs

  • Which judge model. GPT-4o in 2026 is the reference; the tier below (multi-modal, smaller context) trades quality for cost. Cost sensitivity is the main reason to move off the reference model.
  • Dedup cache scope. Per-run vs persistent.
  • Consensus / sampling. Zalando's description suggests single-shot judgment per pair; consensus scoring (patterns/consensus-scoring) would improve stability at N× cost, though it is not described here (see the sketch after this list).
  • Per-criterion decomposition. Zalando uses one rubric for relevance; Netflix splits their judging into four criteria. Depends on whether your quality definition is one-dimensional.
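
A consensus variant as a sketch (again, not in the post): sample the judge n times per pair and take the median grade, trading roughly n× judge cost for score stability:

    from statistics import median

    def consensus_score(query, product, llm_judge_score, n=3):
        """Median of n independent judge samples; assumes the judge call
        is non-deterministic (e.g. nonzero temperature)."""
        return median(llm_judge_score(query, product) for _ in range(n))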

Tradeoffs with human annotation

Zalando's framing: "Especially so when considering the alternative of human evaluation, which also would take days." The LLM judge is faster and cheaper; the post cites the referenced paper (arXiv:2409.11860) for human-calibration numbers. The quality-versus-cost tradeoff against human annotation is not quantified in the blog post itself, but it is the load-bearing assumption of the pattern's economic case.

Relation to other LLM-as-judge deployments

  • Netflix Synopsis Judge — creative-quality domain (tone / clarity / precision / factuality), per-criterion specialisation, binary outputs. Different domain, different judge shape.
  • Dropbox Dash relevance judge — retrieval relevance with graded scale + NMSE alignment + DSPy (GEPA/MIPROv2) optimisation. Closest sibling to Zalando's case on the relevance axis.
  • Instacart PIXEL VLM image judge — in-loop image judge driving iterative prompt refinement (20% → 85% approval). Closest on the visual-text multi-modal axis but inside a generation loop, not an offline QA gate.
  • Lyft AI localization Evaluator — drafter + evaluator multi-dim rubric. Same shape as Zalando's post-translation NER-parity sidecar at a different altitude.

Seen in

  • sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge