PATTERN Cited by 1 source

Segment-level Relevance Dashboard

Intent

Present LLM-judge relevance scores to engineers at segment-level aggregate granularity (NER-tag set, market, brand, category) rather than per-query, so that ranked low-scoring segments function as a triage worklist against known failure classes.

Per-query scores are noisy and not actionable; per-segment aggregates are stable and map cleanly onto root causes. This is the engineer-facing half of LLM-as-judge for search quality.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structure

Per-(query, product) scores  →  group by NER-tag set
                                  → avg score per segment

Segment table (ranked ascending):
┌───────────────────────────────┬───────────────┐
│ Segment (NER-tag group)       │ Avg relevance │
├───────────────────────────────┼───────────────┤
│ CATEGORY=fato de treino       │ 1.2 / 4.0     │
│ CATEGORY=desporto             │ 1.5 / 4.0     │
│ GENDER=mulher CATEGORY=menina │ 2.0 / 4.0     │
│ CATEGORY=zapatilhas           │ 2.4 / 4.0     │
│ ...                           │ ...           │
└───────────────────────────────┴───────────────┘

Sibling views (same data, different projections):
  • ranked by brand        → brand-data-quality issues
  • ranked by category     → catalog coverage issues
  • NER-tag-diff table     → target-language NER gaps
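The pipeline above can be sketched in a few lines: group per-(query, product) judge scores by a segment key, average, and sort ascending so the worst segments head the worklist. Field names, sample queries, and scores here are illustrative assumptions, not Zalando's schema; swapping the `key` function gives the sibling projections (by brand, by category).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-(query, product) judge scores on the 0–4 scale.
# `ner_tags` is the NER-tag set extracted from the query (assumed field).
scores = [
    {"query": "fato de treino homem", "ner_tags": ("CATEGORY=fato de treino",), "score": 1.0},
    {"query": "fato de treino",       "ner_tags": ("CATEGORY=fato de treino",), "score": 1.4},
    {"query": "sapatilhas desporto",  "ner_tags": ("CATEGORY=desporto",),       "score": 1.5},
    {"query": "vestido mulher",       "ner_tags": ("CATEGORY=vestido",),        "score": 3.6},
]

def segment_table(rows, key=lambda r: r["ner_tags"]):
    """Aggregate per-item scores into segments and rank worst-first."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[key(r)].append(r["score"])
    table = [(seg, mean(vals)) for seg, vals in buckets.items()]
    return sorted(table, key=lambda t: t[1])  # ascending: triage worklist

for seg, avg in segment_table(scores):
    print(seg, f"{avg:.1f} / 4.0")
```

A brand projection of the same rows is just `segment_table(rows, key=lambda r: r["brand"])`, assuming a `brand` field on each row.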

What the engineer does with it

Zalando's enumeration of the three named failure classes (see concepts/segment-level-root-cause-diagnosis) is effectively a dashboard-to-remediation mapping:

Dashboard pattern → inferred root cause → remediation path:

  • Multiple similar-meaning segments all low
      → inferred cause: incorrect product attributes / data
      → remediation: fix product-data feed for the affected category
  • Target-language segment low + NER-tag diff vs source
      → inferred cause: unrecognised terms in target NER
      → remediation: update target-language NER dictionary, lemmatiser, or multi-word recogniser
  • Multiple brand-scoped segments all low
      → inferred cause: undiscoverable brand catalog
      → remediation: audit brand-side data quality; contact merchant
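This mapping is small enough to encode as data. A minimal sketch, assuming the three pattern keys are our own labels (pattern *detection* is left to the engineer or a separate heuristic):

```python
# Hypothetical encoding of the dashboard-to-remediation mapping.
# Keys are our labels, not Zalando's; values quote the pattern text.
TRIAGE = {
    "similar_segments_low": (
        "Incorrect product attributes / data",
        "Fix product-data feed for the affected category",
    ),
    "target_lang_low_with_ner_diff": (
        "Unrecognised terms in target NER",
        "Update target-language NER dictionary, lemmatiser, or multi-word recogniser",
    ),
    "brand_segments_low": (
        "Undiscoverable brand catalog",
        "Audit brand-side data quality; contact merchant",
    ),
}

cause, fix = TRIAGE["brand_segments_low"]
print(f"{cause} -> {fix}")
```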

Disclosed examples

Zalando's post quotes two dashboard snapshots verbatim:

Portuguese market before launch — four segments with diagnostic cause notes:

┌───────────────────────────────┬───────────────┬───────────────────────────────────────────────┐
│ Segment                       │ Avg relevance │ Observed cause                                │
├───────────────────────────────┼───────────────┼───────────────────────────────────────────────┤
│ CATEGORY=desporto             │ 1.5 / 4.0     │ Lemmatisation drift on sport terms            │
│ CATEGORY=zapatilhas           │ 2.4 / 4.0     │ "tenis" / "ténis" collision with sport tennis │
│ GENDER=mulher CATEGORY=menina │ 2.0 / 4.0     │ "menina", "meninas" unrecognised              │
│ CATEGORY=fato de treino       │ 1.2 / 4.0     │ Multi-word tracksuit term unrecognised        │
└───────────────────────────────┴───────────────┴───────────────────────────────────────────────┘

Brand-wide issue surfaced — five BRAND=foo segments all scoring 1.5–1.9 / 4.0, flagging brand-data quality as the probable cause.
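The brand-wide pattern ("several brand-scoped segments, all low") is mechanically checkable. A minimal sketch, assuming segment aggregates carry a `brand` field; the 2.0 cutoff is an illustrative calibration, since the post discloses no gate threshold:

```python
def flag_brand_issues(segments, low=2.0, min_segments=2):
    """Return brands where every brand-scoped segment scores below `low`.

    `low` and `min_segments` are assumed tuning knobs, not disclosed values.
    """
    by_brand = {}
    for seg in segments:
        by_brand.setdefault(seg["brand"], []).append(seg["avg"])
    return [
        brand
        for brand, avgs in by_brand.items()
        if len(avgs) >= min_segments and all(a < low for a in avgs)
    ]

# Toy data echoing the disclosed shape: BRAND=foo uniformly 1.5–1.9,
# another brand mixed (so not flagged).
segments = [
    {"brand": "foo", "avg": 1.5}, {"brand": "foo", "avg": 1.9},
    {"brand": "foo", "avg": 1.7}, {"brand": "bar", "avg": 3.2},
    {"brand": "bar", "avg": 1.8},
]
print(flag_brand_issues(segments))  # → ['foo']
```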

Design principles

  • Aggregation primacy. The per-item scores feed one-way into segment aggregates. Engineers don't read raw scores.
  • Multiple projections. The same underlying scores are re-aggregated by brand, by category, by target language — each projection surfaces a different failure class.
  • Low-first ranking. Dashboards sort ascending on relevance so the worklist is already ordered by worst-first.
  • Paired with NER-tag-diff table. The parity-check output lives next to the relevance table; two signals are required to localise root cause.
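The last two principles combine into a two-signal triage rule: low relevance alone is a symptom; low relevance plus an NER-tag diff against the source language points at the NER. A sketch assuming each segment row carries a boolean `ner_diff` from the parity check; the `low` threshold is again an assumed calibration:

```python
def localise(segments, low=2.0):
    """Pair the relevance signal with the NER-tag-diff signal.

    Segments at or above `low` stay off the worklist entirely.
    """
    findings = []
    for seg in segments:
        if seg["avg"] >= low:
            continue  # not on the worklist
        if seg["ner_diff"]:
            findings.append((seg["name"], "target-language NER gap"))
        else:
            findings.append((seg["name"], "inspect product/brand data"))
    return findings

rows = [
    {"name": "CATEGORY=fato de treino", "avg": 1.2, "ner_diff": True},
    {"name": "BRAND=foo",               "avg": 1.5, "ner_diff": False},
    {"name": "CATEGORY=vestido",        "avg": 3.6, "ner_diff": False},
]
print(localise(rows))
```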

Trade-offs

  • Segment granularity is a tuning knob. Too-coarse aggregates hide distinct failures; too-fine aggregates re-introduce noise.
  • Threshold discipline is undisclosed. What counts as "low enough to act on"? The post doesn't quote a gate threshold. Teams adopting this pattern will need to calibrate empirically.
  • Single-axis diagnosis assumption. When multiple failure classes overlap in one segment (bad NER + bad product data + bad ranker for the brand), the dashboard surfaces the symptom, not the verdict.
