PATTERN Cited by 1 source

Segment-level Relevance Dashboard

Intent

Present LLM-judge relevance scores to engineers at segment-level aggregate granularity (NER-tag set, market, brand, category) rather than per-query, so that ranked low-scoring segments function as a triage worklist against known failure classes.

Per-query scores are noisy and not actionable; per-segment aggregates are stable and map cleanly onto root causes. This is the engineer-facing half of LLM-as-judge for search quality.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structure

Per-(query, product) scores  →  group by NER-tag set
                                  → avg score per segment

Segment table (ranked ascending):
┌───────────────────────────────┬───────────────┐
│ Segment (NER-tag group)       │ Avg relevance │
├───────────────────────────────┼───────────────┤
│ CATEGORY=fato de treino       │ 1.2 / 4.0     │
│ CATEGORY=desporto             │ 1.5 / 4.0     │
│ GENDER=mulher CATEGORY=menina │ 2.0 / 4.0     │
│ CATEGORY=zapatilhas           │ 2.4 / 4.0     │
│ ...                           │ ...           │
└───────────────────────────────┴───────────────┘

Sibling views (same data, different projections):
  • ranked by brand        → brand-data-quality issues
  • ranked by category     → catalog coverage issues
  • NER-tag-diff table     → target-language NER gaps
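The pipeline above can be sketched in a few lines: group per-(query, product) judge scores by a segment key, average, and sort ascending so the worst segments head the worklist. Field names, sample queries, and scores here are illustrative assumptions, not Zalando's schema; swapping the `key` function gives the sibling projections (by brand, by category).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-(query, product) judge scores on the 0–4 scale.
# `ner_tags` is the NER-tag set extracted from the query (assumed field).
scores = [
    {"query": "fato de treino homem", "ner_tags": ("CATEGORY=fato de treino",), "score": 1.0},
    {"query": "fato de treino",       "ner_tags": ("CATEGORY=fato de treino",), "score": 1.4},
    {"query": "sapatilhas desporto",  "ner_tags": ("CATEGORY=desporto",),       "score": 1.5},
    {"query": "vestido mulher",       "ner_tags": ("CATEGORY=vestido",),        "score": 3.6},
]

def segment_table(rows, key=lambda r: r["ner_tags"]):
    """Aggregate per-item scores into segments and rank worst-first."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[key(r)].append(r["score"])
    table = [(seg, mean(vals)) for seg, vals in buckets.items()]
    return sorted(table, key=lambda t: t[1])  # ascending: triage worklist

for seg, avg in segment_table(scores):
    print(seg, f"{avg:.1f} / 4.0")
```

A brand projection of the same rows is just `segment_table(rows, key=lambda r: r["brand"])`, assuming a `brand` field on each row.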

What the engineer does with it

Zalando's enumeration of the three named failure classes (see concepts/segment-level-root-cause-diagnosis) is effectively a dashboard-to-remediation mapping:

Dashboard pattern → inferred root cause → remediation path:

  • Multiple similar-meaning segments all low
      → inferred cause: incorrect product attributes / data
      → remediation: fix product-data feed for the affected category
  • Target-language segment low + NER-tag diff vs source
      → inferred cause: unrecognised terms in target NER
      → remediation: update target-language NER dictionary, lemmatiser, or multi-word recogniser
  • Multiple brand-scoped segments all low
      → inferred cause: undiscoverable brand catalog
      → remediation: audit brand-side data quality; contact merchant
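This mapping is small enough to encode as data. A minimal sketch, assuming the three pattern keys are our own labels (pattern *detection* is left to the engineer or a separate heuristic):

```python
# Hypothetical encoding of the dashboard-to-remediation mapping.
# Keys are our labels, not Zalando's; values quote the pattern text.
TRIAGE = {
    "similar_segments_low": (
        "Incorrect product attributes / data",
        "Fix product-data feed for the affected category",
    ),
    "target_lang_low_with_ner_diff": (
        "Unrecognised terms in target NER",
        "Update target-language NER dictionary, lemmatiser, or multi-word recogniser",
    ),
    "brand_segments_low": (
        "Undiscoverable brand catalog",
        "Audit brand-side data quality; contact merchant",
    ),
}

cause, fix = TRIAGE["brand_segments_low"]
print(f"{cause} -> {fix}")
```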

Disclosed examples

Zalando's post quotes two dashboard snapshots verbatim:

Portuguese market before launch — four segments with diagnostic cause notes:

┌───────────────────────────────┬───────────────┬───────────────────────────────────────────────┐
│ Segment                       │ Avg relevance │ Observed cause                                │
├───────────────────────────────┼───────────────┼───────────────────────────────────────────────┤
│ CATEGORY=desporto             │ 1.5 / 4.0     │ Lemmatisation drift on sport terms            │
│ CATEGORY=zapatilhas           │ 2.4 / 4.0     │ "tenis" / "ténis" collision with sport tennis │
│ GENDER=mulher CATEGORY=menina │ 2.0 / 4.0     │ "menina", "meninas" unrecognised              │
│ CATEGORY=fato de treino       │ 1.2 / 4.0     │ Multi-word tracksuit term unrecognised        │
└───────────────────────────────┴───────────────┴───────────────────────────────────────────────┘

Brand-wide issue surfaced — five BRAND=foo segments all scoring 1.5–1.9 / 4.0, flagging brand-data quality as the probable cause.
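The brand-wide pattern ("several brand-scoped segments, all low") is mechanically checkable. A minimal sketch, assuming segment aggregates carry a `brand` field; the 2.0 cutoff is an illustrative calibration, since the post discloses no gate threshold:

```python
def flag_brand_issues(segments, low=2.0, min_segments=2):
    """Return brands where every brand-scoped segment scores below `low`.

    `low` and `min_segments` are assumed tuning knobs, not disclosed values.
    """
    by_brand = {}
    for seg in segments:
        by_brand.setdefault(seg["brand"], []).append(seg["avg"])
    return [
        brand
        for brand, avgs in by_brand.items()
        if len(avgs) >= min_segments and all(a < low for a in avgs)
    ]

# Toy data echoing the disclosed shape: BRAND=foo uniformly 1.5–1.9,
# another brand mixed (so not flagged).
segments = [
    {"brand": "foo", "avg": 1.5}, {"brand": "foo", "avg": 1.9},
    {"brand": "foo", "avg": 1.7}, {"brand": "bar", "avg": 3.2},
    {"brand": "bar", "avg": 1.8},
]
print(flag_brand_issues(segments))  # → ['foo']
```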

Design principles

  • Aggregation primacy. The per-item scores feed one-way into segment aggregates. Engineers don't read raw scores.
  • Multiple projections. The same underlying scores are re-aggregated by brand, by category, by target language — each projection surfaces a different failure class.
  • Low-first ranking. Dashboards sort ascending on relevance so the worklist is already ordered by worst-first.
  • Paired with NER-tag-diff table. The parity-check output lives next to the relevance table; two signals are required to localise root cause.
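The last two principles combine into a two-signal triage rule: low relevance alone is a symptom; low relevance plus an NER-tag diff against the source language points at the NER. A sketch assuming each segment row carries a boolean `ner_diff` from the parity check; the `low` threshold is again an assumed calibration:

```python
def localise(segments, low=2.0):
    """Pair the relevance signal with the NER-tag-diff signal.

    Segments at or above `low` stay off the worklist entirely.
    """
    findings = []
    for seg in segments:
        if seg["avg"] >= low:
            continue  # not on the worklist
        if seg["ner_diff"]:
            findings.append((seg["name"], "target-language NER gap"))
        else:
            findings.append((seg["name"], "inspect product/brand data"))
    return findings

rows = [
    {"name": "CATEGORY=fato de treino", "avg": 1.2, "ner_diff": True},
    {"name": "BRAND=foo",               "avg": 1.5, "ner_diff": False},
    {"name": "CATEGORY=vestido",        "avg": 3.6, "ner_diff": False},
]
print(localise(rows))
```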

Trade-offs

  • Segment granularity is a tuning knob. Too-coarse aggregates hide distinct failures; too-fine aggregates re-introduce noise.
  • Threshold discipline is undisclosed. What counts as "low enough to act on"? The post doesn't quote a gate threshold. Teams adopting this pattern will need to calibrate empirically.
  • Single-axis diagnosis assumption. When multiple failure classes overlap in one segment (bad NER + bad product data + bad ranker for the brand), the dashboard surfaces the symptom, not the verdict.
