PATTERN
Segment-level Relevance Dashboard¶
Intent¶
Present LLM-judge relevance scores to engineers at segment-level aggregate granularity (NER-tag set, market, brand, category) rather than per-query, so that ranked low-scoring segments function as a triage worklist against known failure classes.
Per-query scores are noisy and not actionable; per-segment aggregates are stable and map cleanly onto root causes. This is the engineer-facing half of LLM-as-judge for search quality.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Structure¶
Per-(query, product) scores → group by NER-tag set
→ avg score per segment
Segment table (ranked ascending):
┌───────────────────────────────┬───────────────┐
│ Segment (NER-tag group)       │ Avg relevance │
├───────────────────────────────┼───────────────┤
│ CATEGORY=fato de treino       │ 1.2 / 4.0     │
│ CATEGORY=desporto             │ 1.5 / 4.0     │
│ GENDER=mulher CATEGORY=menina │ 2.0 / 4.0     │
│ CATEGORY=zapatilhas           │ 2.4 / 4.0     │
│ ...                           │ ...           │
└───────────────────────────────┴───────────────┘
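A minimal sketch of the pipeline above in pandas. The column names and the toy rows are illustrative, not from the post ("foo" follows the post's own brand placeholder):

```python
import pandas as pd

# Hypothetical per-(query, product) judge output; the real schema is
# not disclosed by the post.
scores = pd.DataFrame([
    {"query": "fato de treino homem", "segment": "CATEGORY=fato de treino",
     "brand": "foo", "category": "fato de treino", "relevance": 1.0},
    {"query": "tenis mulher", "segment": "CATEGORY=zapatilhas",
     "brand": "foo", "category": "zapatilhas", "relevance": 2.0},
    # ... one row per judged (query, product) pair
])

# Group per-item scores into segment aggregates, ranked ascending so the
# table doubles as a worst-first triage worklist.
segment_table = (
    scores.groupby("segment")["relevance"]
          .agg(avg_relevance="mean", n="count")
          .sort_values("avg_relevance")
)
# In practice an assumed min-support cutoff (e.g.
# segment_table.query("n >= 50")) keeps sparse segments from
# re-introducing per-query noise.
print(segment_table)
```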
Sibling views (same data, different projections):
• ranked by brand → brand-data-quality issues
• ranked by category → catalog coverage issues
• NER-tag-diff table → target-language NER gaps
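The sibling views are the same frame re-grouped along a different key; a sketch reusing the hypothetical `scores` frame above:

```python
# Identical per-item scores, re-aggregated along other axes; each
# projection surfaces a different failure class (see the mapping below).
by_brand    = scores.groupby("brand")["relevance"].mean().sort_values()
by_category = scores.groupby("category")["relevance"].mean().sort_values()
```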
What the engineer does with it¶
Zalando's enumeration of the three named failure classes (see concepts/segment-level-root-cause-diagnosis) is effectively a dashboard-to-remediation mapping:
| Dashboard pattern | Inferred root cause | Remediation path |
|---|---|---|
| Multiple similar-meaning segments all low | Incorrect product attributes / data | Fix product-data feed for the affected category |
| Target-language segment low + NER-tag diff vs source | Unrecognised terms in target NER | Update target-language NER dictionary, lemmatiser, or multi-word recogniser |
| Multiple brand-scoped segments all low | Undiscoverable brand catalog | Audit brand-side data quality; contact merchant |
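Read mechanically, the table is a first-pass triage rule. A sketch with an assumed row schema (`brand`, `ner_tag_diff`) that the post does not disclose:

```python
def triage(low_segments: list[dict]) -> str:
    """Guess a root cause for a cluster of low-scoring segments.

    Each dict is assumed to look like
    {"segment": str, "avg_relevance": float, "brand": str | None,
     "ner_tag_diff": set[str]}; this schema is illustrative, not Zalando's.
    """
    brands = {s["brand"] for s in low_segments if s.get("brand")}
    if len(low_segments) >= 2 and len(brands) == 1:
        # Multiple brand-scoped segments all low -> undiscoverable brand catalog.
        return "audit brand-side data quality; contact merchant"
    if any(s.get("ner_tag_diff") for s in low_segments):
        # Low target-language segment plus a NER-tag diff vs source.
        return "update target-language NER dictionary / lemmatiser / multi-word recogniser"
    # Multiple similar-meaning segments all low -> bad product attributes.
    return "fix product-data feed for the affected category"
```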
Disclosed examples¶
Zalando's post quotes two dashboard snapshots verbatim:
Portuguese market before launch — four segments with diagnostic cause notes:
| Segment | Avg relevance | Observed cause |
|---|---|---|
| CATEGORY=desporto | 1.5 / 4.0 | Lemmatisation drift on sport terms |
| CATEGORY=zapatilhas | 2.4 / 4.0 | "tenis" / "ténis" collision with sport tennis |
| GENDER=mulher CATEGORY=menina | 2.0 / 4.0 | "menina", "meninas" unrecognised |
| CATEGORY=fato de treino | 1.2 / 4.0 | Multi-word tracksuit term unrecognised |
Brand-wide issue surfaced: five BRAND=foo segments all scoring 1.5–1.9 / 4.0, flagging brand-data quality as the probable cause.
Design principles¶
- Aggregation primacy. The per-item scores feed one-way into segment aggregates. Engineers don't read raw scores.
- Multiple projections. The same underlying scores are re-aggregated by brand, by category, by target language — each projection surfaces a different failure class.
- Low-first ranking. Dashboards sort ascending on relevance, so the worklist arrives already ordered worst-first.
- Paired with NER-tag-diff table. The parity-check output lives next to the relevance table; both signals are required to localise the root cause.
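For that paired signal, one plausible reading of the NER-tag-diff table is a per-query set difference between source- and target-language tagger output; the post names the table but not its construction, so this is a sketch:

```python
def ner_tag_diff(source_tags: set[str], target_tags: set[str]) -> set[str]:
    """Tags the source-language NER recognised but the target-language NER missed."""
    return source_tags - target_tags

# A non-empty diff next to a low relevance score localises the cause to the
# target-language dictionary, e.g. the disclosed case where the multi-word
# tracksuit term "fato de treino" goes unrecognised (tag form assumed):
assert ner_tag_diff({"CATEGORY=fato de treino"}, set()) == {"CATEGORY=fato de treino"}
```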
Trade-offs¶
- Segment granularity is a tuning knob. Too-coarse aggregates hide distinct failures; too-fine aggregates re-introduce noise.
- Threshold discipline is undisclosed. What counts as "low enough to act on"? The post doesn't quote a gate threshold. Teams adopting this pattern will need to calibrate empirically.
- Single-axis diagnosis assumption. When multiple failure classes overlap in one segment (bad NER + bad product data + bad ranker for the brand), the dashboard surfaces the symptom, not the verdict.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando's dashboard + the three-failure-class enumeration.