CONCEPT
Segment-level Root-cause Diagnosis¶
Definition¶
Segment-level root-cause diagnosis is the observability discipline of aggregating noisy per-item scores (e.g. per-query / per-result relevance) up to a segment — a coherent group of items sharing a diagnostic axis — and reading the aggregate signal as a ranked shortlist of probable root causes.
Per-query scores are noisy; per-segment aggregates are stable. The diagnostic axis (in Zalando's case, the NER-tag set) is deliberately chosen so that different segment-level patterns map to different root causes.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
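The aggregation step in the definition can be sketched in a few lines of Python. The query scores, tag names, and 0–4 scale below are illustrative assumptions, not Zalando's actual data:

```python
from collections import defaultdict
from statistics import mean

# Illustrative per-query judge scores on a 0-4 scale, keyed by each
# query's NER-tag segment. Names and numbers are made up, not Zalando's.
scored_queries = [
    ({"CATEGORY": "yoga"}, 3.8),
    ({"CATEGORY": "yoga"}, 1.2),   # a noisy per-query outlier
    ({"CATEGORY": "yoga"}, 3.6),
    ({"CATEGORY": "leggings"}, 1.4),
    ({"CATEGORY": "leggings"}, 1.7),
    ({"CATEGORY": "leggings"}, 1.5),
]

def segment_key(tags):
    """Canonical hashable key for an NER-tag set."""
    return tuple(sorted(tags.items()))

def aggregate_by_segment(scored):
    """Average noisy per-query scores up to their segment."""
    buckets = defaultdict(list)
    for tags, score in scored:
        buckets[segment_key(tags)].append(score)
    return {seg: mean(scores) for seg, scores in buckets.items()}

averages = aggregate_by_segment(scored_queries)
# leggings averages ~1.53 while yoga averages ~2.87: the segment
# aggregate separates a consistent problem from per-query noise.
```

Note how the single 1.2 outlier barely moves the yoga aggregate, while the uniformly low leggings scores produce a stable low average, exactly the noise-vs-signal separation the definition describes.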
The failure classes segment patterns surface¶
Zalando's post enumerates three:
1. Incorrect product attributes / data¶
"Product categories with incorrect attributes have difficulty surfacing in search results despite different query variations. Multiple NER tag segments with similar meaning but consistently low relevance scores indicate this issue."
Pattern: several NER-tag segments sharing a category or attribute family all score low together. The common factor is the product-side attribute set, so the diagnostic jumps to "the product data is wrong, not the NER or the ranker".
2. Unrecognised terms / attributes by NER¶
"The evaluation pipeline processes NER tagging (NER analyzer task in the Airflow DAG) to identify unrecognized terms. This helps validate spell correction and lemmatization in new languages, and determines whether to index missing terms for searchability."
Pattern: translated queries produce missing or incorrect NER tags relative to their source-language equivalents — the translated-query-parity violation. Four Portuguese examples disclosed in the source: "desporto" (lemmatisation drift), "ténis" (sneaker/tennis sense collision), "menina" (missing vocabulary), "fato de treino" (multi-word term unrecognised).
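A minimal sketch of such a parity check, comparing the NER tag set of a source-language query with its translated equivalent. The function name, tag dictionaries, and English tag values are hypothetical, not Zalando's pipeline code:

```python
def tag_parity_violations(source_tags, translated_tags):
    """Report tags present on the source-language side that are
    missing or different on the translated side.
    Hypothetical sketch, not Zalando's pipeline code."""
    violations = {}
    for key, value in source_tags.items():
        got = translated_tags.get(key)
        if got != value:
            violations[key] = (value, got)
    return violations

# "fato de treino" (tracksuit): the multi-word Portuguese term goes
# unrecognised, so the query loses its CATEGORY tag after NER.
src = {"CATEGORY": "tracksuit", "GENDER": "women"}
pt = {"GENDER": "women"}  # CATEGORY missing on the Portuguese side
```

Here `tag_parity_violations(src, pt)` flags the lost CATEGORY tag, surfacing "fato de treino" as an NER-vocabulary gap rather than a ranking problem.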
3. Undiscoverable products / categories¶
"This helps us identify if a brand, a product family, or a category is not discoverable by analyzing multiple search segments that share the same product tags from NER."
Pattern: several segments sharing a brand tag (e.g. BRAND=foo across CATEGORY=yoga, CATEGORY=leggings, GENDER=mulher CATEGORY=tops, CATEGORY=fato de treino, CATEGORY=jackets MATERIAL=nylon) all score low. The common factor is the brand's product data — "the product data may have quality issues, e.g. missing or wrong attributes, which leads to the issue that these products are less discoverable by search."
Zalando's disclosed example:

| Segment | Avg relevance |
|---|---|
| BRAND=foo CATEGORY=yoga | 1.8 / 4.0 |
| BRAND=foo CATEGORY=leggings | 1.6 / 4.0 |
| BRAND=foo GENDER=mulher CATEGORY=tops | 1.9 / 4.0 |
| BRAND=foo CATEGORY=fato de treino | 1.7 / 4.0 |
| BRAND=foo CATEGORY=jackets MATERIAL=nylon | 1.5 / 4.0 |

All five BRAND=foo segments score below 2 — a clean "investigate this brand's catalog data" signal.
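The table's signal can be mechanised as a simple all-segments-low check. The 2.0 threshold and the function are assumptions for illustration; Zalando does not disclose its alerting logic:

```python
# Segment averages from Zalando's disclosed example (0-4 scale).
segments = {
    "BRAND=foo CATEGORY=yoga": 1.8,
    "BRAND=foo CATEGORY=leggings": 1.6,
    "BRAND=foo GENDER=mulher CATEGORY=tops": 1.9,
    "BRAND=foo CATEGORY=fato de treino": 1.7,
    "BRAND=foo CATEGORY=jackets MATERIAL=nylon": 1.5,
}

LOW = 2.0  # assumed threshold, not disclosed in the source

def brand_signal(segments, brand, low=LOW):
    """True when every segment carrying the brand tag scores low,
    i.e. the brand itself is the probable common factor."""
    scores = [s for seg, s in segments.items() if f"BRAND={brand}" in seg]
    return bool(scores) and all(s < low for s in scores)

# brand_signal(segments, "foo") -> True: investigate foo's catalog data.
```

Requiring *all* brand segments to be low (rather than any one) is what distinguishes a brand-data cause from an isolated category problem.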
Why the segmentation axis matters¶
The power of segment-level diagnosis comes from choosing a segmentation axis whose shared failures correspond to shared causes. NER-tag sets work because each failure class above has a clean projection onto tag-set structure:
- Same category/attribute family across segments → product-attribute class.
- Same translation source with tag-set diff → NER-vocabulary class.
- Same brand tag across categories → brand-data class.
A badly-chosen axis would scatter each failure class across segments, reducing the aggregate signal back to noise.
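The projections above amount to a routing rule: intersect the tag sets of the low-scoring segments and map whatever tag survives to a failure class. A minimal sketch, with illustrative rules mirroring the projections; none of this is Zalando's disclosed code:

```python
def shared_tags(low_segments):
    """Tags common to every low-scoring segment; each segment is a
    dict of NER tag -> value."""
    common = set(low_segments[0].items())
    for seg in low_segments[1:]:
        common &= set(seg.items())
    return dict(common)

def probable_cause(low_segments):
    """Route low-scoring segments to a failure class by their
    shared tag. Rules are illustrative, not Zalando's."""
    common = shared_tags(low_segments)
    if "BRAND" in common:
        return "brand-data: check this brand's catalog attributes"
    if "CATEGORY" in common or "MATERIAL" in common:
        return "product-attribute: check this attribute family's data"
    return "no shared tag: this axis gives no clean signal here"

low = [
    {"BRAND": "foo", "CATEGORY": "yoga"},
    {"BRAND": "foo", "CATEGORY": "leggings"},
    {"BRAND": "foo", "GENDER": "mulher", "CATEGORY": "tops"},
]
# BRAND=foo is the only tag all three low segments share, so the
# router lands on the brand-data class.
```

The last branch is the badly-chosen-axis case: when the low segments share nothing, the aggregate gives no shortlist and the signal degrades back to noise.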
Generalises beyond search relevance¶
The pattern recurs in any noisy-per-item, structured-label-space evaluation:
- Observability metrics aggregated by endpoint / client / region / shard — per-request noise, per-segment signal.
- Model-quality monitoring aggregated by input-feature bucket (demographic, device, time) — per-prediction noise, per-bucket drift signal.
- A/B test readouts aggregated by segment — individual user noise, per-segment effect.
Limitation¶
Segments must be disjoint enough to assign causes cleanly. If a single segment simultaneously has bad product data and bad NER coverage and a brand issue, the aggregate score is low for all three reasons; the engineer still has to cross-reference. Segment-level diagnosis is a shortlist mechanism, not a verdict mechanism.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando surfaces three named failure classes from NER-tag-segmented LLM-judge scores.
Related¶
- concepts/llm-as-judge
- concepts/ner-clustered-query-sampling — provides the axis.
- concepts/visual-text-relevance-judgment — provides the per-item score.
- systems/zalando-search-quality-framework
- patterns/segment-level-relevance-dashboard
- companies/zalando