CONCEPT Cited by 1 source

NER-tag Parity Across Languages¶

Definition¶

NER-tag parity across languages is the operational realisation of the translated-query-parity invariant: run the same NER engine against both the source-language query and its target-language translation, then diff the extracted NER-tag sets. Matching tag sets mean the translation preserved intent and the target-language NER covered it. Diverging tag sets mean at least one of the two is broken.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
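The diff itself can be illustrated with plain set arithmetic. The following is a minimal sketch, not Zalando's implementation: the `TagDiff` type, the `diff_tags` helper, and the tag strings are hypothetical names chosen for illustration, and the tag sets are assumed to have already been produced by the NER engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagDiff:
    """Result of comparing the tag sets of a query and its translation."""
    missing_in_target: frozenset  # tags extracted from the source query only
    extra_in_target: frozenset    # tags extracted from the translated query only

    @property
    def parity(self) -> bool:
        # Parity holds only when neither side has a tag the other lacks.
        return not self.missing_in_target and not self.extra_in_target


def diff_tags(source_tags, target_tags) -> TagDiff:
    """Compare two collections of NER tags as sets."""
    source, target = frozenset(source_tags), frozenset(target_tags)
    return TagDiff(missing_in_target=source - target,
                   extra_in_target=target - source)


# Hypothetical NER output for a source query and its Portuguese translation.
source_query_tags = {"category:sneakers", "gender:girls"}
translated_query_tags = {"category:sneakers"}  # the gender term was not tagged

result = diff_tags(source_query_tags, translated_query_tags)
print(result.parity)                     # False
print(sorted(result.missing_in_target))  # ['gender:girls']
```

Keeping the two directions of the diff separate shows whether the target language lost a tag or gained a spurious one, which a single symmetric difference would hide.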

The sidecar task¶

In Zalando's pipeline this is a distinct Airflow task — an "NER analyzer task in the Airflow DAG" — that runs in parallel with relevance scoring rather than as a prerequisite. Its output is a per-scenario tag-diff report, consumed as a first-class diagnostic signal alongside the LLM-judge segment-level relevance aggregates.

Separating the parity check from the relevance check is deliberate: low relevance with tag parity = ranker / product-data issue; low relevance with tag mismatch = NER-vocabulary issue.
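The triage rule above can be written as a small decision function. This is a sketch under assumptions: the `triage` name, the 0-to-1 relevance scale, and the `threshold` default are hypothetical and are not taken from the source, which does not disclose how the relevance aggregates are scored.

```python
def triage(relevance: float, tags_match: bool, threshold: float = 0.5) -> str:
    """Map the two signals of one scenario to the layer most likely at fault.

    relevance  -- aggregated judge score, assumed here to lie in [0, 1]
    tags_match -- True when source and translated queries yield the same tags
    threshold  -- hypothetical cut-off below which relevance counts as low
    """
    if relevance >= threshold:
        # Relevance held; a mismatch is still worth recording as a diagnostic.
        return "ok" if tags_match else "tag mismatch without relevance loss"
    if tags_match:
        return "ranker / product-data issue"
    return "NER-vocabulary issue"


print(triage(0.2, tags_match=True))   # ranker / product-data issue
print(triage(0.2, tags_match=False))  # NER-vocabulary issue
```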

Disclosed violation shapes¶

| Shape | Example (PT) | Effect |
| --- | --- | --- |
| Lemmatisation drift | "desporto", "desportivo", "desportiva" each receive a different tag | Inconsistent filters across paraphrase queries |
| Ambiguous-term collision | "tenis" / "ténis" (sneaker) vs. tennis the sport | Term unrecognised; sport-shoes scenario degrades |
| Missing vocabulary | "menina", "meninas" (girl) | Mixed-gender result sets |
| Multi-word term unrecognised | "fato de treino" (tracksuit) | Zero sport/tracksuit results |
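Lemmatisation drift, the first shape, lends itself to a simple consistency check: word forms that should share a lemma should also share a tag. The sketch below assumes a hypothetical term-to-tag dictionary (`tag_of`) and hypothetical variant groups; the tag values are placeholders, since the source does not disclose the actual tags.

```python
def find_drift(variant_groups, tag_of):
    """Return the variant groups whose members do not all map to one tag.

    variant_groups -- {group name: [word forms expected to share a tag]}
    tag_of         -- {term: tag}; terms absent from it are treated as untagged
    """
    drifting = {}
    for group, variants in variant_groups.items():
        tags = {variant: tag_of.get(variant) for variant in variants}
        if len(set(tags.values())) > 1:  # more than one distinct tag, or a gap
            drifting[group] = tags
    return drifting


# Placeholder tags: the three forms are known to differ, not what they map to.
tag_of = {
    "desporto": "tag_a",
    "desportivo": "tag_b",
    "desportiva": "tag_c",
    "preto": "colour:black",
    "preta": "colour:black",
}
variant_groups = {
    "desporto": ["desporto", "desportivo", "desportiva"],
    "preto": ["preto", "preta"],
}

print(sorted(find_drift(variant_groups, tag_of)))  # ['desporto']
```

A group in which every form is untagged is not reported here, because all members agree on "no tag"; that case belongs to the missing-vocabulary shape and needs its own check.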

The remediation path¶

The parity signal identifies which NER operation is incomplete. It points at the layer that needs work rather than at the specific fix:

Lemmatisation drift → update lemmatisation rules / stemmer dictionary for the target language.
Ambiguous-term collision → add disambiguation logic / context-aware tagging.
Missing vocabulary → the check "determines whether to index missing terms for searchability"; such terms need to be added to both the NER dictionary and the searchable catalogue.
Multi-word term unrecognised → multi-word-entity recognition needs extending for the target language.
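To make the multi-word case concrete, the sketch below uses a greedy longest-match dictionary lookup. This is a swapped-in illustrative technique, not a description of Zalando's NER engine; the `tag_query` function, the vocabulary entries, and the tag strings are all hypothetical.

```python
def tag_query(query, vocabulary, max_words=3):
    """Tag a query by greedy longest-match lookup against a term dictionary.

    Returns (tags, unmatched_tokens). At each position the longest phrase of
    up to max_words tokens is tried first, so a multi-word entry wins over
    any single-word entries it contains.
    """
    tokens = query.lower().split()
    tags, unmatched = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                tags.append(vocabulary[phrase])
                i += n
                break
        else:
            # No phrase starting at this token is known to the dictionary.
            unmatched.append(tokens[i])
            i += 1
    return tags, unmatched


# Hypothetical vocabulary before and after adding the multi-word entry.
before = {"preto": "colour:black"}
after = {**before, "fato de treino": "category:tracksuit"}

print(tag_query("fato de treino preto", before))
# (['colour:black'], ['fato', 'de', 'treino'])
print(tag_query("fato de treino preto", after))
# (['category:tracksuit', 'colour:black'], [])
```

With the original vocabulary the tracksuit term falls apart into three unmatched tokens, so the translated query loses its category tag and the parity diff flags it; adding the phrase as a single entry restores the tag.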

Complementary to (not redundant with) relevance scoring¶

Both the NER-analyser and the LLM-as-judge are needed. A tag mismatch doesn't by itself prove relevance collapsed (the ranker might recover); a relevance collapse doesn't by itself prove NER is the cause (the catalogue might be missing products). The two signals together localise the defect.

Seen in¶

sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando runs an NER-analyser task producing cross-language tag diffs in parallel with the LLM-judge relevance evaluation.

concepts/translated-query-parity — the invariant this operationalises.
concepts/ner-clustered-query-sampling — the upstream clustering that produces the scenario identity.
systems/zalando-ner-query-builder
patterns/translated-query-ner-parity-check
companies/zalando