PATTERN
Translated-query NER-parity Check¶
Intent¶
Validate that translated search queries preserve intent across languages by running the NER engine against both the source and translated queries and diffing the extracted tag sets. Non-matching tag sets flag NER-vocabulary / lemmatisation / ambiguity issues in the target language that would otherwise silently degrade search quality.
The pattern is a diagnostic sidecar to LLM-as-judge for search quality — the judge measures relevance collapse; the parity check localises the cause to NER specifically.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Structure¶
Source query: "Kids Winter Jacket"
NER tags: {category: kids, type: jacket, season: winter}
LLM-translate to PT:
Translated: "Jaqueta de Inverno Infantil"
NER tags: {category: ?, type: jaqueta, season: ?}
▲
│
└─ MISSING tags → flag
Parity diff report:
category: kids → (missing in PT) — NER vocabulary gap
season: winter → (missing in PT) — NER vocabulary gap
type: jacket → type: jaqueta — present (translated)
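The diff above can be sketched as a small comparison over flat tag dicts. This is a minimal illustration, not Zalando's implementation; the function name `parity_diff` and the dict-shaped NER output are assumptions.

```python
def parity_diff(source_tags: dict, translated_tags: dict) -> dict:
    """Diff the NER tag sets of a source query and its translation."""
    report = {}
    for attr, value in source_tags.items():
        if attr not in translated_tags:
            # Attribute tagged in the source but not the translation.
            report[attr] = f"{value} -> (missing) -- NER vocabulary gap"
        else:
            report[attr] = f"{value} -> {translated_tags[attr]} -- present"
    for attr, value in translated_tags.items():
        if attr not in source_tags:
            # Attribute tagged only in the translation is also suspicious.
            report[attr] = f"(absent) -> {value} -- extra tag"
    return report

source = {"category": "kids", "type": "jacket", "season": "winter"}
translated = {"type": "jaqueta"}
report = parity_diff(source, translated)
for attr, line in report.items():
    print(f"{attr}: {line}")
```

Any non-empty set of "missing" or "extra" entries is what the pattern flags for investigation.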
Where it lives in the pipeline¶
As an Airflow task in parallel with the LLM-judge relevance-scoring stage:
TaskGroup: market=PT
├─ generate test queries
├─ retrieve search results
├─ LLM judge (relevance scores) ◄─ primary signal
├─ NER analyser (parity check) ◄─ diagnostic signal
└─ report (joined)
Zalando names it directly: "During execution, each search query is processed by the NER engine to extract its NER tag attributes. This allows us to compare the NER tags of the original search query and the translated search query, and identify inconsistencies that can lead to search issues, such as missing tags or incorrectly tagged attributes in the new language."
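The parallel wiring can be sketched in plain Python, without the Airflow dependency: the LLM judge and the NER parity check consume the same queries independently, and the report joins both signals. All function names here are illustrative stubs, not the source's API.

```python
def run_market_evaluation(queries, judge, ner, translate):
    """Join the primary (relevance) and diagnostic (parity) signals per query."""
    report = []
    for q in queries:
        t = translate(q)
        report.append({
            "query": q,
            "relevance": judge(t),          # primary signal (LLM judge)
            "parity_ok": ner(q) == ner(t),  # diagnostic signal (NER parity)
        })
    return report

# Toy stubs standing in for the real pipeline stages.
toy_ner = lambda q: frozenset(q.lower().split())  # word set as fake tag set
rows = run_market_evaluation(
    queries=["kids winter jacket"],
    judge=lambda q: 0.4,                     # pretend relevance score
    ner=toy_ner,
    translate=lambda q: "jaqueta de inverno",
)
```

In the real pipeline these would be parallel Airflow tasks per market; the point of the sketch is only that the parity check is side-by-side with, not downstream of, the judge.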
Why it's a separate signal from relevance¶
Low relevance alone is ambiguous — it could be NER, it could be catalogue data, it could be ranker behaviour. Tag-parity violation isolates the NER cause specifically:
- Low relevance + tag parity holds → investigate ranker or catalogue.
- Low relevance + tag parity violated → investigate NER vocabulary / lemmatisation / disambiguation.
- High relevance + tag parity violated → unusual but possible (different tags happen to retrieve equivalent results); still worth fixing to prevent future regressions.
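The three triage branches above reduce to a small decision function. The boolean inputs and return labels are illustrative assumptions, not terminology from the source.

```python
def triage(relevance_low: bool, parity_violated: bool) -> str:
    """Route a flagged query to the likely cause, per the triage table."""
    if parity_violated:
        # Covers both the low- and high-relevance cases: fix NER either
        # way, since a parity violation localises the cause to NER.
        return "investigate NER vocabulary / lemmatisation / disambiguation"
    if relevance_low:
        # Parity holds, so NER is exonerated; look elsewhere.
        return "investigate ranker or catalogue"
    return "no action"
```

The value of the second signal is exactly this routing: without it, every low-relevance query starts as an undifferentiated investigation.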
Disclosed example shapes (Portuguese)¶
| Violation shape | Example | Remediation |
|---|---|---|
| Lemmatisation drift | "desporto", "desportivo", "desportiva" → distinct tags | Update target-lang lemmatiser |
| Ambiguous collision | "tenis" (sneaker) vs. tennis (the sport) | Disambiguation / context-aware tagging |
| Missing vocabulary | "menina", "meninas" | Add to NER dictionary |
| Multi-word term | "fato de treino" → 3 isolated tokens | Extend multi-word-entity recognition |
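The lemmatisation-drift row can be made concrete with a toy lemmatiser: before the fix the three surface forms produce three distinct tags (which the parity check would flag); after, they collapse to one canonical tag. The mappings are invented for illustration.

```python
# Toy surface-form -> tag mappings; "desporto" is the canonical lemma.
DRIFTED = {"desporto": "desporto", "desportivo": "desportivo", "desportiva": "desportiva"}
FIXED   = {"desporto": "desporto", "desportivo": "desporto",   "desportiva": "desporto"}

def tag_set(lemmatiser: dict, words: list) -> set:
    """Tag each word via the lemmatiser and collect the distinct tags."""
    return {lemmatiser[w] for w in words}

variants = ["desporto", "desportivo", "desportiva"]
drifted_tags = tag_set(DRIFTED, variants)  # three tags: parity would flag
fixed_tags = tag_set(FIXED, variants)      # one canonical tag
```

The other three rows follow the same shape: each remediation changes the target-language mapping so that the translated query's tag set matches the source's again.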
Pre-requisites¶
- Source-language NER is trusted. Parity comparison assumes the source tag set is correct. Validate it before running the check.
- Translation preserves intent at scenario level. Running LLM translation on NER-clustered scenarios is what makes intent preservation plausible.
- NER engine runs on both languages. The engine must support target-language processing at minimum; the parity check reveals what it doesn't yet cover.
Limitations¶
- Tag mismatch doesn't fully diagnose. The check flags that NER disagrees; it doesn't tell you which tag is wrong — source could be right and target wrong, or vice versa, or both wrong in different ways.
- Under-tagging of source is silent. If the source-language NER already misses an attribute, the parity check can't see the absence.
- Noisy translations propagate as tag drift. A bad LLM translation may produce parity failures that are really translation failures, not NER failures.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; Zalando runs an NER-analyser Airflow task in parallel with the LLM-judge evaluation.