
CONCEPT Cited by 1 source

NER-tag Parity Across Languages

Definition

NER-tag parity across languages is the operational realisation of the translated-query-parity invariant: run the same NER engine against both the source-language query and its target-language translation, then diff the extracted NER-tag sets. Matching tag sets indicate that the translation preserved intent and that target-language NER covered it. Diverging tag sets indicate that at least one of the two is broken.
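As a minimal sketch of the diff described above, the following Python compares the tag sets a single tagger extracts from a source query and its translation. The `toy_tagger`, its vocabulary, and the tag names are hypothetical stand-ins for illustration only; the source does not disclose the real NER engine or its tag schema.

```python
from typing import Callable, FrozenSet, NamedTuple

Tag = str
Tagger = Callable[[str], FrozenSet[Tag]]


class TagDiff(NamedTuple):
    """Tags extracted from only one side of a translated query pair."""
    only_in_source: FrozenSet[Tag]
    only_in_target: FrozenSet[Tag]

    @property
    def parity(self) -> bool:
        # Parity holds when neither side has a tag the other lacks.
        return not self.only_in_source and not self.only_in_target


def diff_tags(tagger: Tagger, source_query: str, target_query: str) -> TagDiff:
    """Run the same tagger on both queries and diff the resulting tag sets."""
    src = tagger(source_query)
    tgt = tagger(target_query)
    return TagDiff(only_in_source=src - tgt, only_in_target=tgt - src)


# Hypothetical single-token vocabulary. It knows the English terms but has
# no Portuguese entries, so the translated query yields no tags.
TOY_VOCAB = {
    "girls": "gender:girls",
    "tracksuit": "category:tracksuit",
}


def toy_tagger(query: str) -> FrozenSet[Tag]:
    """Toy stand-in for an NER engine: whitespace tokens looked up in a dict."""
    return frozenset(TOY_VOCAB[t] for t in query.lower().split() if t in TOY_VOCAB)


diff = diff_tags(toy_tagger, "girls tracksuit", "fato de treino meninas")
print(diff.parity)                  # False
print(sorted(diff.only_in_source))  # ['category:tracksuit', 'gender:girls']
```

Keeping the two one-sided differences separate, instead of a single symmetric difference, shows on which side of the pair the tags went missing.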

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The sidecar task

In Zalando's pipeline this is a distinct Airflow task — an "NER analyzer task in the Airflow DAG" — that runs in parallel with relevance scoring rather than as a prerequisite. Its output is a per-scenario tag-diff report, consumed as a first-class diagnostic signal alongside the LLM-judge segment-level relevance aggregates.

Separating the parity check from the relevance check is deliberate. Low relevance with tag parity points to a ranker or product-data issue; low relevance with a tag mismatch points to an NER-vocabulary issue.
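The two-signal triage can be sketched as a small decision function. The 0-to-1 relevance scale, the `threshold` default, and the `Diagnosis` names are assumptions made for illustration; the source does not state how relevance is scored or where the cut-off lies.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "no defect indicated"
    RANKER_OR_PRODUCT_DATA = "ranker / product-data issue"
    NER_VOCABULARY = "NER-vocabulary issue"
    MISMATCH_ONLY = "tag mismatch, relevance not degraded"


def triage(relevance: float, tags_match: bool, threshold: float = 0.5) -> Diagnosis:
    """Combine the relevance signal and the parity signal into one diagnosis.

    `relevance` is assumed to lie in [0, 1]; `threshold` is a hypothetical
    cut-off below which a scenario counts as low relevance.
    """
    low_relevance = relevance < threshold
    if low_relevance and tags_match:
        return Diagnosis.RANKER_OR_PRODUCT_DATA
    if low_relevance:
        return Diagnosis.NER_VOCABULARY
    if not tags_match:
        # Tags diverged but results still scored as relevant.
        return Diagnosis.MISMATCH_ONLY
    return Diagnosis.HEALTHY


print(triage(0.2, tags_match=True).value)   # ranker / product-data issue
print(triage(0.2, tags_match=False).value)  # NER-vocabulary issue
```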

Disclosed violation shapes

Shape | Example (PT) | Effect
Lemmatisation drift | "desporto", "desportivo", "desportiva" each receive a different tag | Inconsistent filters across paraphrase queries
Ambiguous-term collision | "tenis" / "ténis" (sneaker) vs tennis the sport | Term unrecognised; sport-shoes scenario degrades
Missing vocabulary | "menina", "meninas" (girl) | Mixed-gender result sets
Multi-word term unrecognised | "fato de treino" (tracksuit) | Zero sport/tracksuit results
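To illustrate the first shape, lemmatisation drift can be detected by tagging a group of inflected variants and checking whether they all receive the same tag set. This is a sketch under assumptions: the tag names in `HYPOTHETICAL_TAGS` are invented to reproduce the drift described above and are not the tags the real engine emits.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def tag_sets_by_variant(
    tagger: Callable[[str], FrozenSet[str]], variants: Iterable[str]
) -> Dict[str, FrozenSet[str]]:
    """Tag each variant of a term and return the tag set per variant."""
    return {variant: tagger(variant) for variant in variants}


def has_drift(tagged: Dict[str, FrozenSet[str]]) -> bool:
    """Drift means the variants do not all map to one identical tag set."""
    return len(set(tagged.values())) > 1


# Invented tags that reproduce the drift: three forms, three different tags.
HYPOTHETICAL_TAGS = {
    "desporto": "sport",
    "desportivo": "sporty-m",
    "desportiva": "sporty-f",
}


def toy_tagger(term: str) -> FrozenSet[str]:
    """Toy stand-in for an NER engine: exact-match dictionary lookup."""
    tag = HYPOTHETICAL_TAGS.get(term)
    return frozenset([tag]) if tag else frozenset()


tagged = tag_sets_by_variant(toy_tagger, ["desporto", "desportivo", "desportiva"])
print(has_drift(tagged))  # True
```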

The remediation path

The parity signal identifies which NER operation is incomplete. It points at the layer that needs work rather than at the fix itself:

  • Lemmatisation drift → update lemmatisation rules / stemmer dictionary for the target language.
  • Ambiguous-term collision → add disambiguation logic / context-aware tagging.
  • Missing vocabulary → the source describes a step that "determines whether to index missing terms for searchability"; the terms need to be added to both the NER dictionary and the searchable catalogue.
  • Multi-word term unrecognised → multi-word-entity recognition needs extending for the target language.
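The shape-to-layer mapping in the list above can be kept as a plain lookup table, so that each entry in a tag-diff report carries the layer to investigate. The `Shape` enum and `remediation_for` helper are hypothetical names for this sketch; only the mapping itself comes from the list.

```python
from enum import Enum


class Shape(Enum):
    LEMMATISATION_DRIFT = "lemmatisation drift"
    AMBIGUOUS_TERM_COLLISION = "ambiguous-term collision"
    MISSING_VOCABULARY = "missing vocabulary"
    MULTI_WORD_UNRECOGNISED = "multi-word term unrecognised"


# Violation shape -> layer to work on, paraphrasing the list above.
REMEDIATION_LAYER = {
    Shape.LEMMATISATION_DRIFT: "update lemmatisation rules / stemmer dictionary",
    Shape.AMBIGUOUS_TERM_COLLISION: "add disambiguation logic / context-aware tagging",
    Shape.MISSING_VOCABULARY: "add terms to the NER dictionary and the searchable catalogue",
    Shape.MULTI_WORD_UNRECOGNISED: "extend multi-word-entity recognition",
}


def remediation_for(shape: Shape) -> str:
    """Return the layer-level remediation hint for a violation shape."""
    return REMEDIATION_LAYER[shape]


print(remediation_for(Shape.MULTI_WORD_UNRECOGNISED))
# extend multi-word-entity recognition
```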

Complementary to (not redundant with) relevance scoring

Both the NER-analyser and the LLM-as-judge are needed. A tag mismatch doesn't by itself prove relevance collapsed (the ranker might recover); a relevance collapse doesn't by itself prove NER is the cause (the catalogue might be missing products). The two signals together localise the defect.
