
CONCEPT

Translated Query Parity

Definition

Translated query parity is the invariant that a search query translated into another language should carry the same search intent as its source — expressed concretely as: the NER-tag set extracted from a translated query should match the NER-tag set extracted from its source-language original.

"The same query should have the same NER tags by meaning across different languages." (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The useful part is the violation signal: when tag sets diverge, something is wrong — typically NER-vocabulary gaps, lemmatisation drift, or collision with other words in the target language.
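The invariant can be stated as a one-line set equality. A minimal sketch, using a toy dictionary stand-in for the real NER (the function names, tag format, and vocabulary entries here are illustrative assumptions, not Zalando's actual system):

```python
# Toy dictionary NER for illustration only; a real NER model sits here.
# Tags are (attribute, value) pairs, e.g. ("CATEGORY", "sport").
VOCAB = {
    ("sport shoes", "en"): {("CATEGORY", "shoe"), ("CATEGORY", "sport")},
    ("sapatilhas de desporto", "pt"): {("CATEGORY", "shoe"), ("CATEGORY", "sport")},
    ("tracksuit", "en"): {("CATEGORY", "tracksuit")},
    ("fato de treino", "pt"): set(),  # multi-word term the NER fails to recognise
}

def extract_ner_tags(query: str, lang: str) -> set:
    """Stand-in NER: return the tag set extracted from a query."""
    return VOCAB.get((query, lang), set())

def parity_holds(src_query: str, src_lang: str, tgt_query: str, tgt_lang: str) -> bool:
    """Translated query parity: translated query must yield the same tag set."""
    return extract_ner_tags(src_query, src_lang) == extract_ner_tags(tgt_query, tgt_lang)
```

With this toy vocabulary, `parity_holds("sport shoes", "en", "sapatilhas de desporto", "pt")` passes, while the "fato de treino" scenario fails parity, reproducing violation shape 4 below.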

Violation taxonomy (disclosed)

Zalando's Portuguese / Greek launches surfaced four shapes:

  1. Lemmatisation drift. Portuguese "desporto", "desportivo", "desportiva" (sport, sport-adj-masc, sport-adj-fem) should collapse to the same CATEGORY=sport tag via lemmatisation. Zalando's NER didn't lemmatise them consistently → "Queries with 'desporto', 'desportivo', 'desportiva' did not have consistent term filters due to word lemmatization issues."
  2. Ambiguous translation collision. Portuguese "tenis", "ténis" mean sneaker but collide with tennis the sport. "Term 'tenis', 'ténis' (sneaker in portuguese) could not be recognized and did not discover sport shoes in general, due to an ambiguity with sport 'tennis'." — the collision left the term effectively unrecognised.
  3. Missing vocabulary. Portuguese "menina", "meninas" (girl, girls) were not in the NER dictionary → "Term 'menina', 'meninas' could not be recognized so searching for girl articles returned mixed results from any genders and age groups."
  4. Multi-word term unrecognised. Portuguese "fato de treino" (tracksuit) is a three-word construction the NER didn't handle as one entity → "Searching for tracksuit 'fato de treino' did not show any sport or tracksuit results."

All four present as low relevance scores in the downstream LLM judge — the violation is diagnosed as an NER problem, not a relevance problem, by the NER-tag-diff sidecar.

Why it's useful as a signal

NER-vocabulary bugs are hard to find by reading production NER outputs alone — the NER returns some tag set, not obviously wrong unless compared to what it should have returned. Translated-query-parity gives the signal a reference point: what the source-language tag set was. The diff is unambiguous — "this tag should have appeared in the translated output" or "this tag appeared but shouldn't have".
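The unambiguous diff described above is just a directional set difference against the source-language reference. A hedged sketch (the dict keys and tag tuples are illustrative assumptions):

```python
def tag_diff(source_tags: set, translated_tags: set) -> dict:
    """Directional diff against the source-language reference tag set."""
    return {
        "missing": source_tags - translated_tags,   # should have appeared, didn't
        "spurious": translated_tags - source_tags,  # appeared, but shouldn't have
    }

# The "ténis" collision shape: the sneaker tag goes missing and a
# sport-tennis tag appears instead.
diff = tag_diff(
    {("CATEGORY", "sneaker")},
    {("SPORT", "tennis")},
)
```

Here `diff["missing"]` names the tag the translation should have carried, and `diff["spurious"]` names the collision artefact, which is exactly the reference-point advantage over inspecting the translated output alone.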

Machinery that makes the signal tractable

  • Scenarios, not queries, are the unit of equivalence. The source identity is the NER-tag set on the existing-market queries; the target-language translations inherit that identity by construction.
  • Paraphrase members provide redundancy. Each scenario has multiple paraphrase queries in both languages; partial-match (some paraphrases recover, some don't) surfaces which specific word form broke lemmatisation.
  • LLM translation preserves the intent, not the syntax. A professional-translation or rule-based approach would be brittle; LLM translation keeps intent stable across paraphrases within a scenario. See patterns/translated-query-ner-parity-check.
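The paraphrase-redundancy point can be sketched as a per-paraphrase parity check against the scenario's reference tag set. This structure (scenario as reference tag set plus paraphrase queries) is an assumed simplification of the machinery described above:

```python
def paraphrase_parity(reference_tags: set, tags_by_paraphrase: dict) -> dict:
    """Check each paraphrase's extracted tags against the scenario's
    reference tag set; a partial-match result localizes the broken form."""
    return {q: tags == reference_tags for q, tags in tags_by_paraphrase.items()}

# Lemmatisation-drift shape: the noun form resolves, the adjective forms don't.
reference = {("CATEGORY", "sport")}
results = paraphrase_parity(reference, {
    "desporto": {("CATEGORY", "sport")},  # lemmatised to the reference tag
    "desportivo": set(),                  # drift: adjective form yields no tag
    "desportiva": set(),
})
```

A partial match like this ("desporto" passes, the adjective forms fail) points at lemmatisation of specific word forms rather than a wholly missing vocabulary entry, which would fail every paraphrase.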

Limitation

Parity is necessary but not sufficient. Even a tag-parity-passing translation can produce low-relevance results if:

  • The product catalogue for the new market is sparse on the scenario (disclosed failure class 3: undiscoverable products).
  • The ranker is under-trained on the new language's synonym patterns.
  • The product-data attributes in the new market are incorrect (disclosed failure class 1).

The LLM judge is still needed to detect these; translated-query parity only rules out the NER-vocabulary cause.
