Visual-text Relevance Judgment
Definition
Visual-text relevance judgment is the LLM-as-judge shape
that scores (query, product) relevance using both product
data and product images as evaluation context — the judge is
multi-modal (visual-text) and scores against a generalised
rubric rather than per-attribute prompts.
"Our LLM-as-a-judge uses product data and product images for its evaluation context (visual-text). It generalises well across different languages and different search contexts, e.g. by searching 'Kids Winter Jacket', the model should give high relevance scores to search results with jacket products of any brands, any colours, etc. from kids categories, according to the product attributes or their images. Search results that are just long-sleeve shirts, or adult items, should score lower. The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images." (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
The axis it's defined against
In the LLM-as-judge design space, three design choices are relevant here:
- Input modality. Text-only vs multi-modal (text + image / video / audio).
- Rubric specificity. Per-attribute prompts ("check the neckline", "check the colour", "check the size") vs generalised rubric ("rate 0–4 by overall relevance").
- Output scale. Binary vs graded.
Visual-text relevance judgment uses:
- Multi-modal input. Product images + product data.
- Generalised rubric. Clear 0–4 scale; no per-attribute specialisation.
- Graded output. 0–4 per result item.
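The three choices above are easiest to see in the shape of the request payload itself. A minimal sketch, assuming an OpenAI-style chat-completions message format; the rubric wording, helper name, and example values are illustrative, not Zalando's actual prompt:

```python
def build_judge_request(query: str, product: dict, image_url: str) -> dict:
    """Assemble one visual-text judge call: a generalised 0-4 relevance
    rubric plus the product data and product image -- no per-attribute
    instructions anywhere in the prompt."""
    rubric = (
        "Rate how relevant the product is to the search query on a 0-4 "
        "scale (0 = irrelevant, 4 = perfectly relevant). Use the product "
        "data and the product image together. Reply with a single integer."
    )
    product_text = "\n".join(f"{k}: {v}" for k, v in product.items())
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": rubric},
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Query: {query}\n\nProduct data:\n{product_text}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    }

# Example payload for the post's running query (hypothetical product/URL).
req = build_judge_request(
    "Kids Winter Jacket",
    {"title": "Puffer Jacket", "category": "Kids > Outerwear"},
    "https://example.com/jacket.jpg",
)
```

Note that the rubric never mentions jackets, seasons, or age groups; the query string is the only feature instruction the judge receives.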
Why the grounding-via-image works
Product images are the authoritative depiction of what the user will see on the PLP/PDP (product listing / product detail page). Description attributes and category tags can lag or be inconsistent across sellers; the image is what ends up on screen. A visual-text judge scoring "does this image match the query intent" is therefore checking the proximate representation of relevance, not a text-derivable proxy.
This lets the judge reason about features that are not surfaced as NER-extractable attributes — fit, style, visual similarity to the query's implicit aesthetic — that would otherwise require structured attribute-level rubrics to evaluate.
What "generalised reasoning" means
The claim is load-bearing: a single rubric prompt scores relevance regardless of scenario, category, brand, or language. The judge isn't told "for 'Kids Winter Jacket' queries, check that the result is a jacket, check the season tag is winter, check the age-group is kids". Instead, the query + 0–4 rubric are all the instruction; the judge infers what features matter from the query itself.
Tradeoff: accuracy on edge cases is untunable per-attribute. If fashion-specific neckline variants consistently confuse the judge (a known GPT-4o weakness on fine-grained fashion vocabulary — see sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding), the generalised rubric has no obvious knob to fix it short of per-criterion judges à la Netflix Synopsis Judge.
Model used
The production judge is GPT-4o, used during Zalando's 2025 pre-market-launch process. Multi-modal inputs (image tokens + text tokens) make each call expensive; the post quotes ~$250 per full run, with GPT-4o completions as the dominant cost driver.
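For intuition on how a per-run figure like that decomposes, a back-of-envelope cost model; the pair count, token counts, and per-million-token prices below are illustrative assumptions, not numbers from the post:

```python
def estimate_run_cost(n_pairs: int,
                      input_tokens_per_call: int,
                      output_tokens_per_call: int,
                      price_in_per_m: float = 2.50,
                      price_out_per_m: float = 10.00) -> float:
    """Total USD for one full evaluation run, assuming flat
    per-million-token pricing (default rates are illustrative)."""
    per_call = (input_tokens_per_call * price_in_per_m
                + output_tokens_per_call * price_out_per_m) / 1_000_000
    return n_pairs * per_call

# Hypothetical run: 100k (query, product) pairs, image-heavy input side.
cost = estimate_run_cost(100_000, input_tokens_per_call=800,
                         output_tokens_per_call=50)
```

With image tokens inflating the input side, the input term dominates each call even when output pricing per token is higher.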
Contrast with per-criterion judges
Netflix Synopsis Judge uses dedicated per-criterion judges (tone / clarity / precision / factuality) because single-prompt multi-criterion judging "overloaded the LLM". Zalando's single-rubric approach works in search-relevance partly because the criterion is relevance — a 0–4 Likert on one axis, not four orthogonal quality dimensions. The domains select different judge shapes.
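The two judge shapes can be contrasted as prompt constructors. A schematic sketch with hypothetical wording for both; neither string is the actual production prompt:

```python
# Per-criterion shape (Netflix Synopsis Judge style): one dedicated
# prompt per quality axis, each scored independently.
CRITERIA = ["tone", "clarity", "precision", "factuality"]

def per_criterion_prompts(synopsis: str) -> list[str]:
    return [f"Rate the {c} of this synopsis on a 0-4 scale:\n{synopsis}"
            for c in CRITERIA]

# Single-rubric shape (Zalando style): relevance is one axis, so one
# generalised prompt covers every query, category, and language.
def single_rubric_prompt(query: str, product_summary: str) -> str:
    return (f"Rate on a 0-4 scale how relevant this product is to the "
            f"query '{query}':\n{product_summary}")
```

The per-criterion shape pays four calls per item but can tune each axis independently; the single-rubric shape pays one call and relies on the model inferring what matters from the query.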
Seen in
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance; GPT-4o visual-text judge over (query, product) pairs in Zalando's pre-launch search QA.
Related
- concepts/llm-as-judge — parent.
- concepts/multi-modal-attribute-extraction — the Zalando catalogue-attribute sibling that uses VLMs for attribute extraction from product images, not relevance judgment.
- systems/gpt-4o — the canonical multi-modal model.
- systems/zalando-search-quality-framework
- patterns/llm-as-judge-for-search-quality
- companies/zalando