title: "Somewhat-relevant" evaluation category
type: concept
created: 2026-04-26
updated: 2026-04-26
tags: [search, evaluation, relevance-grading, llm-as-judge, llm-judge, graded-relevance, rubric, semantic-search]
sources: [2026-04-21-meta-modernizing-facebook-groups-search]
related: [concepts/llm-as-judge, concepts/relevance-labeling, concepts/binary-vs-graded-llm-scoring, patterns/llm-judge-in-build-verification-test, patterns/llm-as-judge-multi-level-rubric, systems/meta-groups-scoped-search]
"Somewhat-relevant" evaluation category¶
## Definition
A "somewhat relevant" evaluation category is the distinctive middle-tier rubric label used in LLM-as-judge grading of semantic search results, covering the case where a query and a retrieved document share a common domain or theme but do not have an exact-term or exact-intent match.
Meta's canonical definition from the 2026-04-21 Meta Engineering post:
"Unlike binary 'good/bad' labels, our evaluation prompts are designed to detect nuance. We explicitly programmed the system to recognize a 'somewhat relevant' category, defined as cases where the query and result share a common domain or theme (e.g., different sports are still relevant in a general sports context). This allows us to measure improvements in result diversity and conceptual matching."
## Why the middle tier matters
Binary "good/bad" rubrics collapse signal that a graded rubric preserves:
- A query for "swimming lane etiquette" that returns a post about water polo is not an exact match — but it's not "bad" either. Both are about pool behaviour; the swimmer probably gets value.
- Treated as "bad" in a binary rubric, such results count against the retrieval system and the MTML ranker will learn to suppress them.
- Treated as "somewhat relevant" in a graded rubric, they count as positive signal for diversity and conceptual matching — which is what a hybrid retrieval system is supposed to enable.
The move from keyword search to hybrid retrieval is what creates the need for this middle tier: semantic retrieval deliberately surfaces candidates that are not exact matches, and the evaluation rubric has to reward that behaviour rather than penalise the new retrieval arm.
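A toy sketch of the collapsed signal described in the list above. The label names and the 0.5 gain assigned to the middle tier are assumed values for illustration, not numbers from the post:

```python
# Contrast a binary gain mapping with a graded one over the same judgments.
BINARY_GAIN = {"relevant": 1.0, "somewhat_relevant": 0.0, "not_relevant": 0.0}
GRADED_GAIN = {"relevant": 1.0, "somewhat_relevant": 0.5, "not_relevant": 0.0}


def mean_gain(labels: list[str], gain: dict[str, float]) -> float:
    """Average gain over a set of judged results."""
    return sum(gain[label] for label in labels) / len(labels)


# Four judged results for a "swimming lane etiquette" query: one exact match,
# two on-theme pool-behaviour posts (e.g. water polo), one miss.
labels = ["relevant", "somewhat_relevant", "somewhat_relevant", "not_relevant"]

print(mean_gain(labels, BINARY_GAIN))  # 0.25: on-theme results count as failures
print(mean_gain(labels, GRADED_GAIN))  # 0.5: on-theme results count as partial wins
```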
## Relation to graded-relevance traditions
Graded-relevance rubrics are not new; the IR literature has long used rubrics like highly relevant / relevant / marginally relevant / irrelevant (e.g. TREC). The novelty in the Meta post is not the concept of graded relevance itself but its explicit encoding, as a first-class primitive, into the LLM-judge prompt of a production search system. See concepts/binary-vs-graded-llm-scoring for the broader debate; see patterns/llm-as-judge-multi-level-rubric for the Zalando instance.
## Operational consequences
- Diversity metrics become measurable. "How often does the system return results that share a common domain or theme with the query?" is a well-defined question once "somewhat relevant" exists (see the sketch after this list).
- Ranker training signal is less punishing. Off-exact but on-theme candidates aren't labelled as negatives.
- LLM-judge prompt complexity rises. The judge must reason about "domain or theme" overlap, not just exact match — which is exactly why multimodal Llama 3 is chosen for the judge role.
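A sketch of the diversity measurement the first bullet describes, assuming the same hypothetical label names as the rubric sketch above:

```python
from collections import Counter


def somewhat_relevant_rate(judged_labels: list[str]) -> float:
    """Share of results judged on-theme without an exact match."""
    counts = Counter(judged_labels)
    return counts["somewhat_relevant"] / len(judged_labels)


print(somewhat_relevant_rate(
    ["relevant", "somewhat_relevant", "not_relevant", "somewhat_relevant"]
))  # 0.5
```

Tracked over time, a rising rate can indicate broader conceptual matching from the semantic arm; interpreting it requires the companion precision labels, since the middle tier alone cannot distinguish useful breadth from drift.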