
(query, product) Evaluation Cache

Definition

The (query, product) evaluation cache is the deduplication pattern scoped to an offline evaluation pipeline: cache every (query_id, product_id) pair's product fetch and LLM-judge score so a given pair is evaluated at most once per run, regardless of how many queries it appears under.

The naive computation is O(queries × results_per_query) expensive operations. The cached computation is O(|unique products|) product fetches plus O(|unique (query, product) pairs|) judge calls — both of which, for a shared catalogue, grow much more gently than query volume.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The cost collapse

Zalando states the arithmetic directly:

"The search results of different search queries may share the same products and letting the LLM judge to retrieve the same product data and images multiple times would be inefficient and slow. Therefore we put a shared cache (Elasticache) only accessible to the evaluation tasks to store and re-use the product data. This saves time and cost for the evaluation significantly. Instead of calling Product API (5000 x 25) times for 5000 search queries with 25 results, we only need to call it N times where N is the number of unique products in all search results. This N does not scale as much as the number of search queries increases. We also store evaluation results of each (query, product) pair in the cache, so that it reuses the previously evaluated results if the same (query, product) pair appears in other search queries, which further saves time and LLM cost." (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Two independent cache layers:

  1. Product-fetch cache keyed by product_id — collapses duplicate product-API calls.
  2. Score cache keyed by (query_id, product_id) — collapses duplicate GPT-4o calls when the same query is evaluated more than once or when a previous run is replayed.
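
A minimal in-process sketch of the two layers (Zalando uses Elasticache; the dicts and callables here are illustrative stand-ins, not their implementation):

```python
from typing import Any, Callable

class EvalCache:
    """Two independent fill-on-miss layers: product fetches keyed by
    product_id, judge scores keyed by the (query_id, product_id) pair."""

    def __init__(self, fetch_product: Callable, judge: Callable):
        self._fetch_product = fetch_product          # expensive Product API call
        self._judge = judge                          # expensive LLM-judge call
        self._products: dict[Any, Any] = {}          # product_id -> product data
        self._scores: dict[tuple, float] = {}        # (query_id, product_id) -> score

    def product(self, product_id):
        # Layer 1: collapse duplicate Product API calls.
        if product_id not in self._products:
            self._products[product_id] = self._fetch_product(product_id)
        return self._products[product_id]

    def score(self, query_id, product_id):
        # Layer 2: keyed by the *pair*, since relevance depends on the
        # query, not the product alone.
        key = (query_id, product_id)
        if key not in self._scores:
            self._scores[key] = self._judge(query_id, self.product(product_id))
        return self._scores[key]
```

Because `score` routes lookups through `product`, a pair-level miss still benefits from the product-fetch layer when another query has already pulled the same product.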

Why "(query, product)" and not just "product"

Relevance is query-dependent. A product that is highly relevant to "Kids Winter Jacket" may be irrelevant to "Nike Sneakers" — so the score cache must be keyed by the pair, not the product alone. The product-fetch cache can be keyed by product alone, since a product's data and images are query-independent.

Evaluation-scope isolation

The cache is "only accessible to the evaluation tasks" — not shared with production catalog-search caches. This matters for two reasons:

  • The evaluation must not warm or pollute production caches. Production search has its own cache layers (Catalog API, NER Query Builder, Base Search coordinator); an offline evaluation running mass queries against production should not skew their hit-rate statistics or evict their working sets.
  • The evaluation must see the production result set as it is. If the framework shared caches with production, it might read cached results that production would also return — defensible — but more importantly it could mask performance regressions that only surface on cache misses.

Second-order benefit: re-run cost collapse

Because the cache is persistent, re-runs of the same evaluation are near-free on the LLM side. Zalando's repeatability claim — "we can re-evaluate our search quality as many times as we want" — is economic, not just operational.

This makes the framework usable as a regression detector on already-live markets: the marginal cost of a daily re-run is just the new (query, product) pairs.

Failure mode

Cache invalidation remains the classical hazard. If the ranker changes such that the same query now returns different products, the old (query, product) entries still apply — but the new product set needs fresh evaluation. The pattern is cache-fill-on-miss, not cache-refresh-on-write, which means staleness = time since last full run, bounded by re-evaluation cadence.
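
The fill-on-miss economics can be sketched concretely (the result sets below are made up for illustration): a re-run after a ranker change pays LLM cost only for pairs absent from the cache.

```python
# Marginal LLM cost of a re-run under fill-on-miss caching:
# only (query, product) pairs not yet in the cache trigger judge calls.
run_1 = {"q1": ["p1", "p2"], "q2": ["p2", "p3"]}  # initial result sets
run_2 = {"q1": ["p1", "p4"], "q2": ["p2", "p3"]}  # ranker change: q1 now returns p4

def run_cost(results, cache):
    """Count fresh LLM-judge calls for one evaluation run."""
    misses = [(q, p) for q, ps in results.items() for p in ps
              if (q, p) not in cache]
    cache.update(misses)  # fill on miss; nothing is ever refreshed
    return len(misses)

cache = set()                 # previously scored (query, product) pairs
print(run_cost(run_1, cache))  # 4 judge calls on a cold cache
print(run_cost(run_2, cache))  # 1 judge call: only the new ("q1", "p4") pair
```

Note the flip side shown in the comment on `cache.update`: cached pairs like ("q2", "p2") are reused as-is, so a score computed against stale product data is never re-judged until the cache itself is invalidated.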
