PATTERN Cited by 1 source

(query, product) Evaluation Cache¶

Intent¶

Deduplicate the expensive operations of an offline (query, product) evaluation pipeline, namely product-data/image fetches and LLM-judge score emissions, by caching on product and (query, product) keys. Product fetches drop from O(queries × results_per_query) to O(|unique products|), and LLM-judge calls drop from O(queries × results_per_query) to O(|unique (q, p) pairs|).

The cache is scoped to the evaluation pipeline, not shared with production caches.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structure¶

Evaluation task N            Evaluation task M
(query_a → products [...])   (query_b → products [...])
            ▼                            ▼
    ┌───────────────────────────────────────────┐
    │   Evaluation cache  (e.g. ElastiCache)     │
    │                                             │
    │   key = product_id                          │
    │   val = product data + image bytes          │
    │                                             │
    │   key = (query_id, product_id)              │
    │   val = LLM-judge relevance score           │
    └───────────────────────────────────────────┘
                     ▲
              cache lookup before:
                - Product API call
                - GPT-4o completion call

Two cache spaces, populated independently on miss:

Product-fetch cache. Keyed by product_id. Shared across all queries — the same product can appear under thousands of scenarios and is fetched at most once per run.
Score cache. Keyed by (query_id, product_id). Score is query-dependent, so the pair is the key. Shared across markets running in parallel TaskGroups — if two markets happen to evaluate the same (q, p) pair (rare but possible), the second one hits the cache.
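
The two cache spaces and the lookup-before-call order can be sketched as below. This is a minimal illustration, not Zalando's implementation: a plain dict stands in for ElastiCache, and `EvaluationCache`, `fake_fetch`, and `fake_judge` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Tuple


class EvaluationCache:
    """In-memory stand-in for the evaluation cache (dicts instead of ElastiCache)."""

    def __init__(self) -> None:
        self.products: Dict[str, dict] = {}              # key = product_id
        self.scores: Dict[Tuple[str, str], float] = {}   # key = (query_id, product_id)

    def evaluate(
        self,
        query_id: str,
        product_id: str,
        fetch: Callable[[str], dict],
        judge: Callable[[str, dict], float],
    ) -> float:
        key = (query_id, product_id)
        if key in self.scores:
            return self.scores[key]          # score hit: no fetch, no judge call
        if product_id not in self.products:  # product miss: fetch once and keep it
            self.products[product_id] = fetch(product_id)
        score = judge(query_id, self.products[product_id])
        self.scores[key] = score             # fill on miss
        return score


# Hypothetical stand-ins that count how often the expensive calls run.
calls = {"fetch": 0, "judge": 0}


def fake_fetch(product_id: str) -> dict:
    calls["fetch"] += 1
    return {"id": product_id}


def fake_judge(query_id: str, product: dict) -> float:
    calls["judge"] += 1
    return 1.0


cache = EvaluationCache()
results: Dict[str, List[str]] = {"q1": ["p1", "p2"], "q2": ["p2", "p3"]}
for _ in range(2):  # evaluate the same result sets twice
    for query_id, product_ids in results.items():
        for product_id in product_ids:
            cache.evaluate(query_id, product_id, fake_fetch, fake_judge)
```

In this toy run the product `p2` appears under both queries but is fetched once, and the second pass over the same result sets triggers no new fetch or judge calls.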

Zalando's quantification¶

"Instead of calling Product API (5000 × 25) times for 5000 search queries with 25 results, we only need to call it N times where N is the number of unique products in all search results. This N does not scale as much as the number of search queries increases."

Without the cache, the run issues 5000 × 25 = 125,000 product fetches; with it, the count drops to N, the number of distinct products. The reduction ratio depends entirely on how much the result sets overlap: in a clothing catalogue, popular items appear under many queries, so the ratio is substantial.
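
To make the arithmetic concrete, the sketch below counts Product API calls with and without the product-fetch cache. The helper `product_api_calls` and the toy result sets are invented for illustration; only the 5000 × 25 figure comes from the source.

```python
from typing import Dict, List, Tuple


def product_api_calls(result_sets: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (calls without cache, calls with product-fetch cache)."""
    uncached = sum(len(products) for products in result_sets.values())
    cached = len({p for products in result_sets.values() for p in products})
    return uncached, cached


# Toy data: three overlapping result sets of three products each.
toy = {
    "black dress": ["p1", "p2", "p3"],
    "summer dress": ["p2", "p3", "p4"],
    "dress": ["p1", "p2", "p4"],
}
uncached, cached = product_api_calls(toy)

# The quoted scenario: 5000 queries with 25 results each.
quoted_uncached = 5000 * 25
```

Here the toy run needs 9 calls without the cache and 4 with it, because only four distinct products exist across the three result sets.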

For the score cache: "We also store evaluation results of each (query, product) pair in the cache, so that it reuses the previously evaluated results if the same (query, product) pair appears in other search queries, which further saves time and LLM cost."

Critical design decision: scope isolation¶

The cache is "only accessible to the evaluation tasks". It does not share storage with production catalog-search caches (Catalog API, NER Query Builder, Base Search coordinator).

Two reasons:

Avoid pollution. Offline evaluation issues queries in volumes and patterns that production traffic would not produce; warming production caches from them would skew production hit-rate statistics and evict production working sets.
Preserve observability. The evaluation must see production result sets as they are. Sharing caches would mask cache-miss-driven performance regressions the evaluation is trying to detect.

Invalidation discipline¶

The cache is fill-on-miss and is never refreshed on write. There is no explicit invalidation mechanism; staleness is bounded by the re-run cadence. For a pre-launch framework that runs daily, yesterday's (q, p) score is still valid unless the ranker changed.

When the ranker changes such that new products appear under existing queries, those new pairs are cache misses and get fresh scores. Old pairs that still appear reuse their cached scores, which is usually correct: a ranker change does not alter product-to-query relevance per se; it alters which products appear.
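
The re-run behaviour after a ranker change can be sketched as follows, assuming a persistent score cache modelled here as a dict. `judge_calls_for_run` and the two result sets are hypothetical; the placeholder judge returns a constant because only the call count matters.

```python
from typing import Callable, Dict, List, Tuple


def judge_calls_for_run(
    result_sets: Dict[str, List[str]],
    score_cache: Dict[Tuple[str, str], float],
    judge: Callable[[str, str], float],
) -> int:
    """Score every (query, product) pair, returning how many judge calls were made."""
    made = 0
    for query_id, product_ids in result_sets.items():
        for product_id in product_ids:
            key = (query_id, product_id)
            if key not in score_cache:   # fill on miss, never refresh
                score_cache[key] = judge(query_id, product_id)
                made += 1
    return made


def placeholder_judge(query_id: str, product_id: str) -> float:
    return 1.0  # stands in for the LLM-judge call


score_cache: Dict[Tuple[str, str], float] = {}
run_1 = {"q1": ["p1", "p2", "p3"]}
run_2 = {"q1": ["p2", "p3", "p4"]}  # ranker change swaps p1 for p4

first = judge_calls_for_run(run_1, score_cache, placeholder_judge)
second = judge_calls_for_run(run_2, score_cache, placeholder_judge)
```

The first run pays for three judge calls; the second pays only for the new pair `("q1", "p4")`. The entry for `("q1", "p1")` stays in the cache untouched, which is the no-invalidation behaviour described above.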

Second-order benefit: cheap re-runs¶

Once the cache is populated, re-runs cost only the delta: new queries, new products, and pairs that have not been seen before. This is what makes the framework viable as a regression detector on already-live markets: daily runs cost a small fraction of the initial fill.

Variations¶

TTL-scoped if the underlying product data or judge model is expected to drift.
Persistent across runs (Zalando's choice, implied by ElastiCache) vs per-run (simpler, loses cross-run dedup).
Cross-market shared vs per-market scoped. Zalando's ElastiCache is shared; strict isolation would require per-market cache namespaces.
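
Two of these variations, TTL scoping and per-market namespaces, can be sketched together. This is an illustration under assumptions: `TTLCache`, `score_key`, the `eval:` key prefix, and the market codes are all hypothetical, and an injectable clock replaces real time so that expiry can be shown deterministically.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Dict-backed cache whose entries expire ttl_seconds after they are written."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:   # expired: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


def score_key(query_id: str, product_id: str, market: Optional[str] = None) -> str:
    """Build a score key; passing a market gives that market its own namespace."""
    namespace = market if market else "shared"
    return f"eval:{namespace}:score:{query_id}:{product_id}"


now = [0.0]  # fake clock, advanced manually
cache = TTLCache(ttl_seconds=86400, clock=lambda: now[0])
cache.set(score_key("q1", "p1", market="de"), 0.9)

hit = cache.get(score_key("q1", "p1", market="de"))           # same namespace
other_market = cache.get(score_key("q1", "p1", market="fr"))  # different namespace
now[0] = 86400.0                                              # one day later
expired = cache.get(score_key("q1", "p1", market="de"))
```

With per-market namespaces the same (q, p) pair is scored once per market, so isolation is bought at the price of cross-market dedup. Omitting `market` puts every entry in one shared namespace, which corresponds to the cross-market shared variation.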

Seen in¶

sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance.