Query-frequency power-law caching¶
Definition¶
Query-frequency power-law caching is the concrete infrastructure primitive that exploits a power-law distribution over queries to dramatically shrink the set of inputs on which you have to pay expensive-inference cost:
- Sort queries by historical frequency.
- Pick a frequency threshold (or top-N cutoff).
- Pre-compute the expensive inference output for every query at or above the threshold (the head).
- Store results in a query-keyed cache.
- Serve live traffic from the cache; route cache-misses to a cheaper or real-time-viable fallback.
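The five steps above can be sketched in a few lines (a minimal in-memory sketch; `expensive_infer` and `cheap_fallback` are placeholders for the expensive LLM call and the cheap real-time path, not any real API):

```python
from collections import Counter

def build_head_cache(query_log, threshold, expensive_infer):
    """Pre-compute expensive outputs for every query seen >= threshold times."""
    freq = Counter(query_log)
    return {q: expensive_infer(q) for q, n in freq.items() if n >= threshold}

def serve(query, cache, cheap_fallback):
    """Serve head queries from the cache; route misses to the cheap path."""
    hit = cache.get(query)
    return hit if hit is not None else cheap_fallback(query)

log = ["pizza", "pizza", "pizza", "sushi", "sushi", "vegan ramen"]
cache = build_head_cache(log, threshold=2, expensive_infer=lambda q: f"llm:{q}")
serve("pizza", cache, lambda q: f"cheap:{q}")        # head hit: "llm:pizza"
serve("vegan ramen", cache, lambda q: f"cheap:{q}")  # tail miss: "cheap:vegan ramen"
```

A production version would key on a normalised query string and persist the cache, but the control flow is exactly this: one dictionary lookup, one fallback branch.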
The key insight is that query frequency is Zipf-like: a small number of popular queries account for the majority of traffic, so pre-computing the head covers most live traffic while making the per-query expensive cost amortise across many hits.
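A back-of-envelope check of the insight, under an idealised Zipf distribution (the exponent s = 1 is an assumption for illustration; real query logs vary):

```python
def zipf_head_coverage(n_queries: int, head_size: int, s: float = 1.0) -> float:
    """Fraction of traffic covered by the top `head_size` of `n_queries`
    distinct queries under an idealised Zipf(s) frequency distribution."""
    weights = [1.0 / r ** s for r in range(1, n_queries + 1)]
    return sum(weights[:head_size]) / sum(weights)

# Caching just the top 1% of 1M distinct queries covers roughly two thirds
# of traffic at s = 1; under a uniform distribution it would cover only 1%.
coverage = zipf_head_coverage(1_000_000, 10_000)
```
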
Canonical wiki instance: Yelp Query Understanding¶
Canonical reference: Yelp's 2025-02-04 post (sources/2025-02-04-yelp-search-query-understanding-with-llms). Direct quote on the mechanism:
"To address this challenge, we leverage the fact that distribution of query frequencies can be estimated by the power-law. By caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold, we can effectively cover a substantial portion of the traffic and run a quick experiment without incurring significant cost or latency. We then integrated the cached LLM responses to the existing system and performed offline and online (A/B) evaluations."
Yelp used this twice — once for query segmentation + spell correction, once for review-highlight phrase expansion. Review-highlights scaled to 95% of traffic pre-computed, with the remaining 5% served by a fallback heuristic that averages expanded-phrase CTR over business categories. The segmentation cache's exact head cutoff was not disclosed.
Why it works¶
Three properties of the Yelp (and, more generally, search) query distribution compose to make the pattern economically load-bearing:
- Power law. A small head covers a large fraction of traffic.
- Output cacheable at the query level. Same query → same expensive output (for a reasonable refresh cadence).
- Short input + short output. The per-call LLM cost is small relative to, e.g., document-level tasks; batch pre-computation at OpenAI-batch-API rates is economical.
Yelp's verbatim framing of the three properties:
"(1) all of these tasks can be cached at the query level, (2) the amount of text to be read and generated is relatively low, and (3) the query distribution follows the power law — a small number of queries are very popular. These features make the query understanding a very efficient ground for using LLMs."
Remove any one and the economics break: if outputs aren't cacheable (e.g. personalised output), the cache doesn't amortise. If outputs are long (e.g. code generation), the batch cost grows. If the distribution isn't power-law (e.g. uniform random inputs), head-cache coverage doesn't reach most traffic.
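The output-length lever can be made concrete with a rough cost model (token counts and per-million-token batch prices below are illustrative placeholders, not Yelp's numbers):

```python
def batch_precompute_cost(n_head, in_tok, out_tok, usd_per_m_in, usd_per_m_out):
    """Rough one-off cost of pre-computing the head at batch-API rates."""
    return n_head * (in_tok * usd_per_m_in + out_tok * usd_per_m_out) / 1e6

# Short query-understanding outputs: the whole head is cheap to pre-compute.
qu = batch_precompute_cost(100_000, in_tok=20, out_tok=50,
                           usd_per_m_in=0.075, usd_per_m_out=0.30)   # ~$1.65
# Long outputs (e.g. code generation) inflate the same head by ~40x here.
gen = batch_precompute_cost(100_000, in_tok=20, out_tok=2_000,
                            usd_per_m_in=0.075, usd_per_m_out=0.30)  # ~$60
```

The point is the ratio, not the absolute dollars: cost scales linearly with output tokens, so the "short output" property is what keeps head pre-computation in pocket-change territory.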
Relationship to adjacent concepts¶
- concepts/long-tail-query — the traffic-shape cousin. "Long-tail query" names the thing the cache can't cover; "query-frequency power-law caching" names the mechanism that exploits the head of the same distribution.
- concepts/power-law-url-traffic — the URL-rendering analogue (Vercel's TPR for pre-rendering top-N pages). Same distribution shape, different artefact cached.
- patterns/head-cache-plus-tail-finetuned-model — the architecture pattern that uses this caching primitive as the head layer. Yelp's canonical form of the pattern: three-tier cache → offline-batch GPT-4o-mini → realtime BERT/T5.
- patterns/three-phase-llm-productionization — Yelp's Formulation → POC → Scale playbook. The POC phase runs on top of this caching primitive; at POC time the head cache of expensive-LLM outputs is the full serving path.
Operational considerations¶
- Frequency threshold is a product decision. Too high: small cache, poor traffic coverage. Too low: huge cache, wasted pre-computation budget on queries that won't hit.
- Refresh cadence vs. drift. Seasonality ("Christmas tree" spikes in December), new-product launches, and business-catalog changes all drift the correct answer to head queries. Cache entries must be refreshed on a cadence appropriate to the drift rate — typically daily or sub-daily for commerce/search workloads.
- Tail-miss path quality matters. The cache covers the head, but if the tail path ships bad outputs, the aggregate serving quality still suffers on the slice that users notice most (the tail queries are typically where user complaints concentrate — see concepts/long-tail-query).
- Cache key choice. Exact-match on normalised query string is the simplest key. More sophisticated variants use embedding-similarity (see patterns/llm-extraction-cache-by-similarity) to capture near-duplicate queries.
- Cold-start. A brand-new deployment has no frequency data; the cache has to bootstrap from query logs or from the tail model while frequency data accumulates.
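A minimal exact-match key for the simplest variant above (the specific normalisation steps here are assumptions for illustration, not Yelp's disclosed pipeline):

```python
import re
import unicodedata

def cache_key(query: str) -> str:
    """Exact-match cache key: Unicode-normalise, case-fold, collapse whitespace."""
    q = unicodedata.normalize("NFKC", query)
    q = q.casefold().strip()
    return re.sub(r"\s+", " ", q)

cache_key("  Pizza   Near Me ")  # == cache_key("pizza near me")
```

Every variant the normaliser collapses ("Pizza near me", "pizza  near me") hits the same cache entry; anything it can't collapse (typos, synonyms) needs the embedding-similarity variant or falls through to the tail path.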
Caveats¶
- The pattern is load-bearing for LLM economics, not essential. Pre-LLM systems had similar caches for expensive classical-NLP operations; LLMs re-centre the pattern because the per-call cost is higher.
- Frequency-only thresholding leaves quality improvements on the table for important but infrequent queries. A tail query that drives high revenue per hit may deserve cache entry anyway — see concepts/long-tail-query discussion of "tail complaints" as disproportionately user-facing.
- Output-level caching is coarser than prompt-prefix caching (see concepts/prompt-cache-consistency) — the two are complementary, not substitutable.
Seen in¶
- sources/2025-02-04-yelp-search-query-understanding-with-llms — canonical wiki reference; Yelp's pre-computation of head-query LLM output for segmentation + review-highlights.
- sources/2025-11-13-instacart-building-the-intent-engine — parallel later instance at Instacart (98/2 head/tail split) for query semantic-role-labeling.
Related¶
- concepts/long-tail-query — the distributional cousin
- concepts/power-law-url-traffic — the URL-rendering analogue
- concepts/query-understanding — the canonical task family
- concepts/llm-cascade — routing pattern that can sit on top of the cache
- patterns/head-cache-plus-tail-finetuned-model — the pattern that uses this concept as its head layer
- patterns/three-phase-llm-productionization — the productionisation playbook whose POC phase runs on top of this caching primitive
- systems/yelp-query-understanding — canonical production instance
- companies/yelp