Yelp — Search query understanding with LLMs: from ideation to production¶
Summary¶
Yelp Engineering post (2025-02-04) — the canonical first-party
disclosure of how Yelp productionised LLMs for search query
understanding tasks (segmentation, spell correction,
review-highlight phrase expansion). Positions query understanding as
"the pioneering project" for LLMs at Yelp and "the most
refined" LLM application since. The structural argument: three
properties make query understanding "a very efficient ground
for using LLMs" — (1) results are cacheable at the query
level, (2) input/output text is short, (3) query
distribution follows the power law so a small head of queries
covers most traffic. These jointly legitimise running an
expensive model offline on the head and serving the live path
from cache — the same
head-cache-plus-tail shape later canonicalised on the wiki via
Instacart's Intent Engine. Yelp's generic three-phase process
across both running examples: (1) Formulation — prototype
with the most powerful available LLM (GPT-4), iterate the prompt
and task definition, decide whether to fuse related tasks into
one prompt (spell-correction + segmentation merged;
review-highlight kept separate). (2) Proof of Concept — pre-compute
head-query responses with the expensive model into a cache,
integrate the cache into the existing pipeline, run offline
evaluations against human-labeled datasets and online A/B tests.
(3) Scaling Up — iterate prompt on the expensive model →
generate a large golden dataset → curate/correct → fine-tune a
smaller model (GPT-4o-mini) offline for ~100× cost reduction
→ pre-compute at tens-of-millions-of-queries scale → optionally
fine-tune BERT/T5 for realtime tail serving. RAG is the
recurring prompt enrichment lever: inject names of businesses
that have been viewed for that query into the segmentation
prompt to disambiguate business names from topics/locations;
inject the top business categories from an in-house predictive
model into the review-highlight prompt to ground the phrase
expansion. Reported operational wins: segmentation-based
implicit location rewrite ships online A/B wins (e.g.
"Epcot restaurants" rewrites the search-backend geobox from
"Orlando, FL" to "Epcot, Bay Lake, FL" — a 30-mile-radius
refinement); token-probability of the name tag improves
business-name matching; review-highlight CTR lifts
(Session/Search CTR) reported, "higher for less common queries
in the tail"; GPT-3 → GPT-4 upgrade gave "Search CTR
improvement on top of previous gains"; review-highlight scaled
to 95% of traffic via offline OpenAI batch API calls, with
the remaining 5% served by averaging expanded-phrase CTR signal
over business categories as a fallback heuristic. Reported
scaling numbers: fine-tuned GPT-4o-mini = "up to a 100x savings
in cost" vs. direct GPT-4 prompt at equivalent quality for
query-understanding tasks. BERT + T5 are Yelp's current realtime
tail-query models. Haystack talk referenced for deeper detail.
Key takeaways¶
- Query understanding's three properties make it the ideal LLM beachhead. Yelp's structural argument verbatim: "(1) all of these tasks can be cached at the query level, (2) the amount of text to be read and generated is relatively low, and (3) the query distribution follows the power law — a small number of queries are very popular. These features make the query understanding a very efficient ground for using LLMs." The three properties together compose: cheap per-call cost × small traffic slice that needs real inference × reusable outputs = the LLM bill is bounded by the head-query set size, not by overall query volume. This is the infrastructure preamble that makes the rest of the post's shape legible — it's not "LLMs for search", it's "LLMs for the subset of search where the economics work". (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
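The head-cache economics above can be sketched numerically. This is an illustrative back-of-envelope under an assumed Zipf distribution; the exponent and query counts are assumptions, not Yelp-disclosed figures.

```python
# Sketch of the power-law economics: fraction of traffic covered by
# caching only the most frequent ("head") queries, assuming query
# frequency follows a Zipf distribution. All numbers are illustrative
# assumptions, not Yelp-disclosed figures.

def zipf_coverage(head_size: int, distinct_queries: int, s: float = 1.0) -> float:
    """Share of total traffic covered by caching the top `head_size`
    queries, assuming the rank-r query's frequency is proportional to 1/r**s."""
    weights = [1.0 / r**s for r in range(1, distinct_queries + 1)]
    return sum(weights[:head_size]) / sum(weights)

# Caching just 10% of distinct queries covers well over half the traffic
# under s=1, which is what bounds the LLM bill by head size, not volume.
coverage = zipf_coverage(head_size=100_000, distinct_queries=1_000_000)
```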
- Query segmentation fused with spell correction in one prompt because a strong LLM can do both simultaneously. Segmentation output uses six labels — topic, name, location, time, question, none — with a meta-tag for spell-corrected segments: "healthy fod near me → {topic} healthy food {location} near me [spell corrected - high]". The decision to fuse: "spell-correction and segmentation can be done together by a sufficiently powerful model, so we added a meta tag to mark spell corrected sections and decided to combine these two tasks into a single prompt." This is a canonical instance of the LLM-era consolidation: a pipeline that used to be two cascaded pre-LLM models collapses into one LLM call when the model's capability exceeds either individual task's need. Legacy subclasses of "topic" were also collapsed because "this would have required the LLM to understand intricate details of our internal taxonomy that are both unintuitive and subject to change" — the LLM output schema should match downstream consumers, not internal training-data taxonomies. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
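A downstream consumer has to parse this tagged output format. The label set below is the post's six labels; the parsing code itself is an assumption about how a consumer might read the LLM's output string.

```python
import re

# Sketch of a parser for the tagged segmentation output shown in the post,
# e.g. "{topic} healthy food {location} near me [spell corrected - high]".
# The six labels are from the post; this parser is an illustrative assumption.

LABELS = {"topic", "name", "location", "time", "question", "none"}
SEGMENT_RE = re.compile(r"\{(\w+)\}\s*([^{\[]+?)(?:\s*\[([^\]]+)\])?\s*(?=\{|$)")

def parse_segmentation(output: str):
    segments = []
    for label, text, meta in SEGMENT_RE.findall(output):
        if label not in LABELS:
            continue  # skip anything outside the known schema
        segments.append({"label": label, "text": text.strip(), "meta": meta or None})
    return segments

parsed = parse_segmentation(
    "{topic} healthy food {location} near me [spell corrected - high]"
)
```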
- RAG side-inputs disambiguate the extraction. Two worked examples. Segmentation: augment the query text with "names of businesses that have been viewed for that query" — so "buon cuon [Banh Cuon Tay Ho, Phuong Nga Banh Cuon]" lets the LLM spell-correct "buon cuon" → "banh cuon" and also recognise it as a topic rather than a name. Review highlights: augment the query with the "most relevant business categories with respect to that query (from our in-house predictive model)" — so ambiguous queries like "pool" (swimming vs. billiards) get the right expansion universe. The RAG context is not retrieved documents in the usual sense — it's side-channel signals from Yelp's existing ML stack piped into the LLM prompt as grounding context. This generalises to patterns/rag-side-input-for-structured-extraction: the input to the LLM is (query, auxiliary_structured_signal), not just (query). (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
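The (query, auxiliary_structured_signal) shape can be made concrete as a prompt builder. The prompt wording and function name below are illustrative assumptions; only the grounding-context idea and the "buon cuon" example come from the post.

```python
# Sketch of the RAG side-input shape: the LLM sees the query plus an
# auxiliary structured signal, not the query alone. Prompt wording and
# the function name are illustrative assumptions.

def build_segmentation_prompt(query: str, viewed_business_names: list[str]) -> str:
    names = ", ".join(viewed_business_names)
    return (
        "Segment the query into {topic}/{name}/{location}/{time}/{question}/{none} tags.\n"
        f"Businesses viewed for this query (grounding context): [{names}]\n"
        f"Query: {query}"
    )

prompt = build_segmentation_prompt(
    "buon cuon", ["Banh Cuon Tay Ho", "Phuong Nga Banh Cuon"]
)
```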
- The three-phase Formulation → POC → Scale process is the reusable playbook. Phase 1 (Formulation): prototype with the strongest available LLM (GPT-4), iterate prompts aggressively, decide on output schema + task fusion + RAG signals. Phase 2 (POC): exploit power-law caching to cover most traffic with pre-computed expensive-LLM outputs cheaply; run offline eval + online A/B at that cache coverage. Phase 3 (Scale): if the A/B wins, (a) keep iterating the strong-LLM prompt; (b) generate a golden dataset by running the strong prompt on a representative sample; (c) improve the golden dataset by targeting likely-mislabeled inputs for re-labeling or removal; (d) fine-tune GPT-4o-mini on the curated dataset and run offline at tens-of-millions scale for ~100× cost reduction; (e) optionally fine-tune an even smaller realtime model (BERT, T5) for the long-tail queries never seen in the pre-computed cache. Canonicalised as patterns/three-phase-llm-productionization — the Yelp shape of the generic LLM-productionisation lifecycle. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
- Power-law query caching is the cost-control primitive. "By caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold, we can effectively cover a substantial portion of the traffic and run a quick experiment without incurring significant cost or latency." Yelp's version: set a frequency threshold, pre-compute expensive LLM output for all queries above it, store the results in a query-level cache, serve live traffic from the cache. This is functionally identical to Instacart's later 98/2 head/tail split (see Intent Engine) but documented in 2025-02 — Yelp is the wiki's earliest canonical disclosure of this pattern for query-understanding workloads. Review highlights later scaled to 95% of traffic pre-computed, 5% fallback heuristic; exact head-cut threshold not disclosed but operationally similar to the Instacart 98/2. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
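The threshold-then-cache mechanic can be sketched end to end. The threshold value, the model stand-ins, and the example queries are all illustrative assumptions (Yelp does not disclose its cutoff).

```python
from collections import Counter

# Sketch of the frequency-threshold head cache: pre-compute the expensive
# LLM's output for every query above a cutoff, then serve live traffic
# from the cache. `expensive_llm`, `realtime_model`, and min_freq are
# illustrative stand-ins.

def build_head_cache(query_log, expensive_llm, min_freq):
    counts = Counter(query_log)
    return {q: expensive_llm(q) for q, n in counts.items() if n >= min_freq}

def serve(query, cache, realtime_model):
    return cache[query] if query in cache else realtime_model(query)

log = ["pizza"] * 150 + ["banh cuon tay ho"]          # head query + one tail query
cache = build_head_cache(log, lambda q: f"gpt4:{q}", min_freq=100)
head_answer = serve("pizza", cache, lambda q: f"rt:{q}")
tail_answer = serve("banh cuon tay ho", cache, lambda q: f"rt:{q}")
```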
- Fine-tuned GPT-4o-mini is the serving-cost lever at scale. Verbatim: "Because fine-tuned query understanding models only require very short inputs and outputs, we have seen up to a 100x savings in cost, compared to using a complex GPT-4 prompt directly." The model is trained on the expensive-LLM-generated golden dataset (post-curation). This is a textbook offline-teacher-online-student distillation instance — the teacher (GPT-4) labels the training set, the student (GPT-4o-mini) serves production at two orders of magnitude less cost. The curation-before-training step is load-bearing: "With hard work here, it can be possible (for many tasks) to improve upon GPT-4's raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal." The student's quality ceiling is "the curated teacher", not "the raw teacher". (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
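The teacher-to-student handoff is mechanically a dataset-format step: curated (query, teacher_output) pairs become chat-format fine-tuning examples in the JSONL "messages" shape used by OpenAI's fine-tuning API. The system prompt and example data below are illustrative assumptions.

```python
import json

# Sketch of the distillation handoff: curated (query, teacher_output)
# pairs serialised into the chat-format JSONL that OpenAI fine-tuning
# consumes. System prompt and example content are illustrative.

def to_finetune_jsonl(curated_pairs, system_prompt):
    lines = []
    for query, teacher_output in curated_pairs:
        lines.append(json.dumps({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
            {"role": "assistant", "content": teacher_output},
        ]}))
    return "\n".join(lines)

jsonl = to_finetune_jsonl(
    [("healthy fod near me",
      "{topic} healthy food {location} near me [spell corrected - high]")],
    system_prompt="Segment the query into tagged spans.",
)
```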
- BERT/T5 serve the realtime long tail at Yelp today; the realtime-OpenAI path is coming. Verbatim: "at Yelp, we have used BERT and T5 to serve as our real time LLM model. These models are optimized for speed and efficiency, allowing us to process user queries rapidly and accurately during the complete rollout phase. As the cost and latency of LLMs improve, as seen with GPT4o-mini and smaller prompts, realtime calls for OpenAI's fine-tuned model might also be achievable in the near future." This is the three-tier cascade in its fullest form: (1) cache (head) → (2) offline fine-tuned GPT-4o-mini for 95%+ coverage → (3) realtime BERT/T5 for never-seen-before tail queries. Functionally equivalent to Instacart's two-tier cache+Llama-3-8B-LoRA but with an extra classical-NLP tier at the bottom of the stack. Canonical wiki instance of a three-tier model cascade — orthogonal to the confidence-based LLM cascade because the Yelp tiers route on cache-hit / cache-miss / never-seen, not on confidence. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
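At serve time the routing collapses to a cache-membership check: tiers 1 and 2 both land in the pre-computed cache (filled offline by GPT-4 or fine-tuned GPT-4o-mini), and tier 3 handles never-seen tail queries in realtime. A minimal sketch, with illustrative names:

```python
# Sketch of the tier routing described above. Routing is on cache
# membership (cache-hit vs. never-seen), not on model confidence.
# Function and tier names are illustrative assumptions.

def route(query, head_cache, realtime_tail_model):
    if query in head_cache:
        return head_cache[query], "precomputed"   # tiers 1-2: offline-filled cache
    return realtime_tail_model(query), "realtime"  # tier 3: BERT/T5 on the tail

result, tier = route("pizza", {"pizza": "cached segmentation"},
                     lambda q: f"bert:{q}")
```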
- Token-probability of segmentation tags becomes a ranking signal. Not just the discrete segmentation output — "we were able to (a) leverage token probabilities for (name) tags to improve our query to business name matching and ranking system". This is a canonical token-probability-as-ranking-signal instance: the LLM's output distribution is retained past the discrete decision, and downstream ranking treats the per-token name-tag probability as a continuous feature (higher probability → model is more confident this segment is a business name → downstream ranker can weight name-matching more heavily). Same trick as Instacart's self-verification entailment-token logit, but used for ranking instead of review-routing. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
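The logit-reuse step can be sketched as a feature extractor. The (token, logprob) data shape is an illustrative assumption about what an LLM API returns; the averaging rule is likewise assumed, not disclosed.

```python
import math

# Sketch of token-probability-as-ranking-signal: keep the logprobs of
# tokens emitted inside the {name} span and average their probabilities
# into a continuous confidence feature for the downstream ranker.
# Data shapes and the aggregation rule are illustrative assumptions.

def name_tag_confidence(tokens_with_logprobs, name_tokens):
    probs = [math.exp(lp) for tok, lp in tokens_with_logprobs if tok in name_tokens]
    return sum(probs) / len(probs) if probs else 0.0

conf = name_tag_confidence(
    [("banh", -0.05), ("cuon", -0.10), ("near", -2.0)],
    name_tokens={"banh", "cuon"},
)
```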
- Implicit query location rewrite is the canonical downstream-transform example. The post gives the exact mechanism verbatim: "leverages query segmentation to implicitly rewrite text within location boxes to a refined location within 30 miles of the user's search if we have high confidence in the location intent. For example, the segmentation 'epcot restaurants => {location} epcot {topic} restaurant' helps us to understand the user's intent in finding businesses within the Epcot theme park at Walt Disney World. By implicitly rewriting the location text from 'Orlando, FL' to 'Epcot' in the search backend, our geolocation system was able to narrow down the search geobox to the relevant latlong." Three rewrite examples canonical from the post's table: "Restaurants near Chase Center" (SF → 1 Warriors Way SF 94158); "Ramen Upper West Side" (NY → Upper West Side Manhattan NY); "Epcot restaurants" (Orlando FL → Epcot Bay Lake FL). This is a specific concept — concepts/implicit-query-location-rewrite — and also a canonical consumer of segmentation output. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
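The confidence-gated rewrite can be sketched as a small function. The gating mechanism and threshold are assumptions (the post does not disclose how "high confidence" is computed, as noted in the caveats); only the Orlando-to-Epcot example is from the post.

```python
# Sketch of the confidence-gated implicit location rewrite: if the
# segmentation yields a {location} segment and the location-intent
# confidence clears a threshold, replace the search backend's location
# box with the refined location. Threshold and confidence source are
# illustrative assumptions.

def rewrite_location(segments, current_location, location_confidence, threshold=0.9):
    for seg in segments:
        if seg["label"] == "location" and location_confidence >= threshold:
            return seg["text"]  # refined geobox, e.g. "Orlando, FL" -> "Epcot"
    return current_location

refined = rewrite_location(
    [{"label": "location", "text": "Epcot"}, {"label": "topic", "text": "restaurant"}],
    current_location="Orlando, FL",
    location_confidence=0.97,
)
```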
- Review-highlight phrase expansion is a creative/generative task, not an extraction task. Distinct from segmentation in kind: segmentation tags "healthy food" from the query; phrase expansion generates new phrases that match the concept — "healthy options, nutritious, organic, low calorie, low carb, low fat, high fiber, fresh, plant-based, superfood" — which are then used to find interesting review snippets in real reviews. The post documents the prompt evolution timeline — May 2022 → March 2023 → September 2023 → December 2023 (with RAG) — as a monotonic expansion of the curated example phrases: from 3 expansions → 5 expansions → 11 expansions (structured into tiers of aggressiveness) → 11+RAG-grounded. The expansion set includes multi-word and casual phrases ("watch the game") and navigates the semantic tree both up ("vegan burritos" → "vegan", "vegan options") and out ("seafood" → "fresh fish", "fresh catch", "salmon roe", "shrimp"). Canonicalised as phrase expansion for review highlights. (Source: sources/2025-02-04-yelp-search-query-understanding-with-llms)
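How an expanded phrase set gets used downstream can be sketched as snippet matching. The naive substring search below is an illustrative assumption; Yelp's actual snippet retrieval is not disclosed, and the example sentences are invented.

```python
# Sketch of downstream use of an expanded phrase set: scan review
# sentences for any expanded phrase to surface highlight snippets.
# Matching strategy and example data are illustrative assumptions.

def find_highlights(review_sentences, expanded_phrases):
    hits = []
    for sentence in review_sentences:
        matched = [p for p in expanded_phrases if p in sentence.lower()]
        if matched:
            hits.append((sentence, matched))
    return hits

hits = find_highlights(
    ["Tons of plant-based options on the menu.", "Parking was easy."],
    ["healthy options", "nutritious", "plant-based", "low carb"],
)
```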
Architecture components (named by the post)¶
The post describes three generic process-phases applied to two running examples (Query Segmentation and Review Highlights):
| Phase | What happens | Models involved | Outputs |
|---|---|---|---|
| 1. Formulation | Prototype prompts with the strongest available LLM. Define output schema. Decide on RAG side-inputs. Decide task fusion (e.g. merge spell-correction + segmentation). | Stable GPT-4 (or o1-mini / o1-preview for tasks needing logical reasoning) | Prompt + schema |
| 2. Proof of Concept | Pre-compute expensive-LLM responses for head queries above a frequency threshold. Integrate cache into existing system. Run offline (human-labeled) + online (A/B) evals. | Expensive LLM (GPT-4) + head-query cache | A/B metrics |
| 3. Scaling Up | (a) Iterate GPT-4 prompt on worst-performing tracked queries. (b) Generate golden dataset. (c) Improve golden dataset (target mislabeled inputs for re-labeling / removal). (d) Fine-tune GPT-4o-mini → ~100× cost reduction. Run offline at tens-of-millions scale. (e) Optionally fine-tune BERT/T5 for realtime tail serving. | Expensive LLM (training-data gen) + fine-tuned GPT-4o-mini (offline batch) + BERT/T5 (realtime) | Cached responses at scale + realtime-serving model |
Query Segmentation — concrete instantiation¶
- Labels: topic, name, location, time, question, none.
- Meta tags: [spell corrected - high] marks a spell-corrected segment at high confidence.
- Task fusion: spell-correction + segmentation in one prompt.
- RAG side-input: business names viewed for that query.
- Downstream consumers: implicit location rewrite (geobox refinement); token-probability of the name tag used by the ranker.
Review Highlights — concrete instantiation¶
- Task shape: generate a creatively-expanded list of phrases that could match review snippets for the given query.
- No task fusion with segmentation (kept separate).
- RAG side-input: top business categories from an in-house predictive model.
- Scale state: 95% of traffic served from pre-computed cache; 5% served by averaging expanded-phrase CTR over business categories.
- Cross-task reuse: CTR signals on expanded phrases are fed back into downstream ranking.
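The 5% fallback (averaging expanded-phrase CTR over business categories) can be sketched as follows. The data shapes, category names, and CTR numbers are illustrative assumptions; only the averaging-over-categories idea is from the post.

```python
from collections import defaultdict

# Sketch of the 5% fallback heuristic: for queries with no pre-computed
# expansion, score candidate phrases by the average CTR each phrase has
# earned within the business's categories. All data is illustrative.

def fallback_phrase_scores(phrase_ctr_by_category, business_categories):
    observed = defaultdict(list)
    for category in business_categories:
        for phrase, ctr in phrase_ctr_by_category.get(category, {}).items():
            observed[phrase].append(ctr)
    return {phrase: sum(ctrs) / len(ctrs) for phrase, ctrs in observed.items()}

scores = fallback_phrase_scores(
    {"restaurants": {"fresh": 0.04, "cozy": 0.02},
     "vegan": {"fresh": 0.06, "plant-based": 0.05}},
    business_categories=["restaurants", "vegan"],
)
```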
Operational numbers disclosed¶
- Fine-tuned GPT-4o-mini: "up to a 100x savings in cost, compared to using a complex GPT-4 prompt directly" — canonical teacher-to-student cost-reduction datum at query-understanding altitude.
- Review-highlight rollout: "scaled to 95% of traffic by pre-computing snippet expansions for those queries using OpenAI's batch calls". Remaining 5% served by averaging expanded-phrase CTR over business categories.
- Location rewrite: "refined location within 30 miles of the user's search if we have high confidence in the location intent" — the confidence-gated rewrite radius.
- Review-highlight A/B wins: "increased Session / Search CTR across our platforms"; "impact was also higher for less common queries in the tail". Exact deltas not disclosed.
- GPT-3 → GPT-4 upgrade: "improved Search CTR on top of previous gains". Exact delta not disclosed.
- Segmentation output schema: 6 labels (topic, name, location, time, question, none) + meta-tags for spell correction.
- Review-highlight prompt example evolution: 3 phrases (May 2022) → 5 (March 2023) → 11 (September 2023) → 11+RAG (December 2023).
- Models named: GPT-4 (formulation + training-data gen), o1-mini / o1-preview (complex reasoning tasks), GPT-4o-mini (fine-tuned offline batch), BERT + T5 (realtime tail serving).
Caveats¶
- No numeric head/tail split. Unlike Instacart's 98/2 disclosure, Yelp doesn't name the exact frequency cutoff or traffic-coverage percentage for the segmentation cache. The review-highlights 95% datum is the only concrete coverage number.
- No offline evaluation metrics disclosed for segmentation. The post says segmentation was compared "against the status quo system on human labeled datasets of name match and location intent" but doesn't share accuracy / precision / recall numbers.
- Review-highlight quality evaluation is subjective. "Offline evaluation of the quality of generated phrases is subjective and requires very strong human annotators with good product, qualitative, and engineering understanding." This is an explicit limitation: the post can't say "+X% F1" the way an extraction task could.
- Confidence-score mechanism for location-rewrite gate not disclosed. The "high confidence in the location intent" gate is mentioned but not specified — is it the token-probability of the location tag? A threshold on segmentation-model output? A separate classifier? The post doesn't say.
- Fine-tuning dataset size not disclosed. The sample that becomes the golden dataset is described as "large (but not unmanageably so, since quality > quantity) and it should cover a diverse distribution of inputs" — no specific N.
- BERT/T5 serving latency / p95 not disclosed. The realtime models are named but their production latency profile is not reported.
- Tail-query quality gap to the cached head not quantified. Tail queries hit BERT/T5 in realtime (vs. GPT-4o-mini / GPT-4 output from cache for the head) — the quality differential is not measured, only claimed: "these models are optimized for speed and efficiency, allowing us to process user queries rapidly and accurately".
- Prompt-caching dimension not mentioned. The post doesn't discuss LLM-provider-side prompt caching (the prompt-cache concept canonicalised from Vercel v0) — Yelp's caching is output-level (entire LLM response per query) rather than prompt-prefix-level.
- No A/B treatment on the realtime BERT/T5 tail model itself — the post's A/B wins are all measured at head-cache coverage. Whether the BERT/T5 model contributes its own lift (vs. no tail model at all) is not stated.
Source¶
- Original: https://engineeringblog.yelp.com/2025/02/search-query-understanding-with-LLMs.html
- Raw markdown:
raw/yelp/2025-02-04-search-query-understanding-with-llms-from-ideation-to-produc-b08fd672.md
Related¶
- systems/yelp-query-understanding — the named system
- systems/yelp-search — parent production context
- systems/gpt-4 / systems/gpt-4o-mini / systems/o1-preview / systems/o1-mini / systems/bert / systems/t5 — the model zoo
- concepts/query-understanding — upstream parent concept
- concepts/long-tail-query — the traffic shape that forces the hybrid architecture
- concepts/query-frequency-power-law-caching — the caching substrate that makes LLM-at-search-scale economically viable
- concepts/implicit-query-location-rewrite — the canonical downstream-consumer of segmentation output
- concepts/review-highlight-phrase-expansion — the creative-generation task family
- concepts/token-probability-as-ranking-signal — the logit-reuse pattern for downstream ranking
- concepts/llm-segmentation-over-ner — why LLMs supplant traditional NER for segmentation
- concepts/retrieval-augmented-generation — with two RAG side-input instances (business names + business categories)
- concepts/llm-cascade — related cost-vs-quality routing pattern, thoroughly instantiated in the Scaling-Up phase
- patterns/three-phase-llm-productionization — the Formulation → POC → Scale playbook
- patterns/head-cache-plus-tail-finetuned-model — the serving architecture shape
- patterns/offline-teacher-online-student-distillation — the training-pipeline shape for GPT-4 → GPT-4o-mini fine-tune
- patterns/rag-side-input-for-structured-extraction — the RAG-as-disambiguator pattern (business names; business categories)
- companies/yelp