SYSTEM

Yelp Query Understanding

Definition

Yelp Query Understanding is the LLM-powered query-processing pipeline that sits between the raw user query and Yelp Search's retrieval + ranking backend. It is the system canonicalised by the 2025-02-04 Yelp Engineering post "Search query understanding with LLMs: from ideation to production" (sources/2025-02-04-yelp-search-query-understanding-with-llms).

Tasks covered (as of 2025-02-04)

  • Query segmentation — assign labels from {topic, name, location, time, question, none} to each token-run of the query. Used downstream for name-matching, location rewrite, and filter auto-enablement.
  • Spell correction — fused with segmentation into a single LLM prompt; spell-corrected segments are meta-tagged with [spell corrected - high].
  • Review-highlight phrase expansion — creative generation of semantically adjacent phrases so interesting review snippets can be surfaced alongside each business.
  • Canonicalisation — mentioned in passing in the post's preamble; not detailed.
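The fused segmentation + spell-correction output can be pictured with a small sketch. The field names, the stub function, and the example query are illustrative, not Yelp's actual schema; only the label set and the [spell corrected - high] meta-tag come from the post:

```python
# Illustrative sketch (not Yelp's actual schema): a fused segmentation +
# spell-correction result for a hypothetical query "thai fod near epcot".
from typing import List, Optional, TypedDict

class Segment(TypedDict):
    text: str            # token run, possibly spell-corrected
    label: str           # one of: topic, name, location, time, question, none
    meta: Optional[str]  # e.g. "[spell corrected - high]"

def segment_stub(query: str) -> List[Segment]:
    """Hard-coded stand-in for the LLM call, for illustration only."""
    if query == "thai fod near epcot":
        return [
            {"text": "thai food", "label": "topic",
             "meta": "[spell corrected - high]"},  # "fod" -> "food"
            {"text": "near", "label": "none", "meta": None},
            {"text": "epcot", "label": "location", "meta": None},
        ]
    return [{"text": query, "label": "none", "meta": None}]

segments = segment_stub("thai fod near epcot")
labels = [s["label"] for s in segments]
print(labels)  # ['topic', 'none', 'location']
```

Fusing the two tasks into one prompt means downstream consumers read corrections and labels off the same structure.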

Architectural components (2025-02-04 snapshot)

                 raw query text
        ┌──────────────────────────────┐
        │   RAG side-input assembly    │
        │   (seg:  businesses viewed)  │
        │   (rh:   business categories)│
        └──────────┬───────────────────┘
┌──────────────────────────────────────────────────┐
│  Three-tier cascade (cache → batch → realtime)   │
│                                                  │
│   ┌───────────────┐                              │
│   │  head cache   │ ◀── pre-computed by GPT-4    │
│   │  (expensive   │     or fine-tuned GPT-4o-mini│
│   │   LLM output) │     via OpenAI batch API     │
│   └──────┬────────┘                              │
│          │ hit                                   │
│     miss │                                       │
│          ▼                                       │
│   ┌───────────────┐                              │
│   │  fine-tuned   │ ◀── offline batch coverage   │
│   │  GPT-4o-mini  │     (95%+ for review highl.) │
│   └──────┬────────┘                              │
│          │ hit                                   │
│     miss │                                       │
│          ▼                                       │
│   ┌───────────────┐                              │
│   │  BERT / T5    │ ◀── realtime long-tail       │
│   │  (realtime)   │     serving                  │
│   └──────┬────────┘                              │
└──────────┼───────────────────────────────────────┘
        segmentation   phrase expansion   …
           │                 │
           ▼                 ▼
     ┌───────────┐     ┌───────────────┐
     │  Yelp     │     │  Review       │
     │  Search   │     │  highlighting │
     │  backend  │     │  sub-system   │
     └───────────┘     └───────────────┘

Notes per the post:

  • Cache layer: "caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold". Generalised as concepts/query-frequency-power-law-caching.
  • Batch layer: fine-tuned GPT-4o-mini runs offline via OpenAI batch API calls; review-highlight expansion scaled to 95% of traffic via this path.
  • Realtime layer: BERT + T5 models serve the 5% tail that never hits the cache or batch path.
  • RAG side-inputs differ per task: segmentation uses "names of businesses that have been viewed for that query"; review-highlight uses "the most relevant business categories with respect to that query (from our in-house predictive model)".
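At serving time the cascade reduces to a lookup with two fallbacks. A minimal sketch, assuming dict-backed stores and a callable model (all names are hypothetical, not Yelp's code):

```python
# Minimal sketch of three-tier dispatch; all names are hypothetical.
# head_cache:  pre-computed GPT-4 / fine-tuned GPT-4o-mini outputs (tier 1)
# batch_store: wider offline table filled via OpenAI batch API runs (tier 2)
# realtime_model: BERT/T5 fallback for the uncached long tail (tier 3)

def understand(query, head_cache, batch_store, realtime_model):
    if query in head_cache:                       # tier 1: head queries
        return head_cache[query], "cache"
    if query in batch_store:                      # tier 2: batch coverage
        return batch_store[query], "batch"
    return realtime_model(query), "realtime"      # tier 3: long tail

# Toy usage
head = {"pizza": {"topic": "pizza"}}
batch = {"vegan pho near me": {"topic": "vegan pho", "location": "near me"}}
tail_model = lambda q: {"topic": q}               # stand-in for BERT/T5

print(understand("pizza", head, batch, tail_model)[1])                   # cache
print(understand("vegan pho near me", head, batch, tail_model)[1])       # batch
print(understand("left-handed llama cafe", head, batch, tail_model)[1])  # realtime
```

The design choice is that quality degrades gracefully down the tiers while latency and per-query cost stay bounded.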

Downstream consumers

  • Implicit location rewrite — when segmentation tags a {location} with high confidence, the search backend's geobox is rewritten to the refined location (within 30 miles of the user's original search). Canonical example: "epcot restaurants" → rewrite geobox from "Orlando, FL" to "Epcot, Bay Lake, FL". See concepts/implicit-query-location-rewrite.
  • Business-name matching — the token probability of the {name} tag is used as a continuous feature in Yelp's query-to-business-name matching + ranking system.
  • Auto-enabled filters — not fully detailed in the post, but listed as a downstream benefit of "more intelligent labeling of these tags."
  • Review-snippet selection — phrase expansion output feeds the review-highlight sub-system, which picks review passages matching the expanded phrases for display alongside search results.
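The location-rewrite consumer can be sketched as: trust the {location} tag only above a confidence threshold, resolve it to a place, and swap the geobox only if the place sits within the 30-mile radius. The threshold, gazetteer, and distance math below are all illustrative, not Yelp's implementation:

```python
# Hypothetical sketch of the implicit location-rewrite consumer.
# The confidence threshold, gazetteer, and coordinates are illustrative.
import math

def haversine_miles(a, b):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 3958.8 * 2 * math.asin(math.sqrt(h))

GAZETTEER = {  # toy stand-in for a real place resolver
    "epcot": ("Epcot, Bay Lake, FL", (28.3747, -81.5494)),
}

def maybe_rewrite(location_segment, confidence, user_center,
                  conf_threshold=0.9, max_miles=30):
    """Rewrite the geobox only for confident tags resolving nearby."""
    if confidence < conf_threshold:
        return None
    hit = GAZETTEER.get(location_segment.lower())
    if hit and haversine_miles(user_center, hit[1]) <= max_miles:
        return hit[0]
    return None

orlando = (28.5384, -81.3789)  # user's current search center
print(maybe_rewrite("epcot", 0.97, orlando))  # Epcot, Bay Lake, FL
print(maybe_rewrite("epcot", 0.40, orlando))  # None (low confidence)
```

The confidence gate matters because a false-positive {location} tag would silently move the user's search area.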

Build history (per the post's three-phase process)

Yelp's canonical three-phase lifecycle — see patterns/three-phase-llm-productionization — applied to both running examples:

  1. Formulation (GPT-4): decide output schema, merge tasks where possible (segmentation + spell correction fused), decide RAG side-inputs.
  2. Proof of Concept (head-cache with expensive-LLM output): pre-compute for head queries, wire up cache, offline + online evals. Review-highlight A/B "increased Session / Search CTR across our platforms"; location-rewrite "achieved online metric wins".
  3. Scaling Up: fine-tune GPT-4o-mini on the GPT-4-generated, curated golden dataset ("up to a 100x savings in cost"); pre-compute at tens-of-millions scale; deploy BERT/T5 for realtime tail.
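The distillation step in phase 3 amounts to turning the teacher's curated outputs into a fine-tuning file. A generic sketch of the data prep, assuming OpenAI's chat-format JSONL for fine-tuning; the curation predicate is a stand-in for the human re-labeling/removal step the post describes:

```python
# Sketch: build a fine-tuning JSONL from curated teacher (GPT-4) outputs.
# The chat-style rows follow OpenAI's fine-tuning file format; the
# looks_mislabeled flag stands in for human curation of suspect labels.
import json

def build_finetune_file(examples, path, system_prompt):
    """examples: iterable of (query, teacher_output, looks_mislabeled)."""
    kept = 0
    with open(path, "w") as f:
        for query, output, looks_mislabeled in examples:
            if looks_mislabeled:          # drop suspect teacher labels
                continue
            row = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
                {"role": "assistant", "content": json.dumps(output)},
            ]}
            f.write(json.dumps(row) + "\n")
            kept += 1
    return kept

examples = [
    ("thai food near epcot", {"topic": "thai food", "location": "epcot"}, False),
    ("asdfgh", {"topic": "asdfgh"}, True),  # flagged by curation, dropped
]
n = build_finetune_file(examples, "segmentation_ft.jsonl",
                        "Segment the Yelp search query.")
print(n)  # 1
```

Dropping flagged rows rather than keeping them is what limits (but does not eliminate) teacher-bias inheritance.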

Tradeoffs / gotchas

  • Cache freshness is tied to query-distribution drift. Head queries can change seasonally ("Christmas tree", "superbowl party"); the cache needs a refresh cadence appropriate to the drift rate. The post doesn't disclose Yelp's exact cadence.
  • Tail-query quality gap. BERT/T5 realtime output is implicitly lower quality than the offline GPT-4o-mini output; the gap is asserted but not measured in the post.
  • Fine-tuned student inherits teacher's biases. The curation step ("isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal") mitigates but doesn't eliminate this.
  • RAG side-input pipeline is a separate dependency. The "businesses viewed for that query" and "predicted categories" are Yelp's own in-house signals; a user without a comparable signal infrastructure cannot trivially replicate the pattern.
  • Prompt-caching not mentioned. Yelp caches output, not prompt prefixes — a potential future cost-reduction axis.
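The economics of the head cache follow from the power law: under a Zipf-like query distribution, a small cached head covers a large share of traffic. A back-of-the-envelope sketch; the exponent and counts are made up, not Yelp's numbers:

```python
# Back-of-the-envelope: traffic coverage from caching the top-K distinct
# queries under a Zipf(s=1) frequency distribution. Numbers illustrative.

def zipf_coverage(total_queries, cached_head, s=1.0):
    """Fraction of query *traffic* covered by caching the top-K queries."""
    weights = [1 / (rank ** s) for rank in range(1, total_queries + 1)]
    return sum(weights[:cached_head]) / sum(weights)

# e.g. 1M distinct queries; cache the top 50k (5% of distinct queries)
cov = zipf_coverage(1_000_000, 50_000)
print(f"{cov:.0%}")  # → 79%
```

Caching 5% of distinct queries covering ~79% of traffic is why the expensive-LLM tier is affordable, and also why the cache refresh cadence matters: the identity of the head drifts even when the power law doesn't.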
