SYSTEM

Yelp Query Understanding

Definition

Yelp Query Understanding is the LLM-powered query-processing pipeline that sits between the raw user query and Yelp Search's retrieval + ranking backend. It is the system canonicalised by the 2025-02-04 Yelp Engineering post "Search query understanding with LLMs: from ideation to production" (sources/2025-02-04-yelp-search-query-understanding-with-llms).

Tasks covered (as of 2025-02-04)

  • Query segmentation — assign labels from {topic, name, location, time, question, none} to each token-run of the query. Used downstream for name-matching, location rewrite, and filter auto-enablement.
  • Spell correction — fused with segmentation into a single LLM prompt; spell-corrected segments are meta-tagged with [spell corrected - high].
  • Review-highlight phrase expansion — creative generation of semantically adjacent phrases so interesting review snippets can be surfaced alongside each business.
  • Canonicalisation — mentioned in passing in the post's preamble; not detailed.
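The fused segmentation + spell-correction output can be pictured with a small sketch. The field names, the stub function, and the example query are illustrative, not Yelp's actual schema; only the label set and the [spell corrected - high] meta-tag come from the post:

```python
# Illustrative sketch (not Yelp's actual schema): a fused segmentation +
# spell-correction result for a hypothetical query "thai fod near epcot".
from typing import List, Optional, TypedDict

class Segment(TypedDict):
    text: str            # token run, possibly spell-corrected
    label: str           # one of: topic, name, location, time, question, none
    meta: Optional[str]  # e.g. "[spell corrected - high]"

def segment_stub(query: str) -> List[Segment]:
    """Hard-coded stand-in for the LLM call, for illustration only."""
    if query == "thai fod near epcot":
        return [
            {"text": "thai food", "label": "topic",
             "meta": "[spell corrected - high]"},  # "fod" -> "food"
            {"text": "near", "label": "none", "meta": None},
            {"text": "epcot", "label": "location", "meta": None},
        ]
    return [{"text": query, "label": "none", "meta": None}]

segments = segment_stub("thai fod near epcot")
labels = [s["label"] for s in segments]
print(labels)  # ['topic', 'none', 'location']
```

Fusing the two tasks into one prompt means downstream consumers read corrections and labels off the same structure.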

Architectural components (2025-02-04 snapshot)

                 raw query text
        ┌──────────────────────────────┐
        │   RAG side-input assembly    │
        │   (seg:  businesses viewed)  │
        │   (rh:   business categories)│
        └──────────┬───────────────────┘
┌──────────────────────────────────────────────────┐
│  Three-tier cascade (cache → batch → realtime)   │
│                                                  │
│   ┌───────────────┐                              │
│   │  head cache   │ ◀── pre-computed by GPT-4    │
│   │  (expensive   │     or fine-tuned GPT-4o-mini│
│   │   LLM output) │     via OpenAI batch API     │
│   └──────┬────────┘                              │
│          │ hit                                   │
│     miss │                                       │
│          ▼                                       │
│   ┌───────────────┐                              │
│   │  fine-tuned   │ ◀── offline batch coverage   │
│   │  GPT-4o-mini  │     (95%+ for review highl.) │
│   └──────┬────────┘                              │
│          │ hit                                   │
│     miss │                                       │
│          ▼                                       │
│   ┌───────────────┐                              │
│   │  BERT / T5    │ ◀── realtime long-tail       │
│   │  (realtime)   │     serving                  │
│   └──────┬────────┘                              │
└──────────┼───────────────────────────────────────┘
        segmentation   phrase expansion   …
           │                 │
           ▼                 ▼
     ┌───────────┐     ┌───────────────┐
     │  Yelp     │     │  Review       │
     │  Search   │     │  highlighting │
     │  backend  │     │  sub-system   │
     └───────────┘     └───────────────┘

Notes per the post:

  • Cache layer: "caching (pre-computing) high-end LLM responses for only head queries above a certain frequency threshold". Generalised as concepts/query-frequency-power-law-caching.
  • Batch layer: fine-tuned GPT-4o-mini runs offline via OpenAI batch API calls; review-highlight expansion scaled to 95% of traffic via this path.
  • Realtime layer: BERT + T5 models serve the 5% tail that never hits the cache or batch path.
  • RAG side-inputs differ per task: segmentation uses "names of businesses that have been viewed for that query"; review-highlight uses "the most relevant business categories with respect to that query (from our in-house predictive model)".
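At serving time the cascade reduces to a lookup with two fallbacks. A minimal sketch, assuming dict-backed stores and a callable model (all names are hypothetical, not Yelp's code):

```python
# Minimal sketch of three-tier dispatch; all names are hypothetical.
# head_cache:  pre-computed GPT-4 / fine-tuned GPT-4o-mini outputs (tier 1)
# batch_store: wider offline table filled via OpenAI batch API runs (tier 2)
# realtime_model: BERT/T5 fallback for the uncached long tail (tier 3)

def understand(query, head_cache, batch_store, realtime_model):
    if query in head_cache:                       # tier 1: head queries
        return head_cache[query], "cache"
    if query in batch_store:                      # tier 2: batch coverage
        return batch_store[query], "batch"
    return realtime_model(query), "realtime"      # tier 3: long tail

# Toy usage
head = {"pizza": {"topic": "pizza"}}
batch = {"vegan pho near me": {"topic": "vegan pho", "location": "near me"}}
tail_model = lambda q: {"topic": q}               # stand-in for BERT/T5

print(understand("pizza", head, batch, tail_model)[1])                   # cache
print(understand("vegan pho near me", head, batch, tail_model)[1])       # batch
print(understand("left-handed llama cafe", head, batch, tail_model)[1])  # realtime
```

The design choice is that quality degrades gracefully down the tiers while latency and per-query cost stay bounded.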

Downstream consumers

  • Implicit location rewrite — when segmentation tags a {location} with high confidence, the search backend's geobox is rewritten to the refined location (within 30 miles of the user's original search). Canonical example: "epcot restaurants" → rewrite geobox from "Orlando, FL" to "Epcot, Bay Lake, FL". See concepts/implicit-query-location-rewrite.
  • Business-name matching — the token probability of the {name} tag is used as a continuous feature in Yelp's query-to-business-name matching + ranking system.
  • Auto-enabled filters — not fully detailed in the post, but listed as a downstream benefit of "more intelligent labeling of these tags."
  • Review-snippet selection — phrase expansion output feeds the review-highlight sub-system, which picks review passages matching the expanded phrases for display alongside search results.
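The location-rewrite consumer can be sketched as: trust the {location} tag only above a confidence threshold, resolve it to a place, and swap the geobox only if the place sits within the 30-mile radius. The threshold, gazetteer, and distance math below are all illustrative, not Yelp's implementation:

```python
# Hypothetical sketch of the implicit location-rewrite consumer.
# The confidence threshold, gazetteer, and coordinates are illustrative.
import math

def haversine_miles(a, b):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 3958.8 * 2 * math.asin(math.sqrt(h))

GAZETTEER = {  # toy stand-in for a real place resolver
    "epcot": ("Epcot, Bay Lake, FL", (28.3747, -81.5494)),
}

def maybe_rewrite(location_segment, confidence, user_center,
                  conf_threshold=0.9, max_miles=30):
    """Rewrite the geobox only for confident tags resolving nearby."""
    if confidence < conf_threshold:
        return None
    hit = GAZETTEER.get(location_segment.lower())
    if hit and haversine_miles(user_center, hit[1]) <= max_miles:
        return hit[0]
    return None

orlando = (28.5384, -81.3789)  # user's current search center
print(maybe_rewrite("epcot", 0.97, orlando))  # Epcot, Bay Lake, FL
print(maybe_rewrite("epcot", 0.40, orlando))  # None (low confidence)
```

The confidence gate matters because a false-positive {location} tag would silently move the user's search area.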

Build history (per the post's three-phase process)

Yelp's canonical three-phase lifecycle — see patterns/three-phase-llm-productionization — applied to both running examples:

  1. Formulation (GPT-4): decide output schema, merge tasks where possible (segmentation + spell correction fused), decide RAG side-inputs.
  2. Proof of Concept (head-cache with expensive-LLM output): pre-compute for head queries, wire up cache, offline + online evals. Review-highlight A/B "increased Session / Search CTR across our platforms"; location-rewrite "achieved online metric wins".
  3. Scaling Up: fine-tune GPT-4o-mini on the GPT-4-generated, curated golden dataset ("up to a 100x savings in cost"); pre-compute at tens-of-millions scale; deploy BERT/T5 for realtime tail.
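The distillation step in phase 3 amounts to turning the teacher's curated outputs into a fine-tuning file. A generic sketch of the data prep, assuming OpenAI's chat-format JSONL for fine-tuning; the curation predicate is a stand-in for the human re-labeling/removal step the post describes:

```python
# Sketch: build a fine-tuning JSONL from curated teacher (GPT-4) outputs.
# The chat-style rows follow OpenAI's fine-tuning file format; the
# looks_mislabeled flag stands in for human curation of suspect labels.
import json

def build_finetune_file(examples, path, system_prompt):
    """examples: iterable of (query, teacher_output, looks_mislabeled)."""
    kept = 0
    with open(path, "w") as f:
        for query, output, looks_mislabeled in examples:
            if looks_mislabeled:          # drop suspect teacher labels
                continue
            row = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
                {"role": "assistant", "content": json.dumps(output)},
            ]}
            f.write(json.dumps(row) + "\n")
            kept += 1
    return kept

examples = [
    ("thai food near epcot", {"topic": "thai food", "location": "epcot"}, False),
    ("asdfgh", {"topic": "asdfgh"}, True),  # flagged by curation, dropped
]
n = build_finetune_file(examples, "segmentation_ft.jsonl",
                        "Segment the Yelp search query.")
print(n)  # 1
```

Dropping flagged rows rather than keeping them is what limits (but does not eliminate) teacher-bias inheritance.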

Tradeoffs / gotchas

  • Cache freshness is tied to query-distribution drift. Head queries can change seasonally ("Christmas tree", "superbowl party"); the cache needs a refresh cadence appropriate to the drift rate. The post doesn't disclose Yelp's exact cadence.
  • Tail-query quality gap. BERT/T5 realtime output is implicitly lower quality than the offline GPT-4o-mini output; the gap is asserted but not measured in the post.
  • Fine-tuned student inherits teacher's biases. The curation step ("isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal") mitigates but doesn't eliminate this.
  • RAG side-input pipeline is a separate dependency. The "businesses viewed for that query" and "predicted categories" are Yelp's own in-house signals; a user without a comparable signal infrastructure cannot trivially replicate the pattern.
  • Prompt-caching not mentioned. Yelp caches output, not prompt prefixes — a potential future cost-reduction axis.
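The economics of the head cache follow from the power law: under a Zipf-like query distribution, a small cached head covers a large share of traffic. A back-of-the-envelope sketch; the exponent and counts are made up, not Yelp's numbers:

```python
# Back-of-the-envelope: traffic coverage from caching the top-K distinct
# queries under a Zipf(s=1) frequency distribution. Numbers illustrative.

def zipf_coverage(total_queries, cached_head, s=1.0):
    """Fraction of query *traffic* covered by caching the top-K queries."""
    weights = [1 / (rank ** s) for rank in range(1, total_queries + 1)]
    return sum(weights[:cached_head]) / sum(weights)

# e.g. 1M distinct queries; cache the top 50k (5% of distinct queries)
cov = zipf_coverage(1_000_000, 50_000)
print(f"{cov:.0%}")  # → 79%
```

Caching 5% of distinct queries covering ~79% of traffic is why the expensive-LLM tier is affordable, and also why the cache refresh cadence matters: the identity of the head drifts even when the power law doesn't.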
