Yelp Biz Ask Anything (BAA)

Definition

Yelp Biz Ask Anything (BAA) is Yelp's production LLM-powered business-page question-answering system — the business-page evolution of Yelp Assistant (which originally shipped in 2024 for service-pro diagnosis). On a business page, users ask free-form questions like "Is the patio heated?", "Is this place good for kids?", or "Recommend a three-course dinner with two mains for two people, where one is vegetarian", and BAA returns a single, evidence-backed, citation-linked answer streamed token-by-token to the UI. The system is canonicalised by the 2026-03-27 Yelp Engineering post "Building Biz Ask Anything: From Prototype to Product" (sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product).

BAA is the second LLM-production system Yelp has publicly disclosed, after Yelp Query Understanding (2025-02-04). The two systems operate at complementary altitudes:

System                     Altitude                                                               LLM role
Yelp Query Understanding   Rewrites the user's query before search retrieval                      Query → tagged segments + expansions
Yelp BAA                   Answers a user's question from a single business's retrieved content   Retrieved content + question → answer

BAA does not replace Yelp's long-form reviews — the product position is "support both behaviors, using AI to surface direct answers while preserving the value of in-depth reviews."

Life of a question (end-to-end flow)

     user question
┌──────────────────────────────────────────────────────┐
│         Conversation Context assembly                 │
│  (recent chat history for this user+business pair)    │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Question Analysis — 4 classifiers in parallel     │
│                                                      │
│   ┌──────────────┐ ┌──────────────┐                 │
│   │ Trust &      │ │ Inquiry Type │                 │
│   │ Safety       │ │ classifier   │                 │
│   │ (ft nano)    │ │ (ft nano)    │                 │
│   └──────┬───────┘ └──────┬───────┘                 │
│          │ unsafe          │ out-of-scope            │
│          └────┬─────────┬──┘                        │
│               ▼         ▼                           │
│         cancel + return templated / redirect        │
│                                                     │
│   ┌──────────────┐ ┌──────────────┐                │
│   │ Content      │ │ Keyword      │                │
│   │ Source       │ │ Generation   │                │
│   │ Selection    │ │ (vertical-   │                │
│   │ (ft model)   │ │  specific    │                │
│   │              │ │  expansions) │                │
│   └──────┬───────┘ └──────┬───────┘                │
│          └────────┬───────┘                        │
└─────────────────┬─┴──────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│     Retrieval via content-fetching engine            │
│   (NRT reviews + NRT photos + NRT website/AtC        │
│    + Cassandra EAV for structured facts)             │
│        ─── p95 < 100 ms ───                          │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Prompt composition                                │
│   (dynamic prompt assembly via semantic search over  │
│    few-shot examples + keyword-snippet extraction    │
│    via Aho-Corasick on retrieved text)               │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Answer Generation (streaming)                      │
│   Answer LLM → token stream via FastAPI SSE          │
│   OpenAI priority tier                               │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Answer Augmentation                                │
│    (attach relevant photos / visuals)                 │
└──────────────────────┬───────────────────────────────┘
                  UI (streamed)
┌──────────────────────────────────────────────────────┐
│    Logs + traces → Langfuse                          │
│    Daily batch: LLM-as-judge graders score sampled    │
│    Q/A pairs on Correctness + Completeness + Evidence │
│    Relevance → dataset on Langfuse                   │
└──────────────────────────────────────────────────────┘

Architectural components

Data layer

Three near-real-time indices plus a Cassandra structured-facts store, fronted by a single content-fetching API (systems/yelp-content-fetching-engine):

  • Reviews NRT index — each review is a document; <10 min freshness via streaming from SoT databases.
  • Photos NRT index — photo metadata + embeddings + photo captions; <10 min freshness via streaming.
  • Website / menu / Ask-the-Community NRT index — weekly batches (slower-moving content).
  • Cassandra structured-facts store — EAV schema (business_id, field_name, field_group, value, update_ts); the deliberately non-normalised shape is acceptable because the downstream consumer is an LLM that accepts unstructured strings. See concepts/eav-schema-for-llm-consumption.
  • Ingestion — streaming from SoT → joins/transforms → data pipeline → Cassandra + NRT indexers (reviews, photos, structured info); weekly batches for websites, menus, AtC. Replayability + idempotent upserts are required because "some datasets are derived from chains of 3-4 streams".
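To make the EAV shape concrete, here is a minimal sketch of how rows in that schema could be flattened into plain strings for the answer LLM, with last-write-wins deduplication standing in for the idempotent upserts the ingestion pipeline requires. All names and the rendering format are hypothetical; the post only specifies the five columns.

```python
from dataclasses import dataclass

@dataclass
class EavRow:
    """One structured-fact row in the EAV shape described above."""
    business_id: str
    field_name: str
    field_group: str
    value: str
    update_ts: int

def facts_to_prompt_lines(rows: list[EavRow]) -> list[str]:
    """Flatten EAV rows into plain strings an answer LLM can consume.

    Keeps only the newest row per (field_group, field_name), mimicking
    an idempotent last-write-wins upsert under stream replays.
    """
    latest: dict[tuple[str, str], EavRow] = {}
    for row in rows:
        key = (row.field_group, row.field_name)
        if key not in latest or row.update_ts > latest[key].update_ts:
            latest[key] = row
    return [f"{r.field_group} / {r.field_name}: {r.value}" for r in latest.values()]

rows = [
    EavRow("biz-1", "outdoor_seating", "amenities", "yes", 100),
    EavRow("biz-1", "outdoor_seating", "amenities", "yes, heated patio", 200),
    EavRow("biz-1", "price_range", "basics", "$$", 150),
]
print(facts_to_prompt_lines(rows))
```

Because the consumer is an LLM, the renderer never needs a per-field schema: any new field_name flows through as another string line, which is the point of the non-normalised design.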

Question analysis

Four components running in parallel via async langchain chains — the pipeline waits on the longest agent (not the sum):

  • Trust & Safety classifier (concepts/trust-and-safety-classifier) — fine-tuned GPT-4.1-nano on a few thousand question-label pairs (~50% legitimate questions mixed in as negatives). Labels unsafe questions (system attacks, illegal-activity prompts, prompt injection, etc.). On reject: templated safe answer, downstream work cancelled.
  • Inquiry Type classifier (concepts/inquiry-type-classifier) — fine-tuned GPT-4.1-nano on ~7K samples. Decides if the question is in-scope for a single-business, content-grounded answer. On out-of-scope: graceful decline + link to the right Yelp surface ("Show me good plumbers" → search; "How do I change my password?" → support).
  • Content Source Selection — returns the subset of sources (reviews / photos / menus / AtC / structured facts) to consult for this question, balancing topical relevance with subjectivity/objectivity profile (ambiance questions favour reviews; prices favour menus).
  • Keyword Generation — returns the search terms/phrases to pull from the chosen sources, with vertical-specific expansions based on business context:
      • Hair salon + "vegan options" → animal-free / cruelty-free / plant-based / vegan shampoo / vegan conditioner.
      • Mexican restaurant + "vegan options" → bean burrito / vegan tacos / dairy-free / cauliflower / nopales / tacos de pappa.
      • For generic prompts ("What should I know about this place?") it returns no keywords, so the retrieval path fetches recent content without term constraints.

Content Source Selection + Keyword Generation were originally a single LLM pass; Yelp split them into two fine-tuned models because "splitting into two small models improved precision and made failures easier to debug and lowered latency." See patterns/split-source-selection-from-keyword-generation.
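A minimal asyncio sketch of the fan-out described above — four stubbed classifiers run concurrently so latency is bounded by the slowest, with early returns on unsafe or out-of-scope questions. The classifier logic here is fake placeholder code; in production each call is a fine-tuned model, and a real implementation could additionally cancel the slower tasks the moment Trust & Safety rejects.

```python
import asyncio

# Stubbed classifiers; in production each would be a fine-tuned-model call.
async def trust_and_safety(q: str) -> bool:
    await asyncio.sleep(0.01)
    return "DROP TABLE" not in q          # True = safe (toy heuristic)

async def inquiry_type(q: str) -> bool:
    await asyncio.sleep(0.02)
    return "password" not in q            # True = in-scope (toy heuristic)

async def content_sources(q: str) -> list[str]:
    await asyncio.sleep(0.03)
    return ["reviews", "menus"]

async def keywords(q: str) -> list[str]:
    await asyncio.sleep(0.03)
    return ["vegan", "plant-based"]

async def analyse(question: str) -> dict:
    """Run all four analyses in parallel; wall-clock cost is the max,
    not the sum. Unsafe / out-of-scope questions short-circuit before
    any retrieval or generation work is scheduled."""
    safe, in_scope, sources, terms = await asyncio.gather(
        trust_and_safety(question),
        inquiry_type(question),
        content_sources(question),
        keywords(question),
    )
    if not safe:
        return {"outcome": "templated_safe_answer"}
    if not in_scope:
        return {"outcome": "redirect"}
    return {"outcome": "retrieve", "sources": sources, "keywords": terms}

print(asyncio.run(analyse("Is the patio heated?")))
```

The same shape works with langchain's async chains in place of the bare coroutines; `asyncio.gather` is the simplest stand-in for "waits on the longest agent, not the sum".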

Retrieval

  • Keyword-first for reviews / website / menus / AtC — Yelp's pre-existing IR/ranking stack + LLM keyword expansion.
  • Embedding + caption-text hybrid for photos.
  • No keywords branch for generic prompts — fetches recent content without term constraints.
  • Delivered via systems/yelp-content-fetching-engine at <100 ms p95.
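The keyword-first vs. no-keywords branching can be sketched as a tiny dispatcher over the content-fetching engine. The backend callables here are stubs I've made up for illustration; the photo embedding path is omitted.

```python
from typing import Callable

def fetch_content(
    sources: list[str],
    keywords: list[str],
    by_keywords: Callable[[str, list[str]], list[str]],
    recent: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Keyword-constrained retrieval when Keyword Generation produced
    terms; otherwise fall back to recent content (generic prompts)."""
    if keywords:
        return {s: by_keywords(s, keywords) for s in sources}
    return {s: recent(s) for s in sources}

# Stub backends standing in for the content-fetching engine.
by_kw = lambda s, kw: [f"{s} hit for {k}" for k in kw]
recent = lambda s: [f"recent {s} item"]

print(fetch_content(["reviews"], ["vegan"], by_kw, recent))
print(fetch_content(["reviews"], [], by_kw, recent))
```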

Prompt composition

  • Snippet extraction via Aho-Corasick + sliding window — the generated keywords are matched against retrieved text to extract local snippets, shrinking the user-prompt context vs. shipping full review bodies.
  • Dynamic prompt composition — instead of a single static monolithic prompt, per-request assembly retrieves only the instructions, examples, and constraints relevant to the detected question type and content sources via semantic search over the few-shot-example corpus.
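A sketch of the snippet-extraction idea: match the generated keywords against retrieved text and keep only a local window around each hit, instead of shipping full review bodies. A naive per-keyword scan stands in here for the Aho-Corasick automaton (which matches all patterns in one pass); the windowing logic is the same idea, and the window size is an invented parameter.

```python
def extract_snippets(text: str, keywords: list[str], window: int = 60) -> list[str]:
    """Return a snippet of ±window characters around every keyword hit.

    Naive scan for clarity; production would run one Aho-Corasick pass
    over the text for all keywords simultaneously.
    """
    snippets = []
    low = text.lower()
    for kw in keywords:
        start = 0
        while (i := low.find(kw.lower(), start)) != -1:
            lo = max(0, i - window)
            hi = min(len(text), i + len(kw) + window)
            snippets.append(text[lo:hi].strip())
            start = i + len(kw)
    return snippets

review = ("Came for brunch; the heated patio was lovely even in November, "
          "and the vegan tacos were a standout.")
print(extract_snippets(review, ["heated patio", "vegan"], window=20))
```

The payoff is in input tokens: two short windows reach the answer LLM instead of the whole review, which is the context-shrinkage lever cited under Cost below.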

Answer generation

  • Frontier OpenAI model for the answering step (the specific variant isn't disclosed in the post).
  • Streaming via FastAPI SSE — migrated from Pyramid so response tokens stream to the UI as soon as the LLM emits them. "This was the biggest win" for TTFT.
  • OpenAI priority tier — gives "~20% inference speedup".
  • Citations inline — the model is prompted to cite retrieved evidence.
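The streaming path can be illustrated with a stdlib-only sketch of the Server-Sent Events framing: each LLM token is wrapped in a `data:` frame and yielded immediately, so time-to-first-token is bounded by the first token rather than the full answer. The `[DONE]` sentinel and the fake token source are assumptions; in production this async generator would back a FastAPI `StreamingResponse` with media type `text/event-stream`.

```python
import asyncio
from typing import AsyncIterator

def sse_frame(token: str) -> str:
    """Wrap one token in the Server-Sent Events wire format."""
    return f"data: {token}\n\n"

async def stream_answer(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield an SSE frame as soon as the LLM emits each token."""
    async for tok in tokens:
        yield sse_frame(tok)
    yield "data: [DONE]\n\n"          # hypothetical end-of-stream sentinel

async def fake_llm() -> AsyncIterator[str]:
    """Stand-in for the answer model's token stream."""
    for tok in ["The", " patio", " is", " heated."]:
        await asyncio.sleep(0)        # stand-in for per-token model latency
        yield tok

async def main() -> list[str]:
    return [frame async for frame in stream_answer(fake_llm())]

print(asyncio.run(main()))
```

Nothing here buffers the whole answer — that is the property that made the Pyramid → FastAPI SSE migration "the biggest win" for TTFT.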

Answer augmentation

  • After text streams, relevant photos / visuals are attached to the answer.

Delivery & observability

  • SSE to the UI (token-by-token).
  • Logs + traces written per question.
  • Langfuse dataset — daily batch runs LLM-as-judge graders on sampled production traffic, computes rolling averages per quality dimension, stores as time series on Langfuse.

Quality assessment

BAA's quality is defined as a five-to-six-dimension product spec, with three dimensions mechanised as LLM-as-judge graders and the remaining two or three deferred:

Dimension                    Grader                             Labels
Correctness / Faithfulness   Yes (Langfuse batch)               CORRECT / UNVERIFIABLE / INCORRECT
Completeness / Helpfulness   Yes                                SUFFICIENT / NEEDS_FOLLOW_UP / REFUSE_UNABLE / OFF_TOPIC
Evidence Relevance           Yes                                RELEVANT / PARTIALLY_RELEVANT / NOT_RELEVANT
Conciseness                  No — spot-checked
Structure                    No — spot-checked
Tone / Style                 Deferred — judged "a lot harder"

Each automated grader was tuned on hundreds of manually-labelled gold answers; daily batch runs on sampled production traffic produce rolling averages used to catch regressions. See concepts/llm-as-judge (extended Seen-in).
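The rolling-average mechanics can be sketched as follows. The label-to-score mapping is an assumption — the post does not say how grader labels are numerically aggregated — but the shape (daily label batches → daily means → rolling mean time series) matches the description.

```python
# Hypothetical label→score mapping for the Correctness grader; the post
# does not disclose the actual aggregation.
SCORE = {"CORRECT": 1.0, "UNVERIFIABLE": 0.5, "INCORRECT": 0.0}

def rolling_averages(daily_labels: list[list[str]], window: int = 7) -> list[float]:
    """Turn daily batches of grader labels into a rolling-mean time
    series — the shape stored on Langfuse to catch regressions."""
    daily_means = [sum(SCORE[l] for l in day) / len(day) for day in daily_labels]
    out = []
    for i in range(len(daily_means)):
        chunk = daily_means[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

batches = [["CORRECT", "CORRECT", "INCORRECT"],
           ["CORRECT", "UNVERIFIABLE"],
           ["INCORRECT", "INCORRECT", "CORRECT"]]
print(rolling_averages(batches, window=2))
```

A drop in the rolling mean across consecutive batch runs is the regression signal; with a daily batch cadence, detection latency is up to one day (see Tradeoffs below).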

Suggested-question UX

  • v1: LLM-generated generic questions per business category (Mexican, bar, park). Hit "unanswerable with the available data" cases.
  • v2: generated from the specific business's own content — a handful of recent reviews + pre-generated business summary + owner description. For Parc (Philadelphia), v2 surfaces "Can you order freshly baked bread to go?" — content-grounded to Parc's signature bread basket — instead of a generic breakfast question.
  • Outcome: +~50% engagement with suggested questions; -~26% inability-to-answer rate on suggested questions.
  • See patterns/content-derived-suggested-questions.

Performance

  • Prototype: p75 10-20 s end-to-end.
  • External launch target: p75 = 5 s on the backend LLM service.
  • Shipped: p75 < 3 s.
  • Biggest wins: streaming via FastAPI SSE for TTFT; OpenAI priority tier (~20% inference speedup); async langchain for parallel question analysis; early stopping on T&S rejection.

Cost

  • Cost at 25% of prototype via:
      • Fine-tuned GPT-4.1-nano for question analysis (vs. frontier models at prototype).
      • Aho-Corasick snippet extraction to shrink the user-prompt context.
      • Biz-content cleanup on website/menu text (the largest prompt contributors).
      • Dynamic prompt composition to claw back the system-prompt growth that had eroded the input-token savings.
  • Question-answering model: Yelp explicitly stayed on frontier OpenAI models for the answer-generation step — "smaller models struggled despite extensive fine-tuning" — but migrated to a newer OpenAI model that delivered equivalent quality at lower cost. A quality grader gated the migration.

Recent-incident mitigation / debugging

Not explicitly detailed in the post beyond the general observability stance: logs + traces per question → Langfuse → daily grader batch. Yelp frames style-and-tone divergence as caught by "spot checking real traffic for edge-cases that we don't handle as we would like to" — human-in-the-loop rather than automated.

Tradeoffs / gotchas

  • Keyword-first retrieval adds ~0.8 s to the request path — the LLM keyword-expansion call is a blocking synchronous hop. Yelp accepts this for v1 on the basis that the mature keyword-search backend was production-ready sooner than an embeddings stack; embedding-based retrieval is explicitly called out as a followup.
  • Recall is weaker on non-specific user questions — keyword-first retrieval is a poor fit for "What's the ambiance here?"-type fuzzy-intent questions; embeddings are the planned mitigation.
  • Daily grader batch latency — regressions have up-to-one-day detection latency; no real-time quality alerting disclosed.
  • Style & Tone not automated — brand-voice quality relies on spot-checking + prompt-level guidance.
  • Single-point-of-failure on OpenAI — BAA is OpenAI-centric (GPT-4.1-nano fine-tuned base, priority tier, batch API). Multi-provider routing is not mentioned.
  • Prompt-size tug-of-war — Yelp's cost-optimisation cycle saved input tokens via context shrinkage, then lost those savings to system-prompt growth, then clawed them back via dynamic prompt composition. The absolute prompt-size metric isn't stable over time.
  • Head cache absent from BAA (unlike Yelp Query Understanding) — QU relies on a head-query cache because queries repeat. BAA's question + business-pair combinatorics is much higher; no analogous cache layer is mentioned. Suggested-question answers are flagged as a future short-term-cache target.
  • Conversation-context freshness — BAA augments with "recent chat history for that user-business pair", but the recency window / turn-count isn't disclosed.

Future work (flagged in the post)

  • Embeddings-based retrieval for review snippets and website content, plus ranking improvements.
  • Side-by-side business comparisons (new capability beyond single-business QA).
  • Multimedia inputs (images/voice).
  • Advanced caching — includes short-term caching of suggested-question answers to accelerate that UX.
  • More advanced dynamic prompt composition.
  • Search capability inside Yelp Assistant.
