Yelp Biz Ask Anything (BAA)¶
Definition¶
Yelp Biz Ask Anything (BAA) is Yelp's production LLM-powered business-page question-answering system — the business-page evolution of Yelp Assistant (which originally shipped in 2024 for service-pro diagnosis). On a business page, users ask free-form questions like "Is the patio heated?", "Is this place good for kids?", or "Recommend a three-course dinner with two mains for two people, where one is vegetarian" and BAA returns a single, evidence-backed, citation-linked answer streamed token-by-token to the UI. The system is canonicalised by the 2026-03-27 Yelp Engineering post "Building Biz Ask Anything: From Prototype to Product" (sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product).
BAA is the second production LLM system Yelp has publicly disclosed, after Yelp Query Understanding (2025-02-04). The two systems operate at complementary altitudes:
| System | Altitude | LLM role |
|---|---|---|
| Yelp Query Understanding | Rewrites the user's query before search retrieval | Query → tagged segments + expansions |
| Yelp BAA | Answers a user's question from a single business's retrieved content | Retrieved content + question → answer |
BAA does not replace Yelp's long-form reviews — the product position is "support both behaviors, using AI to surface direct answers while preserving the value of in-depth reviews."
Life of a question (end-to-end flow)¶
user question
│
▼
┌──────────────────────────────────────────────────────┐
│ Conversation Context assembly │
│ (recent chat history for this user+business pair) │
└──────────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Question Analysis — 4 classifiers in parallel │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Trust & │ │ Inquiry Type │ │
│ │ Safety │ │ classifier │ │
│ │ (ft nano) │ │ (ft nano) │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ unsafe │ out-of-scope │
│ └────┬─────────┬──┘ │
│ ▼ ▼ │
│ cancel + return templated / redirect │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Content │ │ Keyword │ │
│ │ Source │ │ Generation │ │
│ │ Selection │ │ (vertical- │ │
│ │ (ft model) │ │ specific │ │
│ │ │ │ expansions) │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ └────────┬───────┘ │
└─────────────────┬─┴──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Retrieval via content-fetching engine │
│ (NRT reviews + NRT photos + NRT website/AtC │
│ + Cassandra EAV for structured facts) │
│ ─── p95 < 100 ms ─── │
└──────────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Prompt composition │
│ (dynamic prompt assembly via semantic search over │
│ few-shot examples + keyword-snippet extraction │
│ via Aho-Corasick on retrieved text) │
└──────────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Answer Generation (streaming) │
│ Answer LLM → token stream via FastAPI SSE │
│ OpenAI priority tier │
└──────────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Answer Augmentation │
│ (attach relevant photos / visuals) │
└──────────────────────┬───────────────────────────────┘
│
▼
UI (streamed)
│
▼
┌──────────────────────────────────────────────────────┐
│ Logs + traces → Langfuse │
│ Daily batch: LLM-as-judge graders score sampled │
│ Q/A pairs on Correctness + Completeness + Evidence │
│ Relevance → dataset on Langfuse │
└──────────────────────────────────────────────────────┘
Architectural components¶
Data layer¶
Three near-real-time indices plus a Cassandra structured-facts store, fronted by a single content-fetching API (systems/yelp-content-fetching-engine):
- Reviews NRT index — each review is a document; <10 min freshness via streaming from SoT databases.
- Photos NRT index — photo metadata + embeddings + photo captions; <10 min freshness via streaming.
- Website / menu / Ask-the-Community NRT index — weekly batches (slower-moving content).
- Cassandra structured-facts store — EAV schema
(business_id, field_name, field_group, value, update_ts); the deliberately non-normalised shape is permissioned by the fact that the downstream consumer is an LLM that accepts unstructured strings. See concepts/eav-schema-for-llm-consumption.
- Ingestion — streaming from SoT → joins/transforms → data pipeline → Cassandra + NRT indexers (reviews, photos, structured info); weekly batches for websites, menus, AtC. Replayability + idempotent upserts are required because "some datasets are derived from chains of 3-4 streams".
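The appeal of the EAV shape is that rows flatten trivially into plain strings an LLM can read. A minimal sketch of that flattening — function name, row dicts, and output format are all illustrative, not Yelp's schema or code:

```python
from collections import defaultdict

def render_structured_facts(rows):
    """Flatten EAV rows (business_id, field_name, field_group, value, ...)
    into plain text an answer LLM can consume directly — no normalised
    schema or per-field typing required."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["field_group"]].append(f"{row['field_name']}: {row['value']}")
    return "\n".join(
        f"[{group}] " + "; ".join(facts) for group, facts in grouped.items()
    )

rows = [
    {"field_group": "amenities", "field_name": "outdoor_seating", "value": "heated patio"},
    {"field_group": "amenities", "field_name": "kid_friendly", "value": "yes"},
    {"field_group": "hours", "field_name": "sunday", "value": "10:00-22:00"},
]
print(render_structured_facts(rows))
```

Because the consumer is an LLM, adding a new fact type is just a new `field_name` value — no schema migration.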
Question analysis¶
Four components running in parallel via async langchain chains — the pipeline waits on the longest agent (not the sum):
- Trust & Safety classifier (concepts/trust-and-safety-classifier) — fine-tuned GPT-4.1-nano on a few thousand question-label pairs (~50% legitimate questions mixed in as negatives). Labels unsafe questions (system attacks, illegal-activity prompts, prompt injection, etc.). On reject: templated safe answer, downstream work cancelled.
- Inquiry Type classifier (concepts/inquiry-type-classifier) — fine-tuned GPT-4.1-nano on ~7K samples. Decides if the question is in-scope for a single-business, content-grounded answer. On out-of-scope: graceful decline + link to the right Yelp surface ("Show me good plumbers" → search; "How do I change my password?" → support).
- Content Source Selection — returns the subset of sources (reviews / photos / menus / AtC / structured facts) to consult for this question, balancing topical relevance with subjectivity/objectivity profile (ambiance questions favour reviews; prices favour menus).
- Keyword Generation — returns the search terms/phrases to pull from the chosen sources, with vertical-specific expansions based on business context:
- Hair salon + "vegan options" → animal-free / cruelty-free / plant-based / vegan shampoo / vegan conditioner.
- Mexican restaurant + "vegan options" → bean burrito / vegan tacos / dairy-free / cauliflower / nopales / tacos de papa.
- For generic prompts ("What should I know about this place?") — returns no keywords, so the retrieval path fetches recent content without term constraints.
Content Source Selection + Keyword Generation were originally a single LLM pass; Yelp split them into two fine-tuned models because "splitting into two small models improved precision and made failures easier to debug and lowered latency." See patterns/split-source-selection-from-keyword-generation.
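The wait-on-the-longest-agent behaviour falls out of a plain async fan-out. A minimal sketch, assuming stand-in classifier functions (the names, return shapes, and the toy unsafe-string check are illustrative, not Yelp's models); for simplicity this version lets all four branches finish rather than cancelling in-flight work on a T&S reject:

```python
import asyncio

# Illustrative stand-ins for the four classifier calls; each sleep
# models an LLM round-trip of a different duration.
async def trust_and_safety(q):
    await asyncio.sleep(0.01)
    return {"safe": "ignore previous instructions" not in q.lower()}

async def inquiry_type(q):
    await asyncio.sleep(0.02)
    return {"in_scope": True}

async def content_source_selection(q):
    await asyncio.sleep(0.03)
    return {"sources": ["reviews", "menus"]}

async def keyword_generation(q):
    await asyncio.sleep(0.02)
    return {"keywords": ["vegan", "plant-based"]}

async def analyse(question):
    # Fan out all four classifiers; total latency is the slowest
    # branch (~0.03 s here), not the sum of the four.
    ts, scope, sources, keywords = await asyncio.gather(
        trust_and_safety(question),
        inquiry_type(question),
        content_source_selection(question),
        keyword_generation(question),
    )
    if not ts["safe"]:
        return {"action": "templated_safe_answer"}  # early stop: skip retrieval + generation
    if not scope["in_scope"]:
        return {"action": "redirect"}
    return {"action": "answer", **sources, **keywords}

print(asyncio.run(analyse("Is the patio heated?")))
```

The production system goes further and cancels downstream work as soon as the T&S classifier rejects, rather than waiting on the siblings.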
Retrieval¶
- Keyword-first for reviews / website / menus / AtC — Yelp's pre-existing IR/ranking stack + LLM keyword expansion.
- Embedding + caption-text hybrid for photos.
- No keywords branch for generic prompts — fetches recent content without term constraints.
- Delivered via systems/yelp-content-fetching-engine at <100 ms p95.
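The embedding + caption-text hybrid for photos can be pictured as a blended score. A sketch under assumed names — the linear blend, `alpha`, and the toy vectors are illustrative choices, not Yelp's actual ranking function:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def hybrid_photo_score(query_vec, photo, keywords, alpha=0.7):
    """Blend embedding similarity with caption keyword coverage.
    alpha weights the embedding channel vs. the caption-text channel."""
    emb = cosine(query_vec, photo["embedding"])
    caption = photo["caption"].lower()
    kw = sum(k.lower() in caption for k in keywords) / max(len(keywords), 1)
    return alpha * emb + (1 - alpha) * kw

photos = [
    {"caption": "Heated patio at night", "embedding": [0.9, 0.1, 0.0]},
    {"caption": "Chocolate dessert", "embedding": [0.1, 0.9, 0.2]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "Is the patio heated?"
ranked = sorted(
    photos,
    key=lambda p: hybrid_photo_score(query_vec, p, ["patio", "heated"]),
    reverse=True,
)
print(ranked[0]["caption"])
```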
Prompt composition¶
- Snippet extraction via Aho-Corasick + sliding window — the generated keywords are matched against retrieved text to extract local snippets, shrinking the user-prompt context vs. shipping full review bodies.
- Dynamic prompt composition — instead of a single static monolithic prompt, per-request assembly retrieves only the instructions, examples, and constraints relevant to the detected question type and content sources via semantic search over the few-shot-example corpus.
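The snippet-extraction step can be sketched as keyword hits plus a character window. For clarity this uses naive substring search in place of a real Aho-Corasick automaton (which matches all keywords in one pass over the text, e.g. via pyahocorasick); the function name and window size are illustrative:

```python
def extract_snippets(text, keywords, window=60):
    """Return short windows of text around each keyword hit, so the
    answer prompt carries local snippets instead of full review bodies."""
    snippets = []
    lowered = text.lower()
    for kw in keywords:
        start = 0
        while (hit := lowered.find(kw.lower(), start)) != -1:
            lo = max(0, hit - window)
            hi = min(len(text), hit + len(kw) + window)
            snippets.append(text[lo:hi].strip())
            start = hit + len(kw)
    return snippets

review = ("We sat outside all evening - the patio is fully heated, "
          "and the vegan tacos were excellent.")
for s in extract_snippets(review, ["heated", "vegan"]):
    print("...", s, "...")
```

The payoff is purely in input tokens: each hit contributes at most `2 * window + len(keyword)` characters to the prompt, regardless of review length.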
Answer generation¶
- Frontier OpenAI model for the answering step (the specific variant isn't disclosed in the post).
- Streaming via FastAPI SSE — migrated from Pyramid so response tokens stream to the UI as soon as the LLM emits them. "This was the biggest win" for TTFT.
- OpenAI priority tier — gives "~20% inference speedup".
- Citations inline — the model is prompted to cite retrieved evidence.
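The SSE wire format itself is simple. A stripped-down sketch of the framing — in the real system a FastAPI `StreamingResponse` with `media_type="text/event-stream"` would serve an async generator reading from the OpenAI stream; here the token source is faked and all names are illustrative:

```python
import asyncio

async def fake_llm_tokens():
    # Stand-in for the streaming answer-LLM response.
    for tok in ["Yes, ", "the ", "patio ", "is ", "heated."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield tok

def sse_frame(data: str) -> str:
    # Server-Sent Events wire format: one "data: <payload>\n\n" per event.
    return f"data: {data}\n\n"

async def stream_answer():
    # In FastAPI this body would `yield` each frame to the client,
    # so the first token reaches the UI before generation finishes.
    frames = []
    async for tok in fake_llm_tokens():
        frames.append(sse_frame(tok))
    frames.append(sse_frame("[DONE]"))
    return frames

frames = asyncio.run(stream_answer())
print("".join(frames))
```

TTFT improves because the client renders `data:` events as they arrive instead of waiting for the whole answer body.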
Answer augmentation¶
- After text streams, relevant photos / visuals are attached to the answer.
Delivery & observability¶
- SSE to the UI (token-by-token).
- Logs + traces written per question.
- Langfuse dataset — daily batch runs LLM-as-judge graders on sampled production traffic, computes rolling averages per quality dimension, stores as time series on Langfuse.
Quality assessment¶
BAA's quality is defined as a six-dimension product spec: three dimensions are mechanised as LLM-as-judge graders, two are spot-checked manually, and one is deferred:
| Dimension | Grader | Labels |
|---|---|---|
| Correctness / Faithfulness | Yes (Langfuse batch) | CORRECT / UNVERIFIABLE / INCORRECT |
| Completeness / Helpfulness | Yes | SUFFICIENT / NEEDS_FOLLOW_UP / REFUSE_UNABLE / OFF_TOPIC |
| Evidence Relevance | Yes | RELEVANT / PARTIALLY_RELEVANT / NOT_RELEVANT |
| Conciseness | No — spot-checked | — |
| Structure | No — spot-checked | — |
| Tone / Style | Deferred — judged "a lot harder" | — |
Each automated grader was tuned on hundreds of manually-labelled gold answers; daily batch runs on sampled production traffic produce rolling averages used to catch regressions. See concepts/llm-as-judge (extended Seen-in).
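The rolling-average mechanics reduce to a per-dimension pass rate over a trailing window of daily grader batches. A simplified stand-in for the Langfuse time series — the passing-label set comes from the table above, everything else (class name, window, pass/fail framing) is assumed:

```python
from collections import deque

class RollingQuality:
    """Track a rolling pass-rate per quality dimension from daily
    LLM-as-judge labels on sampled production Q/A pairs."""
    PASSING = {"CORRECT", "SUFFICIENT", "RELEVANT"}

    def __init__(self, window_days=7):
        self.history = {}  # dimension -> deque of daily pass-rates
        self.window = window_days

    def add_day(self, dimension, labels):
        rate = sum(l in self.PASSING for l in labels) / len(labels)
        self.history.setdefault(dimension, deque(maxlen=self.window)).append(rate)
        return rate

    def rolling_average(self, dimension):
        days = self.history[dimension]
        return sum(days) / len(days)

q = RollingQuality()
q.add_day("correctness", ["CORRECT", "CORRECT", "UNVERIFIABLE", "CORRECT"])
q.add_day("correctness", ["CORRECT", "INCORRECT", "CORRECT", "CORRECT"])
print(round(q.rolling_average("correctness"), 3))  # 0.75
```

A regression shows up as the rolling average dipping below its recent baseline, which is what the daily batch is there to catch.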
Suggested-question UX¶
- v1: LLM-generated generic questions per business category (Mexican, bar, park). Hit "unanswerable with the available data" cases.
- v2: generated from the specific business's own content — a handful of recent reviews + pre-generated business summary + owner description. For Parc (Philadelphia), v2 surfaces "Can you order freshly baked bread to go?" — content-grounded to Parc's signature bread basket — instead of a generic breakfast question.
- Outcome: +~50% engagement with suggested questions; -~26% inability-to-answer rate on suggested questions.
- See patterns/content-derived-suggested-questions.
Performance¶
- Prototype: p75 10-20 s end-to-end.
- External launch target: p75 = 5 s on the backend LLM service.
- Shipped: p75 < 3 s.
- Biggest wins: streaming via FastAPI SSE for TTFT; OpenAI priority tier (~20% inference speedup); async langchain for parallel question analysis; early stopping on T&S rejection.
Cost¶
- Cost at 25% of prototype via:
- Fine-tuned GPT-4.1-nano for question analysis (vs. frontier models at prototype).
- Aho-Corasick snippet extraction to shrink the user-prompt context.
- Biz-content cleanup on website/menu text (largest prompt contributors).
- Dynamic prompt composition to claw back the system-prompt growth that had eroded the input-token savings.
- Question-answering model: Yelp explicitly stayed on frontier OpenAI models for the answer-generation step — "smaller models struggled despite extensive fine-tuning" — but migrated to a newer OpenAI model that delivered equivalent quality at lower cost. A quality grader gated the migration.
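A grader-gated migration boils down to a no-regression check across the automated quality dimensions. A minimal sketch, assuming aggregate grader pass-rates per dimension (function name, dimension keys, and tolerance are illustrative, not Yelp's gate):

```python
def migration_gate(old_scores, new_scores,
                   dimensions=("correctness", "completeness", "evidence_relevance"),
                   tolerance=0.01):
    """Allow the cheaper answer model to ship only if the LLM-as-judge
    graders show no regression beyond `tolerance` on any dimension.
    Returns (passed, {dimension: regression_size})."""
    regressions = {
        d: old_scores[d] - new_scores[d]
        for d in dimensions
        if new_scores[d] < old_scores[d] - tolerance
    }
    return (len(regressions) == 0), regressions

ok, regs = migration_gate(
    old_scores={"correctness": 0.92, "completeness": 0.88, "evidence_relevance": 0.90},
    new_scores={"correctness": 0.93, "completeness": 0.88, "evidence_relevance": 0.89},
)
print(ok, regs)  # True {}
```

The gate makes the cost migration reversible: a blocked run names the regressing dimension instead of shipping a quietly worse model.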
Recent-incident mitigation / debugging¶
Not explicitly detailed in the post beyond the general observability stance: logs + traces per question → Langfuse → daily grader batch. Yelp frames style-and-tone divergence as caught by "spot checking real traffic for edge-cases that we don't handle as we would like to" — human-in-the-loop rather than automated.
Tradeoffs / gotchas¶
- Keyword-first retrieval adds ~0.8 s to the request path — the LLM keyword-expansion call is a blocking synchronous hop. Yelp accepts this for v1 on the basis that the mature keyword-search backend was production-ready sooner than an embeddings stack; embedding-based retrieval is explicitly called out as a follow-up.
- Recall is weaker on non-specific user questions — keyword-first retrieval is a poor fit for "What's the ambiance here?"-type fuzzy-intent questions; embeddings are the planned mitigation.
- Daily grader batch latency — regressions can take up to one day to detect; no real-time quality alerting is disclosed.
- Style & Tone not automated — brand-voice quality relies on spot-checking + prompt-level guidance.
- Single-point-of-failure on OpenAI — BAA is OpenAI-centric (GPT-4.1-nano fine-tuned base, priority tier, batch API). Multi-provider routing is not mentioned.
- Prompt-size tug-of-war — Yelp's cost-optimisation cycle saved input tokens via context shrinkage, then lost those savings to system-prompt growth, then clawed them back via dynamic prompt composition. The absolute prompt-size metric isn't stable over time.
- Head cache absent from BAA (unlike Yelp Query Understanding) — QU relies on a head-query cache because queries repeat. BAA's question + business combinatorics is much larger; no analogous cache layer is mentioned. Suggested-question answers are flagged as a future short-term-cache target.
- Conversation-context freshness — BAA augments with "recent chat history for that user-business pair", but the recency window / turn-count isn't disclosed.
Future work (flagged in the post)¶
- Embeddings-based retrieval for review snippets and website content, plus ranking improvements.
- Side-by-side business comparisons (new capability beyond single-business QA).
- Multimedia inputs (images/voice).
- Advanced caching — includes short-term caching of suggested-question answers to accelerate that UX.
- More advanced dynamic prompt composition.
- Search capability inside Yelp Assistant.
Seen in¶
- sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product — canonical first-party disclosure.
Related¶
- systems/yelp-assistant — the parent brand; BAA is the business-page evolution.
- systems/yelp-content-fetching-engine — the LLM-friendly data-access API BAA depends on.
- systems/yelp-query-understanding — sibling Yelp LLM system at the query-rewriting altitude.
- systems/yelp-search — Yelp's search backend (retrieval + ranking); BAA's content-fetching engine shares the same NRT-index primitives.
- systems/langfuse — LLM observability + batch grader substrate.
- systems/fastapi — Python web framework enabling SSE.
- systems/langchain — async-chain orchestration.
- systems/gpt-4-1-nano — fine-tuning base for T&S + Inquiry Type.
- systems/apache-cassandra — structured-facts EAV store.
- concepts/trust-and-safety-classifier — the safety gate.
- concepts/inquiry-type-classifier — the scope gate.
- concepts/content-grounded-answer — the product discipline.
- concepts/eav-schema-for-llm-consumption — the structured-facts schema permission.
- concepts/llm-as-judge — the grader substrate.
- concepts/time-to-first-token — the streaming-migration motivator.
- concepts/retrieval-augmented-generation — the overall architectural family.
- patterns/parallel-pre-retrieval-classifier-pipeline
- patterns/split-source-selection-from-keyword-generation
- patterns/dynamic-prompt-composition-via-semantic-retrieval
- patterns/aho-corasick-snippet-extraction
- patterns/content-derived-suggested-questions
- patterns/head-cache-plus-tail-finetuned-model — recursive application in the classifier fleet.
- patterns/three-phase-llm-productionization — Yelp's playbook applied to BAA.
- companies/yelp