Yelp — Building Biz Ask Anything: From Prototype to Product¶
Summary¶
Yelp Engineering post (2026-03-27) — the canonical first-party
disclosure of how Yelp productionised Biz Ask Anything (BAA):
a business-page LLM Q&A experience that retrieves evidence from
Yelp's business content (reviews, photos, menus, structured
facts, Ask the Community threads) and synthesises an
evidence-backed, citation-linked answer. BAA is the
business-page evolution of Yelp
Assistant (originally shipped in 2024 for service-pro
diagnosis) and represents a nine-month prototype→production
arc. The post organises the retrospective around six
challenges — data, question analysis, answer quality,
performance, cost, and user education — and discloses load-bearing
architectural decisions at each. Prototype-to-production
improvements reported: p75 latency 10-20s → <3s; cost 25%
of the prototype after fine-tuned small models + context
shrinkage; suggested-question engagement +~50% and
unable-to-answer rate -~26% after switching from
category-level to business-content-derived suggestions. Pipeline
shape: user question → (in parallel) Trust & Safety classifier
+ Inquiry Type classifier + Content Source Selection +
Keyword Generation →
content-fetching engine (reviews/photos/website NRT indices +
Cassandra EAV store) at <100 ms p95 → prompt composition →
streaming answer generation → answer augmentation with photos →
SSE delivery. Load-bearing data-layer decision: three
near-real-time indices (reviews + photos-with-embeddings +
website/menu/Ask-the-Community) at <10 min freshness for
streamed sources; weekly batches for slower-changing
websites/menus/AtC; Cassandra EAV (business_id, field_name,
field_group, value, update_ts) for structured facts "since
the data is consumed by an LLM that already expects unstructured
strings". Replayable streams (concepts/stream-replayability-for-iterative-pipelines
applied downstream) because "some datasets are derived from
chains of 3-4 streams, so replayability and idempotent upserts
are required." Four question-analysis components all running in
parallel async langchain: Trust & Safety (fine-tuned
GPT-4.1-nano on ~few-thousand question-label pairs, 50% negative
+ 50% synthetically-generated positives), Inquiry Type (fine-
tuned GPT-4.1-nano on ~7K samples), Content Source Selection
(split from Keyword Generation to "tune failure modes
independently and debug decisions in isolation"), Keyword
Generation (emits vertical-specific keyword expansions — "vegan
options" in a hair salon vs. Mexican restaurant produces
orthogonal expansions; returns no keywords for generic
prompts like "what should I know about this place?" so
recent-reviews without term constraints are fetched). Quality
is a multi-dimensional product spec — Correctness/Faithfulness
(CORRECT/UNVERIFIABLE/INCORRECT), Completeness (SUFFICIENT/
NEEDS_FOLLOW_UP/REFUSE_UNABLE/OFF_TOPIC), Evidence Relevance
(RELEVANT/PARTIALLY_RELEVANT/NOT_RELEVANT) — each scored by a
Langfuse-based LLM-as-judge grader in daily batch, with
running averages to catch regressions; Style & Tone deferred
as "a lot harder than the more objective tasks of correctness,
answer completeness, and evidence relevance evaluations".
Performance wins: streaming via FastAPI SSE (migrated from
Pyramid for TTFB), OpenAI priority tier ("~20% inference
speedup"), asynchronous langchain chains so the question-
analysis stage waits on the longest agent (not the sum); early
stopping on T&S rejection cancels downstream work. Cost wins:
fine-tuned GPT-4.1-nano replacing frontier models for
analysis; Aho-Corasick + sliding-window snippet extraction
keyed on the generated keywords to shrink the user-prompt
context; biz-content cleanup on website/menu data (largest
contributors to prompt size); dynamic prompt composition via
semantic search over few-shot examples — a single static
monolithic prompt grew until "we're building a prompt assembly
system that includes only the instructions, examples, and
constraints relevant to the detected question type and content
sources", with retrieval over the few-shot-example corpus
deciding which examples ship per question. UX win: replacing
LLM-authored category-level suggested questions ("generic
breakfast options" for every French brunch place) with
business-content-derived suggestions ("Can you order
freshly baked bread to go?" for Parc, based on recent reviews +
business summary + owner description) — +~50% engagement,
-~26% inability-to-answer. Retrieval is keyword-first for
reviews/website/AtC (with LLM keyword expansion adding ~0.8 s
latency) and embedding+caption hybrid only for photos;
embeddings-based review retrieval is explicitly a followup
investment for "less keywordy" questions and to remove the
keyword-expansion blocking call. Chat frontend / server-driven
UI work is explicitly out of scope of this post (follow-up
post promised).
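The parallel question-analysis stage described above (four classifiers fired concurrently, stage latency bounded by the slowest agent, a Trust & Safety rejection cancelling the rest) can be sketched with plain asyncio; every function name and latency below is a hypothetical stand-in, not Yelp's actual code:

```python
import asyncio

# Hypothetical stand-ins for Yelp's four fine-tuned classifier calls;
# the sleeps simulate per-model inference latency.
async def trust_and_safety(q: str) -> str:
    await asyncio.sleep(0.02)
    return "SAFE"

async def inquiry_type(q: str) -> str:
    await asyncio.sleep(0.03)
    return "BUSINESS_QUESTION"

async def content_sources(q: str) -> list[str]:
    await asyncio.sleep(0.025)
    return ["reviews", "photos"]

async def keywords(q: str) -> list[str]:
    await asyncio.sleep(0.04)
    return ["vegan", "plant-based"]

async def analyze(question: str):
    # All four classifiers run concurrently, so stage latency is
    # max(agent latencies), not their sum.
    ts_task = asyncio.create_task(trust_and_safety(question))
    rest = [asyncio.create_task(c(question))
            for c in (inquiry_type, content_sources, keywords)]
    if await ts_task != "SAFE":
        # Early stopping: a T&S rejection cancels downstream work,
        # saving both latency and inference cost.
        for t in rest:
            t.cancel()
        return None
    itype, sources, kws = await asyncio.gather(*rest)
    return {"inquiry_type": itype, "sources": sources, "keywords": kws}

result = asyncio.run(analyze("vegan options?"))
```

The same shape generalises to any number of gate classifiers: only the gating call is awaited individually; everything else is gathered.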
Key takeaways¶
- One LLM-friendly content-fetching API is the abstraction
that unlocks multiple downstream LLM applications. Yelp's
single "all or selected sources" API returns data in an
LLM-friendly shape at <100 ms p95 — "an abstraction that
helps other business-centric LLM applications besides ours."
The EAV (business_id, field_name, field_group, value, update_ts) schema is deliberately non-normalised because the LLM downstream consumer "already expects unstructured strings" — schema-free consumption is the permission structure for the EAV choice. See systems/yelp-content-fetching-engine.
- Pre-retrieval classifiers are not optional for an internet-facing LLM product. Yelp's prototype (internal audience) "didn't assume unsafe questions, or people not knowing what our application cannot do." Production traffic immediately surfaced prompt-injection attempts, illegal-activity prompts, and out-of-scope asks ("Show me good plumbers", "How do I change my password?", "Is there god?"). Yelp built two fine-tuned GPT-4.1-nano classifiers — Trust & Safety and Inquiry Type — that run in parallel with content fetching; T&S rejection cancels downstream work (cost + latency saving). See patterns/parallel-pre-retrieval-classifier-pipeline.
- Splitting "pick sources + generate keywords" into two small models outperformed one combined pass. Yelp initially had a single LLM pick sources AND emit keywords; splitting them "lets us tune failure modes independently and debug decisions in isolation while optimizing for low latency." Canonical realisation: patterns/split-source-selection-from-keyword-generation.
- Keyword-first retrieval hit quality targets faster than standing up an online embedding stack — for 2026-03 traffic. Yelp explicitly positions keyword search as the v1 ship decision with embeddings as a followup, accepting two known trade-offs: (a) ~0.8 s added by the LLM keyword-expansion call in the request path; (b) "recall is weaker on non-specific user questions." Keywords are skipped entirely for generic prompts — Yelp fetches recent reviews/photos without term constraints in that branch. This is a cache-miss taxonomy sibling of the 2025-02-04 head/tail cascade. See concepts/retrieval-augmented-generation.
- Quality is a six-dimensional product spec graded by LLM judges against gold labels, with style-and-tone deliberately deferred. Yelp defines six dimensions — Correctness, Completeness/Helpfulness, Evidence Relevance, Conciseness, Structure, Tone — but only the first three are mechanised as Langfuse-based LLM-as-judge graders with labelled gold data. Style & Tone is judged "a lot harder than the more objective tasks" and deferred; hundreds of human-labelled answers seed each grader; daily batch runs on sampled production traffic produce a rolling-average time series. See concepts/llm-as-judge (extended) and patterns/multi-dimensional-quality-grading, conceptually anchored in this post.
- TTFT is a product metric, not an engineering metric. Yelp's original p75 was 10-20 s end-to-end; the external launch target was p75 5 s on the backend LLM service, and they shipped at <3 s. The single biggest win was streaming via FastAPI SSE — because "[LLM output] would mean the user would be waiting > 10 s in the worst case for a response." Decoupled wins: OpenAI priority tier ("~20% inference speedup"), async langchain so question-analysis waits on the longest agent (not the sum), early-stopping on T&S rejection. See concepts/time-to-first-token (extended).
- Context shrinkage + model-tier migration — not inference tricks — carried the cost optimisation. Yelp brought per-question cost to 25% of prototype. The three load-bearing levers: (a) fine-tuned GPT-4.1-nano replacing frontier models for question analysis (recursive application of Yelp's 2025-02-04 fine-tuning discipline); (b) Aho-Corasick + sliding-window snippet extraction driven by the keyword-generation output — patterns/aho-corasick-snippet-extraction shrinks the user-prompt context; (c) biz-content cleanup on website/menu text (largest prompt contributors). The prompt-size win was partly reversed mid-cycle by growing system prompts, resolved by the follow-on dynamic prompt composition system that "extracts information by semantically searching across our few-shot examples to construct system prompts only with the examples that are relevant."
- Suggested-question UX quality is a retrieval problem, not a prompt problem. Yelp's v1 suggestions were LLM-generated per business category (Mexican restaurants, bars, parks) — hitting "unanswerable with the available data" cases. v2 generates suggestions from the specific business's recent reviews + the business's pre-generated summary + the owner's description — content-grounded per-business rather than generic per-category. Outcome: +~50% engagement, -~26% inability-to-answer rate on suggested questions. See patterns/content-derived-suggested-questions.
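The context-shrinkage lever from the takeaways above can be illustrated with a minimal sketch. Yelp uses Aho-Corasick for the multi-pattern keyword matching; this illustration substitutes a naive scan for brevity and keeps only the sliding-window merge logic:

```python
def extract_snippets(text, keywords, window=80):
    """Keep only windows of text around keyword hits, shrinking the
    prompt context sent to the answer-generation model. A naive scan
    stands in for Aho-Corasick multi-pattern matching here."""
    lowered = text.lower()
    hits = []
    for kw in keywords:
        start = 0
        while (i := lowered.find(kw.lower(), start)) != -1:
            hits.append((i, i + len(kw)))
            start = i + 1
    if not hits:
        return []
    # Expand each hit into a window, merging overlapping windows so
    # adjacent hits share a single snippet.
    hits.sort()
    spans = []
    for s, e in hits:
        s, e = max(0, s - window), min(len(text), e + window)
        if spans and s <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], e))
        else:
            spans.append((s, e))
    return [text[s:e] for s, e in spans]

review = ("Came for brunch. The vegan options were plentiful and the "
          "staff was friendly. Parking nearby was tough on weekends.")
print(extract_snippets(review, ["vegan"], window=20))
```

With a real Aho-Corasick automaton the scan over all keywords is a single pass over the text, which matters when the keyword set is large.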
Extracted systems¶
- Yelp Biz Ask Anything (BAA) — the productionised business-page LLM Q&A system. Successor on business pages to the original Yelp Assistant (service-pro diagnosis).
- Yelp content-fetching engine — the single LLM-friendly content-access API (<100 ms p95) that returns "all or selected sources" for a business.
- Yelp Assistant (new stub) — the parent brand / UI; chatbot originally shipped in 2024 for service-pro diagnosis; the 2026-03-27 business-page evolution is the BAA variant.
- Langfuse (new stub) — the LLM observability + experiment management platform on which Yelp's daily batch LLM-as-judge graders run.
- FastAPI (new stub) — Python web framework with SSE support; Yelp migrated from Pyramid to FastAPI to unblock streaming-token responses.
- LangChain (new stub) — Python LLM-orchestration library used for the asynchronous-chain parallel invocation of the four question-analysis classifiers.
- GPT-4.1-nano (new stub) — OpenAI's small model used as the fine-tuning base for both Yelp's T&S and Inquiry Type classifiers.
- Apache Cassandra — the structured-facts EAV store; extended with a new Seen-in entry for schema-flexible LLM-feed storage.
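Since the post credits the FastAPI SSE migration as the biggest latency win, a dependency-free sketch of the SSE wire framing is useful; this shows the text/event-stream format itself, not Yelp's handler code:

```python
def sse_event(data: str, event: str = "") -> str:
    """Frame one Server-Sent Events message: optional 'event:' line,
    one 'data:' line per payload line, terminated by a blank line."""
    lines = [f"event: {event}"] if event else []
    lines.extend(f"data: {chunk}" for chunk in data.split("\n"))
    return "\n".join(lines) + "\n\n"

def stream_answer(token_iter):
    # Emit each model token as its own SSE event so the client renders
    # the answer incrementally; TTFT is bounded by the first token,
    # not the full answer.
    for tok in token_iter:
        yield sse_event(tok, event="token")
    yield sse_event("[DONE]", event="done")

frames = list(stream_answer(["Yes, ", "Parc ", "serves ", "brunch."]))
```

In FastAPI, a generator like stream_answer would typically be wrapped in a StreamingResponse with media_type="text/event-stream".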
Extracted concepts¶
- concepts/trust-and-safety-classifier — the pre-retrieval safety-gate pattern: fine-tuned small model classifying user questions into safety categories before any retrieval or generation occurs.
- concepts/inquiry-type-classifier — the pre-retrieval scope-gate pattern: small model deciding if the question is in-scope for the system's domain at all; graceful decline + user redirect on out-of-scope.
- concepts/eav-schema-for-llm-consumption — the Entity-Attribute-Value schema permission when the downstream consumer is an LLM that accepts unstructured strings; trades schema rigidity for ingestion-pipeline isolation.
- concepts/content-grounded-answer — the LLM-product discipline of confining answers to retrieved evidence, explicitly preventing scope drift and hallucinations; enforced at multiple layers (inquiry-type classifier, prompt constraints, evidence-relevance grader).
- concepts/llm-as-judge — extended with Yelp's multi-dimensional grader fleet (Correctness + Completeness + Evidence Relevance) + Langfuse operational substrate + daily batch rolling-average pattern.
- concepts/time-to-first-token — extended with Yelp's product-latency framing (p75 5 s target, <3 s achieved) and the SSE migration win.
- concepts/retrieval-augmented-generation — extended with Yelp's keyword-first RAG variant (vs. embedding-first) and the no-keyword fallback for generic prompts.
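The EAV shape from concepts/eav-schema-for-llm-consumption can be made concrete with a small sketch; the rows and business id below are invented for illustration, and the read path simply flattens rows into strings because the consuming LLM tolerates unstructured input:

```python
from datetime import datetime, timezone

# Invented rows in the shape of Yelp's Cassandra EAV table:
# (business_id, field_name, field_group, value, update_ts).
rows = [
    ("parc-philly", "accepts_reservations", "amenities", "yes",
     datetime(2026, 3, 1, tzinfo=timezone.utc)),
    ("parc-philly", "outdoor_seating", "amenities", "yes",
     datetime(2026, 3, 2, tzinfo=timezone.utc)),
    ("parc-philly", "price_range", "basic", "$$$",
     datetime(2026, 2, 15, tzinfo=timezone.utc)),
]

def facts_for_prompt(rows, business_id):
    """Flatten EAV rows into plain strings on the read path: the
    downstream LLM already expects unstructured text, so no
    normalised per-vertical schema is needed."""
    return [f"{name.replace('_', ' ')}: {value}"
            for biz, name, group, value, ts in rows
            if biz == business_id]

print(facts_for_prompt(rows, "parc-philly"))
```

New fact types become new field_name values, so ingestion pipelines can add attributes without schema migrations.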
Extracted patterns¶
- patterns/parallel-pre-retrieval-classifier-pipeline — T&S + Inquiry Type + Content Source Selection + Keyword Generation all running in parallel via async langchain chains; the pipeline waits on the longest agent, not the sum; T&S rejection cancels downstream work.
- patterns/split-source-selection-from-keyword-generation — content-source choice and search-term generation deliberately separated into two small models instead of one combined model: "lets us tune failure modes independently and debug decisions in isolation."
- patterns/dynamic-prompt-composition-via-semantic-retrieval — replace the static monolithic system prompt with a per-request-assembled prompt where few-shot examples and instructions are retrieved via semantic search over a library, keyed on detected question type and content sources.
- patterns/aho-corasick-snippet-extraction — extract relevant snippets from long review/website/AtC text using Aho-Corasick multi-pattern matching + sliding window keyed on the LLM-generated keywords, shrinking the user-prompt context before it hits the answer-generation model.
- patterns/content-derived-suggested-questions — suggested-question generation from the specific business's own content (recent reviews + pre-generated summary + owner description) rather than LLM-generated generic suggestions per business category; measured engagement + answerability improvement.
- patterns/head-cache-plus-tail-finetuned-model — extended with a second Yelp instance: BAA's classifier fleet (T&S + Inquiry Type + source-selection + keyword-gen) as a recursive application of Yelp's 2025-02-04 fine-tuned-small-model discipline at a different altitude (analysis, not caching).
- patterns/three-phase-llm-productionization — extended with Yelp's BAA application: the 2026-03-27 post explicitly frames BAA as a nine-month prototype → production arc where the prototype's expensive-model + ad-hoc-data shape was systematically replaced by the production shape (parallel classifiers + fine-tuned small models + NRT data + streaming + graders).
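A minimal sketch of patterns/dynamic-prompt-composition-via-semantic-retrieval, with bag-of-words overlap standing in for the semantic search Yelp describes; the example library and scoring are invented for illustration:

```python
def overlap_score(a: str, b: str) -> float:
    """Jaccard word overlap: a dependency-free stand-in for the
    embedding similarity a production system would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Invented few-shot example library, one entry per question type.
EXAMPLES = [
    {"q": "what are the vegan options",
     "a": "Reviewers mention a vegan menu."},
    {"q": "is this place good for kids",
     "a": "Several reviews describe family visits."},
    {"q": "can I book a table for eight",
     "a": "Large-party bookings are accepted."},
]

def compose_prompt(question: str, base: str, k: int = 1) -> str:
    # Retrieve only the most relevant few-shot examples per request
    # instead of shipping one static monolithic system prompt.
    ranked = sorted(EXAMPLES,
                    key=lambda e: overlap_score(question, e["q"]),
                    reverse=True)
    shots = "\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in ranked[:k])
    return f"{base}\n\n{shots}\n\nQ: {question}\nA:"

prompt = compose_prompt("do they have vegan options",
                        "Answer only from the provided evidence.")
```

The same retrieval step can select instruction fragments and constraints as well as examples, keyed on the detected question type and content sources.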
Operational numbers¶
- Data-layer freshness:
- Reviews & photo captions: < 10 min (streamed from SoT databases)
- Business structured information: < 10 min (streamed)
- Websites, menus, Ask the Community business-page feature: weekly batches
- Content-fetching API: < 100 ms p95, returning "all or selected sources" for a business
- Question-analysis training data:
- Trust & Safety classifier: ~few-thousand question-label pairs; ~50% legitimate questions mixed in as negatives; seed set hand-written then LLM-expanded for variation
- Inquiry Type classifier: ~7K samples; realistic seeds from Ask-the-Community + early-traffic failures + synthetic paraphrases
- Answer-latency transformation:
- Prototype: p75 = 10-20 s end-to-end
- External launch target: p75 = 5 s on the backend LLM service
- Shipped: p75 < 3 s
- OpenAI priority-tier impact: ~20% inference speedup
- Keyword-expansion latency cost: ~0.8 s added to the request path by the LLM expansion call
- Cost reduction: cost brought down to 25% of prototype
- Quality graders: labels = 3-class per grader; hundreds of manually-labelled gold answers per criterion used to tune LLM-as-judge prompts
- Suggested-question UX metrics (after switching from category-level LLM suggestions to business-content-derived suggestions):
- +~50% engagement with suggested questions
- -~26% inability-to-answer rate on suggested questions
Architecture diagrams (from the post)¶
- Figure 1 — BAA screenshots across three question complexities (5-things, good-for-kids, 3-course-vegetarian-dinner).
- Figure 2 — BAA Overview: life of a question (conversation context → analysis → retrieval → answer generation → augmentation → delivery).
- Figure 3 — Data ingestion & indexing (streams + batches → NRT indices + Cassandra EAV → content-fetching engine).
- Figure 4 — Question Analysis (four components in parallel — T&S, Inquiry Type, Source Selection, Keyword Generation — with T&S/Inquiry-Type rejection cancelling downstream work).
- Figure 5 — BAA Quality Assessment (Langfuse-based grader extracting logs from production Q/A → daily batch → dataset on Langfuse).
- Figure 6 — Suggested Questions UX (before/after comparison of category-level vs content-derived).
Caveats¶
- "~20%" (priority tier), "~50%" (engagement), "~26%" (inability-to-answer), "95%" (review-highlight batch coverage) — Yelp writes these as approximations, consistent with the post's qualitative operational framing.
- Latency distribution beyond p75 not disclosed — no p95 / p99 numbers; the post claims "significantly surpassed" the 5 s p75 target with <3 s but says nothing about the tail.
- Cassandra read-path latency inside the <100 ms p95 is not broken out; we know only the envelope number.
- Streaming migration cost — migrating from Pyramid to FastAPI is named but not quantified (size / duration of the migration).
- Daily grader batch cadence — the judging is daily batch on sampled traffic, so regressions have up-to-one-day detection latency; Yelp does not disclose the alert threshold or sampling rate.
- Style & Tone grader deferred — no production automated enforcement on brand voice; spot-checking + human review only. Yelp flags this as a planned investment.
- Embeddings-for-reviews is a followup — the keyword-first RAG shape ships today with the known ~0.8 s expansion-call latency and weaker-on-non-specific-questions recall; Yelp explicitly names embeddings-based retrieval as the next investment.
- Chat UX & server-driven UI explicitly out of scope — a follow-up post is promised; BAA's frontend is authored by a separate team.
- Anthropic-style JSONL tool-output / function-call interop not disclosed — the post is OpenAI-centric (GPT-4.1-nano fine-tuned base, OpenAI priority tier, OpenAI batch API); no mention of multi-provider model routing.
- Cost breakdown is a single "25% of prototype" headline, no per-stage contribution figures.
Source¶
- Original: https://engineeringblog.yelp.com/2026/03/building-baa-from-prototype-to-product.html
- Raw markdown:
raw/yelp/2026-03-27-building-biz-ask-anything-from-prototype-to-product-532d533d.md
Related¶
- companies/yelp — 8th on-scope Yelp ingest; opens the LLM-Q&A-over-retrieved-content axis, distinct from the 2025-02-04 LLM-for-query-understanding axis (BAA answers questions about a single business from its own content; QU rewrites the query for search retrieval).
- sources/2025-02-04-yelp-search-query-understanding-with-llms — sibling Yelp LLM-infrastructure post. BAA re-uses the same fine-tune-small-model-on-curated-teacher-data discipline at the analysis-classifier altitude; shares the patterns/three-phase-llm-productionization playbook.
- systems/yelp-biz-ask-anything — the system.
- systems/yelp-content-fetching-engine — the reusable data-access API.
- patterns/parallel-pre-retrieval-classifier-pipeline — the question-analysis pattern.
- patterns/dynamic-prompt-composition-via-semantic-retrieval — the prompt-assembly pattern.
- patterns/aho-corasick-snippet-extraction — the context-shrinkage pattern.
- patterns/content-derived-suggested-questions — the UX pattern.
- concepts/trust-and-safety-classifier — the safety gate.
- concepts/inquiry-type-classifier — the scope gate.
- concepts/content-grounded-answer — the product discipline.
- concepts/eav-schema-for-llm-consumption — the structured-facts schema choice.
- concepts/llm-as-judge — the quality-grading substrate.
- concepts/time-to-first-token — the streaming-migration latency metric.
- concepts/retrieval-augmented-generation — the retrieval-augmented shape BAA inherits.