Yelp — Building Biz Ask Anything: From Prototype to Product¶
Summary¶
Yelp Engineering post (2026-03-27) — the canonical first-party
disclosure of how Yelp productionised Biz Ask Anything (BAA):
a business-page LLM Q&A experience that retrieves evidence from
Yelp's business content (reviews, photos, menus, structured
facts, Ask the Community threads) and synthesises an
evidence-backed, citation-linked answer. BAA is the
business-page evolution of Yelp
Assistant (originally shipped in 2024 for service-pro
diagnosis) and represents a nine-month prototype→production
arc. The post organises the retrospective around six
challenges — data, question analysis, answer quality,
performance, cost, and user education — and discloses load-bearing
architectural decisions at each. Prototype-to-production
improvements reported: p75 latency 10-20s → <3s; cost 25%
of the prototype after fine-tuned small models + context
shrinkage; suggested-question engagement +~50% and
unable-to-answer rate -~26% after switching from
category-level to business-content-derived suggestions. Pipeline
shape: user question → (in parallel) Trust & Safety classifier
+ Inquiry Type classifier + Content Source Selection +
Keyword Generation →
content-fetching engine (reviews/photos/website NRT indices +
Cassandra EAV store) at <100 ms p95 → prompt composition →
streaming answer generation → answer augmentation with photos →
SSE delivery. Load-bearing data-layer decision: three
near-real-time indices (reviews + photos-with-embeddings +
website/menu/Ask-the-Community) at <10 min freshness for
streamed sources; weekly batches for slower-changing
websites/menus/AtC; Cassandra EAV (business_id, field_name,
field_group, value, update_ts) for structured facts "since
the data is consumed by an LLM that already expects unstructured
strings". Replayable streams (concepts/stream-replayability-for-iterative-pipelines
applied downstream) because "some datasets are derived from
chains of 3-4 streams, so replayability and idempotent upserts
are required." Four question-analysis components all running in
parallel async langchain: Trust & Safety (fine-tuned
GPT-4.1-nano on ~few-thousand question-label pairs, 50% negative
+ 50% synthetically-generated positives), Inquiry Type (fine-
tuned GPT-4.1-nano on ~7K samples), Content Source Selection
(split from Keyword Generation to "tune failure modes
independently and debug decisions in isolation"), Keyword
Generation (emits vertical-specific keyword expansions — "vegan
options" in a hair salon vs. Mexican restaurant produces
orthogonal expansions; returns no keywords for generic
prompts like "what should I know about this place?" so
recent-reviews without term constraints are fetched). Quality
is a multi-dimensional product spec — Correctness/Faithfulness
(CORRECT/UNVERIFIABLE/INCORRECT), Completeness (SUFFICIENT/
NEEDS_FOLLOW_UP/REFUSE_UNABLE/OFF_TOPIC), Evidence Relevance
(RELEVANT/PARTIALLY_RELEVANT/NOT_RELEVANT) — each scored by a
Langfuse-based LLM-as-judge grader in daily batch, with
running averages to catch regressions; Style & Tone deferred
as "a lot harder than the more objective tasks of correctness,
answer completeness, and evidence relevance evaluations".
Performance wins: streaming via FastAPI SSE (migrated from
Pyramid for TTFB), OpenAI priority tier ("~20% inference
speedup"), asynchronous langchain chains so the question-
analysis stage waits on the longest agent (not the sum); early
stopping on T&S rejection cancels downstream work. Cost wins:
fine-tuned GPT-4.1-nano replacing frontier models for
analysis; Aho-Corasick + sliding-window snippet extraction
keyed on the generated keywords to shrink the user-prompt
context; biz-content cleanup on website/menu data (largest
contributors to prompt size); dynamic prompt composition via
semantic search over few-shot examples — a single static
monolithic prompt grew until "we're building a prompt assembly
system that includes only the instructions, examples, and
constraints relevant to the detected question type and content
sources", with retrieval over the few-shot-example corpus
deciding which examples ship per question. UX win: replacing
LLM-authored category-level suggested questions ("generic
breakfast options" for every French brunch place) with
business-content-derived suggestions ("Can you order
freshly baked bread to go?" for Parc, based on recent reviews +
business summary + owner description) — +~50% engagement,
-~26% inability-to-answer. Retrieval is keyword-first for
reviews/website/AtC (with LLM keyword expansion adding ~0.8 s
latency) and embedding+caption hybrid only for photos;
embeddings-based review retrieval is explicitly a followup
investment for "less keywordy" questions and to remove the
keyword-expansion blocking call. Chat frontend / server-driven
UI work is explicitly out of scope of this post (follow-up
post promised).
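The parallel question-analysis stage described above (four classifiers fired concurrently, stage latency bounded by the slowest agent, a Trust & Safety rejection cancelling the rest) can be sketched with plain asyncio; every function name and latency below is a hypothetical stand-in, not Yelp's actual code:

```python
import asyncio

# Hypothetical stand-ins for Yelp's four fine-tuned classifier calls;
# the sleeps simulate per-model inference latency.
async def trust_and_safety(q: str) -> str:
    await asyncio.sleep(0.02)
    return "SAFE"

async def inquiry_type(q: str) -> str:
    await asyncio.sleep(0.03)
    return "BUSINESS_QUESTION"

async def content_sources(q: str) -> list[str]:
    await asyncio.sleep(0.025)
    return ["reviews", "photos"]

async def keywords(q: str) -> list[str]:
    await asyncio.sleep(0.04)
    return ["vegan", "plant-based"]

async def analyze(question: str):
    # All four classifiers run concurrently, so stage latency is
    # max(agent latencies), not their sum.
    ts_task = asyncio.create_task(trust_and_safety(question))
    rest = [asyncio.create_task(c(question))
            for c in (inquiry_type, content_sources, keywords)]
    if await ts_task != "SAFE":
        # Early stopping: a T&S rejection cancels downstream work,
        # saving both latency and inference cost.
        for t in rest:
            t.cancel()
        return None
    itype, sources, kws = await asyncio.gather(*rest)
    return {"inquiry_type": itype, "sources": sources, "keywords": kws}

result = asyncio.run(analyze("vegan options?"))
```

The same shape generalises to any number of gate classifiers: only the gating call is awaited individually; everything else is gathered.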
Key takeaways¶
- One LLM-friendly content-fetching API is the abstraction
that unlocks multiple downstream LLM applications. Yelp's
single "all or selected sources" API returns data in an
LLM-friendly shape at <100 ms p95 — "an abstraction that
helps other business-centric LLM applications besides ours."
The EAV (business_id, field_name, field_group, value, update_ts) schema is deliberately non-normalised because the LLM downstream consumer "already expects unstructured strings" — schema-free consumption is the permission structure for the EAV choice. See systems/yelp-content-fetching-engine.
- Pre-retrieval classifiers are not optional for an internet-facing LLM product. Yelp's prototype (internal audience) "didn't assume unsafe questions, or people not knowing what our application cannot do." Production traffic immediately surfaced prompt-injection attempts, illegal-activity prompts, and out-of-scope asks ("Show me good plumbers", "How do I change my password?", "Is there god?"). Yelp built two fine-tuned GPT-4.1-nano classifiers — Trust & Safety and Inquiry Type — that run in parallel with content fetching; T&S rejection cancels downstream work (cost + latency saving). See patterns/parallel-pre-retrieval-classifier-pipeline.
- Splitting "pick sources + generate keywords" into two small models outperformed one combined pass. Yelp initially had a single LLM pick sources AND emit keywords; splitting them "lets us tune failure modes independently and debug decisions in isolation while optimizing for low latency." Canonical realisation: patterns/split-source-selection-from-keyword-generation.
- Keyword-first retrieval hit quality targets faster than standing up an online embedding stack — for 2026-03 traffic. Yelp explicitly positions keyword search as the v1 ship decision with embeddings as a followup, accepting two known trade-offs: (a) ~0.8 s added by the LLM keyword-expansion call in the request path; (b) "recall is weaker on non-specific user questions." Keywords are skipped entirely for generic prompts — Yelp fetches recent reviews/photos without term constraints in that branch. This is a cache-miss taxonomy sibling of the 2025-02-04 head/tail cascade. See concepts/retrieval-augmented-generation.
- Quality is a six-dimensional product spec graded by LLM judges against gold labels, with style-and-tone deliberately deferred. Yelp defines six dimensions — Correctness, Completeness/Helpfulness, Evidence Relevance, Conciseness, Structure, Tone — but only the first three are mechanised as Langfuse-based LLM-as-judge graders with labelled gold data. Style & Tone is judged "a lot harder than the more objective tasks" and deferred; hundreds of human-labelled answers seed each grader; daily batch runs on sampled production traffic produce a rolling-average time series. See concepts/llm-as-judge (extended) and patterns/multi-dimensional-quality-grading, conceptually anchored in this post.
- TTFT is a product metric, not an engineering metric. Yelp's original p75 was 10-20 s end-to-end; the external launch target was p75 5 s on the backend LLM service, and they shipped at <3 s. The single biggest win was streaming via FastAPI SSE — because "[LLM output] would mean the user would be waiting > 10 s in the worst case for a response." Decoupled wins: OpenAI priority tier ("~20% inference speedup"), async langchain so question-analysis waits on the longest agent (not the sum), early-stopping on T&S rejection. See concepts/time-to-first-token (extended).
- Context shrinkage + model-tier migration — not inference tricks — carried the cost optimisation. Yelp brought per-question cost to 25% of prototype. The three load-bearing levers: (a) fine-tuned GPT-4.1-nano replacing frontier models for question analysis (recursive application of Yelp's 2025-02-04 fine-tuning discipline); (b) Aho-Corasick + sliding-window snippet extraction driven by the keyword-generation output — patterns/aho-corasick-snippet-extraction shrinks the user-prompt context; (c) biz-content cleanup on website/menu text (largest prompt contributors). The prompt-size win was partly reversed mid-cycle by growing system prompts, resolved by the follow-on dynamic prompt composition system that "extracts information by semantically searching across our few-shot examples to construct system prompts only with the examples that are relevant."
- Suggested-question UX quality is a retrieval problem, not a prompt problem. Yelp's v1 suggestions were LLM-generated per business category (Mexican restaurants, bars, parks) — hitting "unanswerable with the available data" cases. v2 generates suggestions from the specific business's recent reviews + the business's pre-generated summary + the owner's description — content-grounded per-business rather than generic per-category. Outcome: +~50% engagement, -~26% inability-to-answer rate on suggested questions. See patterns/content-derived-suggested-questions.
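The context-shrinkage lever from the takeaways above can be illustrated with a minimal sketch. Yelp uses Aho-Corasick for the multi-pattern keyword matching; this illustration substitutes a naive scan for brevity and keeps only the sliding-window merge logic:

```python
def extract_snippets(text, keywords, window=80):
    """Keep only windows of text around keyword hits, shrinking the
    prompt context sent to the answer-generation model. A naive scan
    stands in for Aho-Corasick multi-pattern matching here."""
    lowered = text.lower()
    hits = []
    for kw in keywords:
        start = 0
        while (i := lowered.find(kw.lower(), start)) != -1:
            hits.append((i, i + len(kw)))
            start = i + 1
    if not hits:
        return []
    # Expand each hit into a window, merging overlapping windows so
    # adjacent hits share a single snippet.
    hits.sort()
    spans = []
    for s, e in hits:
        s, e = max(0, s - window), min(len(text), e + window)
        if spans and s <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], e))
        else:
            spans.append((s, e))
    return [text[s:e] for s, e in spans]

review = ("Came for brunch. The vegan options were plentiful and the "
          "staff was friendly. Parking nearby was tough on weekends.")
print(extract_snippets(review, ["vegan"], window=20))
```

With a real Aho-Corasick automaton the scan over all keywords is a single pass over the text, which matters when the keyword set is large.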
Extracted systems¶
- Yelp Biz Ask Anything (BAA) — the productionised business-page LLM Q&A system. Successor on business pages to the original Yelp Assistant (service-pro diagnosis).
- Yelp content-fetching engine — the single LLM-friendly content-access API (<100 ms p95) that returns "all or selected sources" for a business.
- Yelp Assistant (new stub) — the parent brand / UI; chatbot originally shipped in 2024 for service-pro diagnosis; the 2026-03-27 business-page evolution is the BAA variant.
- Langfuse (new stub) — the LLM observability + experiment management platform on which Yelp's daily batch LLM-as-judge graders run.
- FastAPI (new stub) — Python web framework with SSE support; Yelp migrated from Pyramid to FastAPI to unblock streaming-token responses.
- LangChain (new stub) — Python LLM-orchestration library used for the asynchronous-chain parallel invocation of the four question-analysis classifiers.
- GPT-4.1-nano (new stub) — OpenAI's small model used as the fine-tuning base for both Yelp's T&S and Inquiry Type classifiers.
- Apache Cassandra — the structured-facts EAV store; extended with a new Seen-in entry for schema-flexible LLM-feed storage.
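Since the post credits the FastAPI SSE migration as the biggest latency win, a dependency-free sketch of the SSE wire framing is useful; this shows the text/event-stream format itself, not Yelp's handler code:

```python
def sse_event(data: str, event: str = "") -> str:
    """Frame one Server-Sent Events message: optional 'event:' line,
    one 'data:' line per payload line, terminated by a blank line."""
    lines = [f"event: {event}"] if event else []
    lines.extend(f"data: {chunk}" for chunk in data.split("\n"))
    return "\n".join(lines) + "\n\n"

def stream_answer(token_iter):
    # Emit each model token as its own SSE event so the client renders
    # the answer incrementally; TTFT is bounded by the first token,
    # not the full answer.
    for tok in token_iter:
        yield sse_event(tok, event="token")
    yield sse_event("[DONE]", event="done")

frames = list(stream_answer(["Yes, ", "Parc ", "serves ", "brunch."]))
```

In FastAPI, a generator like stream_answer would typically be wrapped in a StreamingResponse with media_type="text/event-stream".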
Extracted concepts¶
- concepts/trust-and-safety-classifier — the pre-retrieval safety-gate pattern: fine-tuned small model classifying user questions into safety categories before any retrieval or generation occurs.
- concepts/inquiry-type-classifier — the pre-retrieval scope-gate pattern: small model deciding if the question is in-scope for the system's domain at all; graceful decline + user redirect on out-of-scope.
- concepts/eav-schema-for-llm-consumption — the Entity-Attribute-Value schema permission when the downstream consumer is an LLM that accepts unstructured strings; trades schema rigidity for ingestion-pipeline isolation.
- concepts/content-grounded-answer — the LLM-product discipline of confining answers to retrieved evidence, explicitly preventing scope drift and hallucinations; enforced at multiple layers (inquiry-type classifier, prompt constraints, evidence-relevance grader).
- concepts/llm-as-judge — extended with Yelp's multi-dimensional grader fleet (Correctness + Completeness + Evidence Relevance) + Langfuse operational substrate + daily batch rolling-average pattern.
- concepts/time-to-first-token — extended with Yelp's product-latency framing (p75 5 s target, <3 s achieved) and the SSE migration win.
- concepts/retrieval-augmented-generation — extended with Yelp's keyword-first RAG variant (vs. embedding-first) and the no-keyword fallback for generic prompts.
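The EAV shape from concepts/eav-schema-for-llm-consumption can be made concrete with a small sketch; the rows and business id below are invented for illustration, and the read path simply flattens rows into strings because the consuming LLM tolerates unstructured input:

```python
from datetime import datetime, timezone

# Invented rows in the shape of Yelp's Cassandra EAV table:
# (business_id, field_name, field_group, value, update_ts).
rows = [
    ("parc-philly", "accepts_reservations", "amenities", "yes",
     datetime(2026, 3, 1, tzinfo=timezone.utc)),
    ("parc-philly", "outdoor_seating", "amenities", "yes",
     datetime(2026, 3, 2, tzinfo=timezone.utc)),
    ("parc-philly", "price_range", "basic", "$$$",
     datetime(2026, 2, 15, tzinfo=timezone.utc)),
]

def facts_for_prompt(rows, business_id):
    """Flatten EAV rows into plain strings on the read path: the
    downstream LLM already expects unstructured text, so no
    normalised per-vertical schema is needed."""
    return [f"{name.replace('_', ' ')}: {value}"
            for biz, name, group, value, ts in rows
            if biz == business_id]

print(facts_for_prompt(rows, "parc-philly"))
```

New fact types become new field_name values, so ingestion pipelines can add attributes without schema migrations.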
Extracted patterns¶
- patterns/parallel-pre-retrieval-classifier-pipeline — T&S + Inquiry Type + Content Source Selection + Keyword Generation all running in parallel via async langchain chains; the pipeline waits on the longest agent, not the sum; T&S rejection cancels downstream work.
- patterns/split-source-selection-from-keyword-generation — content-source choice and search-term generation deliberately separated into two small models instead of one combined model: "lets us tune failure modes independently and debug decisions in isolation."
- patterns/dynamic-prompt-composition-via-semantic-retrieval — replace the static monolithic system prompt with a per-request-assembled prompt where few-shot examples and instructions are retrieved via semantic search over a library, keyed on detected question type and content sources.
- patterns/aho-corasick-snippet-extraction — extract relevant snippets from long review/website/AtC text using Aho-Corasick multi-pattern matching + sliding window keyed on the LLM-generated keywords, shrinking the user-prompt context before it hits the answer-generation model.
- patterns/content-derived-suggested-questions — suggested-question generation from the specific business's own content (recent reviews + pre-generated summary + owner description) rather than LLM-generated generic suggestions per business category; measured engagement + answerability improvement.
- patterns/head-cache-plus-tail-finetuned-model — extended with a second Yelp instance: BAA's classifier fleet (T&S + Inquiry Type + source-selection + keyword-gen) as a recursive application of Yelp's 2025-02-04 fine-tuned-small-model discipline at a different altitude (analysis, not caching).
- patterns/three-phase-llm-productionization — extended with Yelp's BAA application: the 2026-03-27 post explicitly frames BAA as a nine-month prototype → production arc where the prototype's expensive-model + ad-hoc-data shape was systematically replaced by the production shape (parallel classifiers + fine-tuned small models + NRT data + streaming + graders).
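A minimal sketch of patterns/dynamic-prompt-composition-via-semantic-retrieval, with bag-of-words overlap standing in for the semantic search Yelp describes; the example library and scoring are invented for illustration:

```python
def overlap_score(a: str, b: str) -> float:
    """Jaccard word overlap: a dependency-free stand-in for the
    embedding similarity a production system would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Invented few-shot example library, one entry per question type.
EXAMPLES = [
    {"q": "what are the vegan options",
     "a": "Reviewers mention a vegan menu."},
    {"q": "is this place good for kids",
     "a": "Several reviews describe family visits."},
    {"q": "can I book a table for eight",
     "a": "Large-party bookings are accepted."},
]

def compose_prompt(question: str, base: str, k: int = 1) -> str:
    # Retrieve only the most relevant few-shot examples per request
    # instead of shipping one static monolithic system prompt.
    ranked = sorted(EXAMPLES,
                    key=lambda e: overlap_score(question, e["q"]),
                    reverse=True)
    shots = "\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in ranked[:k])
    return f"{base}\n\n{shots}\n\nQ: {question}\nA:"

prompt = compose_prompt("do they have vegan options",
                        "Answer only from the provided evidence.")
```

The same retrieval step can select instruction fragments and constraints as well as examples, keyed on the detected question type and content sources.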
Operational numbers¶
- Data-layer freshness:
- Reviews & photo captions: < 10 min (streamed from SoT databases)
- Business structured information: < 10 min (streamed)
- Websites, menus, Ask the Community business-page feature: weekly batches
- Content-fetching API: < 100 ms p95, returning "all or selected sources" for a business
- Question-analysis training data:
- Trust & Safety classifier: ~few-thousand question-label pairs; ~50% legitimate questions mixed in as negatives; seed set hand-written then LLM-expanded for variation
- Inquiry Type classifier: ~7K samples; realistic seeds from Ask-the-Community + early-traffic failures + synthetic paraphrases
- Answer-latency transformation:
- Prototype: p75 = 10-20 s end-to-end
- External launch target: p75 = 5 s on the backend LLM service
- Shipped: p75 < 3 s
- OpenAI priority-tier impact: ~20% inference speedup
- Keyword-expansion latency cost: ~0.8 s added to the request path by the LLM expansion call
- Cost reduction: cost brought down to 25% of prototype
- Quality graders: labels = 3-class per grader; hundreds of manually-labelled gold answers per criterion used to tune LLM-as-judge prompts
- Suggested-question UX metrics (after switching from category-level LLM suggestions to business-content-derived suggestions):
- +~50% engagement with suggested questions
- -~26% inability-to-answer rate on suggested questions
Architecture diagrams (from the post)¶
- Figure 1 — BAA screenshots across three question complexities (5-things, good-for-kids, 3-course-vegetarian-dinner).
- Figure 2 — BAA Overview: life of a question (conversation context → analysis → retrieval → answer generation → augmentation → delivery).
- Figure 3 — Data ingestion & indexing (streams + batches → NRT indices + Cassandra EAV → content-fetching engine).
- Figure 4 — Question Analysis (four components in parallel — T&S, Inquiry Type, Source Selection, Keyword Generation — with T&S/Inquiry-Type rejection cancelling downstream work).
- Figure 5 — BAA Quality Assessment (Langfuse-based grader extracting logs from production Q/A → daily batch → dataset on Langfuse).
- Figure 6 — Suggested Questions UX (before/after comparison of category-level vs content-derived).
Caveats¶
- "~20%" (priority tier), "~50%" (engagement), "~26%" (inability-to-answer), "95%" (review-highlight batch coverage) — Yelp writes these as approximations, consistent with the post's qualitative operational framing.
- Latency distribution beyond p75 not disclosed — no p95 / p99 numbers; the post claims "significantly surpassed" the 5 s p75 target with <3 s but says nothing about the tail.
- Cassandra read-path latency inside the <100 ms p95 is not broken out; we know only the envelope number.
- Streaming migration cost — migrating from Pyramid to FastAPI is named but not quantified (size / duration of the migration).
- Daily grader batch cadence — the judging is daily batch on sampled traffic, so regressions have up-to-one-day detection latency; Yelp does not disclose the alert threshold or sampling rate.
- Style & Tone grader deferred — no production automated enforcement on brand voice; spot-checking + human review only. Yelp flags this as a planned investment.
- Embeddings-for-reviews is a followup — the keyword-first RAG shape ships today with the known ~0.8 s expansion-call latency and weaker-on-non-specific-questions recall; Yelp explicitly names embeddings-based retrieval as the next investment.
- Chat UX & server-driven UI explicitly out of scope — a follow-up post is promised; BAA's frontend is authored by a separate team.
- Anthropic-style JSONL tool-output / function-call interop not disclosed — the post is OpenAI-centric (GPT-4.1-nano fine-tuned base, OpenAI priority tier, OpenAI batch API); no mention of multi-provider model routing.
- Cost breakdown is a single "25% of prototype" headline, no per-stage contribution figures.
Source¶
- Original: https://engineeringblog.yelp.com/2026/03/building-baa-from-prototype-to-product.html
- Raw markdown:
raw/yelp/2026-03-27-building-biz-ask-anything-from-prototype-to-product-532d533d.md
Related¶
- companies/yelp — 8th on-scope Yelp ingest; opens the LLM-Q&A-over-retrieved-content axis, distinct from the 2025-02-04 LLM-for-query-understanding axis (BAA answers questions about a single business from its own content; QU rewrites the query for search retrieval).
- sources/2025-02-04-yelp-search-query-understanding-with-llms — sibling Yelp LLM-infrastructure post. BAA re-uses the same fine-tune-small-model-on-curated-teacher-data discipline at the analysis-classifier altitude; shares the patterns/three-phase-llm-productionization playbook.
- systems/yelp-biz-ask-anything — the system.
- systems/yelp-content-fetching-engine — the reusable data-access API.
- patterns/parallel-pre-retrieval-classifier-pipeline — the question-analysis pattern.
- patterns/dynamic-prompt-composition-via-semantic-retrieval — the prompt-assembly pattern.
- patterns/aho-corasick-snippet-extraction — the context-shrinkage pattern.
- patterns/content-derived-suggested-questions — the UX pattern.
- concepts/trust-and-safety-classifier — the safety gate.
- concepts/inquiry-type-classifier — the scope gate.
- concepts/content-grounded-answer — the product discipline.
- concepts/eav-schema-for-llm-consumption — the structured-facts schema choice.
- concepts/llm-as-judge — the quality-grading substrate.
- concepts/time-to-first-token — the streaming-migration latency metric.
- concepts/retrieval-augmented-generation — the retrieval-augmented shape BAA inherits.