Yelp Biz Ask Anything (BAA)

Definition

Yelp Biz Ask Anything (BAA) is Yelp's production LLM-powered business-page question-answering system — the business-page evolution of Yelp Assistant (which originally shipped in 2024 for service-pro diagnosis). On a business page, users ask free-form questions like "Is the patio heated?", "Is this place good for kids?", or "Recommend a three-course dinner with two mains for two people, where one is vegetarian", and BAA returns a single, evidence-backed, citation-linked answer streamed token-by-token to the UI. The system is canonicalised by the 2026-03-27 Yelp Engineering post "Building Biz Ask Anything: From Prototype to Product" (sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product).

BAA is the second LLM-production system Yelp has publicly disclosed, after Yelp Query Understanding (2025-02-04). The two systems operate at complementary altitudes:

System                     Altitude                                                               LLM role
Yelp Query Understanding   Rewrites the user's query before search retrieval                      Query → tagged segments + expansions
Yelp BAA                   Answers a user's question from a single business's retrieved content   Retrieved content + question → answer

BAA does not replace Yelp's long-form reviews — the product position is "support both behaviors, using AI to surface direct answers while preserving the value of in-depth reviews."

Life of a question (end-to-end flow)

     user question
┌──────────────────────────────────────────────────────┐
│         Conversation Context assembly                 │
│  (recent chat history for this user+business pair)    │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Question Analysis — 4 classifiers in parallel     │
│                                                      │
│   ┌──────────────┐ ┌──────────────┐                 │
│   │ Trust &      │ │ Inquiry Type │                 │
│   │ Safety       │ │ classifier   │                 │
│   │ (ft nano)    │ │ (ft nano)    │                 │
│   └──────┬───────┘ └──────┬───────┘                 │
│          │ unsafe          │ out-of-scope            │
│          └────┬─────────┬──┘                        │
│               ▼         ▼                           │
│         cancel + return templated / redirect        │
│                                                     │
│   ┌──────────────┐ ┌──────────────┐                │
│   │ Content      │ │ Keyword      │                │
│   │ Source       │ │ Generation   │                │
│   │ Selection    │ │ (vertical-   │                │
│   │ (ft model)   │ │  specific    │                │
│   │              │ │  expansions) │                │
│   └──────┬───────┘ └──────┬───────┘                │
│          └────────┬───────┘                        │
└─────────────────┬─┴──────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│     Retrieval via content-fetching engine            │
│   (NRT reviews + NRT photos + NRT website/AtC        │
│    + Cassandra EAV for structured facts)             │
│        ─── p95 < 100 ms ───                          │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Prompt composition                                │
│   (dynamic prompt assembly via semantic search over  │
│    few-shot examples + keyword-snippet extraction    │
│    via Aho-Corasick on retrieved text)               │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Answer Generation (streaming)                      │
│   Answer LLM → token stream via FastAPI SSE          │
│   OpenAI priority tier                               │
└──────────────────────┬───────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│    Answer Augmentation                                │
│    (attach relevant photos / visuals)                 │
└──────────────────────┬───────────────────────────────┘
                  UI (streamed)
┌──────────────────────────────────────────────────────┐
│    Logs + traces → Langfuse                          │
│    Daily batch: LLM-as-judge graders score sampled    │
│    Q/A pairs on Correctness + Completeness + Evidence │
│    Relevance → dataset on Langfuse                   │
└──────────────────────────────────────────────────────┘

Architectural components

Data layer

Three near-real-time indices plus a Cassandra structured-facts store, fronted by a single content-fetching API (systems/yelp-content-fetching-engine):

  • Reviews NRT index — each review is a document; <10 min freshness via streaming from SoT databases.
  • Photos NRT index — photo metadata + embeddings + photo captions; <10 min freshness via streaming.
  • Website / menu / Ask-the-Community NRT index — weekly batches (slower-moving content).
  • Cassandra structured-facts store — EAV schema (business_id, field_name, field_group, value, update_ts); the deliberately non-normalised shape is acceptable because the downstream consumer is an LLM that accepts unstructured strings. See concepts/eav-schema-for-llm-consumption.
  • Ingestion — streaming from SoT → joins/transforms → data pipeline → Cassandra + NRT indexers (reviews, photos, structured info); weekly batches for websites, menus, AtC. Replayability + idempotent upserts are required because "some datasets are derived from chains of 3-4 streams".
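To make the EAV shape concrete, here is a minimal sketch of how rows in that schema could be flattened into plain strings for the answer LLM, with last-write-wins deduplication standing in for the idempotent upserts the ingestion pipeline requires. All names and the rendering format are hypothetical; the post only specifies the five columns.

```python
from dataclasses import dataclass

@dataclass
class EavRow:
    """One structured-fact row in the EAV shape described above."""
    business_id: str
    field_name: str
    field_group: str
    value: str
    update_ts: int

def facts_to_prompt_lines(rows: list[EavRow]) -> list[str]:
    """Flatten EAV rows into plain strings an answer LLM can consume.

    Keeps only the newest row per (field_group, field_name), mimicking
    an idempotent last-write-wins upsert under stream replays.
    """
    latest: dict[tuple[str, str], EavRow] = {}
    for row in rows:
        key = (row.field_group, row.field_name)
        if key not in latest or row.update_ts > latest[key].update_ts:
            latest[key] = row
    return [f"{r.field_group} / {r.field_name}: {r.value}" for r in latest.values()]

rows = [
    EavRow("biz-1", "outdoor_seating", "amenities", "yes", 100),
    EavRow("biz-1", "outdoor_seating", "amenities", "yes, heated patio", 200),
    EavRow("biz-1", "price_range", "basics", "$$", 150),
]
print(facts_to_prompt_lines(rows))
```

Because the consumer is an LLM, the renderer never needs a per-field schema: any new field_name flows through as another string line, which is the point of the non-normalised design.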

Question analysis

Four components running in parallel via async langchain chains — the pipeline waits on the longest agent (not the sum):

  • Trust & Safety classifier (concepts/trust-and-safety-classifier) — fine-tuned GPT-4.1-nano on a few thousand question-label pairs (~50% legitimate questions mixed in as negatives). Labels unsafe questions (system attacks, illegal-activity prompts, prompt injection, etc.). On reject: templated safe answer, downstream work cancelled.
  • Inquiry Type classifier (concepts/inquiry-type-classifier) — fine-tuned GPT-4.1-nano on ~7K samples. Decides if the question is in-scope for a single-business, content-grounded answer. On out-of-scope: graceful decline + link to the right Yelp surface ("Show me good plumbers" → search; "How do I change my password?" → support).
  • Content Source Selection — returns the subset of sources (reviews / photos / menus / AtC / structured facts) to consult for this question, balancing topical relevance with subjectivity/objectivity profile (ambiance questions favour reviews; prices favour menus).
  • Keyword Generation — returns the search terms/phrases to pull from the chosen sources, with vertical-specific expansions based on business context:
      • Hair salon + "vegan options" → animal-free / cruelty-free / plant-based / vegan shampoo / vegan conditioner.
      • Mexican restaurant + "vegan options" → bean burrito / vegan tacos / dairy-free / cauliflower / nopales / tacos de pappa.
      • For generic prompts ("What should I know about this place?") it returns no keywords, so the retrieval path fetches recent content without term constraints.

Content Source Selection + Keyword Generation were originally a single LLM pass; Yelp split them into two fine-tuned models because "splitting into two small models improved precision and made failures easier to debug and lowered latency." See patterns/split-source-selection-from-keyword-generation.
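A minimal asyncio sketch of the fan-out described above — four stubbed classifiers run concurrently so latency is bounded by the slowest, with early returns on unsafe or out-of-scope questions. The classifier logic here is fake placeholder code; in production each call is a fine-tuned model, and a real implementation could additionally cancel the slower tasks the moment Trust & Safety rejects.

```python
import asyncio

# Stubbed classifiers; in production each would be a fine-tuned-model call.
async def trust_and_safety(q: str) -> bool:
    await asyncio.sleep(0.01)
    return "DROP TABLE" not in q          # True = safe (toy heuristic)

async def inquiry_type(q: str) -> bool:
    await asyncio.sleep(0.02)
    return "password" not in q            # True = in-scope (toy heuristic)

async def content_sources(q: str) -> list[str]:
    await asyncio.sleep(0.03)
    return ["reviews", "menus"]

async def keywords(q: str) -> list[str]:
    await asyncio.sleep(0.03)
    return ["vegan", "plant-based"]

async def analyse(question: str) -> dict:
    """Run all four analyses in parallel; wall-clock cost is the max,
    not the sum. Unsafe / out-of-scope questions short-circuit before
    any retrieval or generation work is scheduled."""
    safe, in_scope, sources, terms = await asyncio.gather(
        trust_and_safety(question),
        inquiry_type(question),
        content_sources(question),
        keywords(question),
    )
    if not safe:
        return {"outcome": "templated_safe_answer"}
    if not in_scope:
        return {"outcome": "redirect"}
    return {"outcome": "retrieve", "sources": sources, "keywords": terms}

print(asyncio.run(analyse("Is the patio heated?")))
```

The same shape works with langchain's async chains in place of the bare coroutines; `asyncio.gather` is the simplest stand-in for "waits on the longest agent, not the sum".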

Retrieval

  • Keyword-first for reviews / website / menus / AtC — Yelp's pre-existing IR/ranking stack + LLM keyword expansion.
  • Embedding + caption-text hybrid for photos.
  • No keywords branch for generic prompts — fetches recent content without term constraints.
  • Delivered via systems/yelp-content-fetching-engine at <100 ms p95.
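The keyword-first vs. no-keywords branching can be sketched as a tiny dispatcher over the content-fetching engine. The backend callables here are stubs I've made up for illustration; the photo embedding path is omitted.

```python
from typing import Callable

def fetch_content(
    sources: list[str],
    keywords: list[str],
    by_keywords: Callable[[str, list[str]], list[str]],
    recent: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Keyword-constrained retrieval when Keyword Generation produced
    terms; otherwise fall back to recent content (generic prompts)."""
    if keywords:
        return {s: by_keywords(s, keywords) for s in sources}
    return {s: recent(s) for s in sources}

# Stub backends standing in for the content-fetching engine.
by_kw = lambda s, kw: [f"{s} hit for {k}" for k in kw]
recent = lambda s: [f"recent {s} item"]

print(fetch_content(["reviews"], ["vegan"], by_kw, recent))
print(fetch_content(["reviews"], [], by_kw, recent))
```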

Prompt composition

  • Snippet extraction via Aho-Corasick + sliding window — the generated keywords are matched against retrieved text to extract local snippets, shrinking the user-prompt context vs. shipping full review bodies.
  • Dynamic prompt composition — instead of a single static monolithic prompt, per-request assembly retrieves only the instructions, examples, and constraints relevant to the detected question type and content sources via semantic search over the few-shot-example corpus.
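A sketch of the snippet-extraction idea: match the generated keywords against retrieved text and keep only a local window around each hit, instead of shipping full review bodies. A naive per-keyword scan stands in here for the Aho-Corasick automaton (which matches all patterns in one pass); the windowing logic is the same idea, and the window size is an invented parameter.

```python
def extract_snippets(text: str, keywords: list[str], window: int = 60) -> list[str]:
    """Return a snippet of ±window characters around every keyword hit.

    Naive scan for clarity; production would run one Aho-Corasick pass
    over the text for all keywords simultaneously.
    """
    snippets = []
    low = text.lower()
    for kw in keywords:
        start = 0
        while (i := low.find(kw.lower(), start)) != -1:
            lo = max(0, i - window)
            hi = min(len(text), i + len(kw) + window)
            snippets.append(text[lo:hi].strip())
            start = i + len(kw)
    return snippets

review = ("Came for brunch; the heated patio was lovely even in November, "
          "and the vegan tacos were a standout.")
print(extract_snippets(review, ["heated patio", "vegan"], window=20))
```

The payoff is in input tokens: two short windows reach the answer LLM instead of the whole review, which is the context-shrinkage lever cited under Cost below.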

Answer generation

  • Frontier OpenAI model for the answering step (the specific variant isn't disclosed in the post).
  • Streaming via FastAPI SSE — migrated from Pyramid so response tokens stream to the UI as soon as the LLM emits them. "This was the biggest win" for TTFT.
  • OpenAI priority tier — gives "~20% inference speedup".
  • Citations inline — the model is prompted to cite retrieved evidence.
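The streaming path can be illustrated with a stdlib-only sketch of the Server-Sent Events framing: each LLM token is wrapped in a `data:` frame and yielded immediately, so time-to-first-token is bounded by the first token rather than the full answer. The `[DONE]` sentinel and the fake token source are assumptions; in production this async generator would back a FastAPI `StreamingResponse` with media type `text/event-stream`.

```python
import asyncio
from typing import AsyncIterator

def sse_frame(token: str) -> str:
    """Wrap one token in the Server-Sent Events wire format."""
    return f"data: {token}\n\n"

async def stream_answer(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield an SSE frame as soon as the LLM emits each token."""
    async for tok in tokens:
        yield sse_frame(tok)
    yield "data: [DONE]\n\n"          # hypothetical end-of-stream sentinel

async def fake_llm() -> AsyncIterator[str]:
    """Stand-in for the answer model's token stream."""
    for tok in ["The", " patio", " is", " heated."]:
        await asyncio.sleep(0)        # stand-in for per-token model latency
        yield tok

async def main() -> list[str]:
    return [frame async for frame in stream_answer(fake_llm())]

print(asyncio.run(main()))
```

Nothing here buffers the whole answer — that is the property that made the Pyramid → FastAPI SSE migration "the biggest win" for TTFT.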

Answer augmentation

  • After text streams, relevant photos / visuals are attached to the answer.

Delivery & observability

  • SSE to the UI (token-by-token).
  • Logs + traces written per question.
  • Langfuse dataset — daily batch runs LLM-as-judge graders on sampled production traffic, computes rolling averages per quality dimension, stores as time series on Langfuse.

Quality assessment

BAA's quality is defined as a five-to-six-dimension product spec, with three dimensions mechanised as LLM-as-judge graders and the remaining two or three deferred:

Dimension                    Grader                             Labels
Correctness / Faithfulness   Yes (Langfuse batch)               CORRECT / UNVERIFIABLE / INCORRECT
Completeness / Helpfulness   Yes                                SUFFICIENT / NEEDS_FOLLOW_UP / REFUSE_UNABLE / OFF_TOPIC
Evidence Relevance           Yes                                RELEVANT / PARTIALLY_RELEVANT / NOT_RELEVANT
Conciseness                  No — spot-checked
Structure                    No — spot-checked
Tone / Style                 Deferred — judged "a lot harder"

Each automated grader was tuned on hundreds of manually-labelled gold answers; daily batch runs on sampled production traffic produce rolling averages used to catch regressions. See concepts/llm-as-judge (extended Seen-in).
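The rolling-average mechanics can be sketched as follows. The label-to-score mapping is an assumption — the post does not say how grader labels are numerically aggregated — but the shape (daily label batches → daily means → rolling mean time series) matches the description.

```python
# Hypothetical label→score mapping for the Correctness grader; the post
# does not disclose the actual aggregation.
SCORE = {"CORRECT": 1.0, "UNVERIFIABLE": 0.5, "INCORRECT": 0.0}

def rolling_averages(daily_labels: list[list[str]], window: int = 7) -> list[float]:
    """Turn daily batches of grader labels into a rolling-mean time
    series — the shape stored on Langfuse to catch regressions."""
    daily_means = [sum(SCORE[l] for l in day) / len(day) for day in daily_labels]
    out = []
    for i in range(len(daily_means)):
        chunk = daily_means[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

batches = [["CORRECT", "CORRECT", "INCORRECT"],
           ["CORRECT", "UNVERIFIABLE"],
           ["INCORRECT", "INCORRECT", "CORRECT"]]
print(rolling_averages(batches, window=2))
```

A drop in the rolling mean across consecutive batch runs is the regression signal; with a daily batch cadence, detection latency is up to one day (see Tradeoffs below).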

Suggested-question UX

  • v1: LLM-generated generic questions per business category (Mexican, bar, park). Hit "unanswerable with the available data" cases.
  • v2: generated from the specific business's own content — a handful of recent reviews + pre-generated business summary + owner description. For Parc (Philadelphia), v2 surfaces "Can you order freshly baked bread to go?" — content-grounded to Parc's signature bread basket — instead of a generic breakfast question.
  • Outcome: +~50% engagement with suggested questions; -~26% inability-to-answer rate on suggested questions.
  • See patterns/content-derived-suggested-questions.

Performance

  • Prototype: p75 10-20 s end-to-end.
  • External launch target: p75 = 5 s on the backend LLM service.
  • Shipped: p75 < 3 s.
  • Biggest wins: streaming via FastAPI SSE for TTFT; OpenAI priority tier (~20% inference speedup); async langchain for parallel question analysis; early stopping on T&S rejection.

Cost

  • Cost at 25% of prototype via:
      • Fine-tuned GPT-4.1-nano for question analysis (vs. frontier models at prototype).
      • Aho-Corasick snippet extraction to shrink the user-prompt context.
      • Biz-content cleanup on website/menu text (the largest prompt contributors).
      • Dynamic prompt composition to claw back the system-prompt growth that had eroded the input-token savings.
  • Question-answering model: Yelp explicitly stayed on frontier OpenAI models for the answer-generation step — "smaller models struggled despite extensive fine-tuning" — but migrated to a newer OpenAI model that delivered equivalent quality at lower cost. A quality grader gated the migration.

Recent-incident mitigation / debugging

Not explicitly detailed in the post beyond the general observability stance: logs + traces per question → Langfuse → daily grader batch. Yelp frames style-and-tone divergence as caught by "spot checking real traffic for edge-cases that we don't handle as we would like to" — human-in-the-loop rather than automated.

Tradeoffs / gotchas

  • Keyword-first retrieval adds ~0.8 s to the request path — the LLM keyword-expansion call is a blocking synchronous hop. Yelp accepts this for v1 on the basis that the mature keyword-search backend was production-ready sooner than an embeddings stack; embedding-based retrieval is explicitly called out as a followup.
  • Recall is weaker on non-specific user questions — keyword-first retrieval is a poor fit for "What's the ambiance here?"-type fuzzy-intent questions; embeddings are the planned mitigation.
  • Daily grader batch latency — regressions have up-to-one-day detection latency; no real-time quality alerting disclosed.
  • Style & Tone not automated — brand-voice quality relies on spot-checking + prompt-level guidance.
  • Single-point-of-failure on OpenAI — BAA is OpenAI-centric (GPT-4.1-nano fine-tuned base, priority tier, batch API). Multi-provider routing is not mentioned.
  • Prompt-size tug-of-war — Yelp's cost-optimisation cycle saved input tokens via context shrinkage, then lost those savings to system-prompt growth, then clawed them back via dynamic prompt composition. The absolute prompt-size metric isn't stable over time.
  • Head cache absent from BAA (unlike Yelp Query Understanding) — QU relies on a head-query cache because queries repeat. BAA's question + business-pair combinatorics is much higher; no analogous cache layer is mentioned. Suggested-question answers are flagged as a future short-term-cache target.
  • Conversation-context freshness — BAA augments with "recent chat history for that user-business pair", but the recency window / turn-count isn't disclosed.

Future work (flagged in the post)

  • Embeddings-based retrieval for review snippets and website content, plus ranking improvements.
  • Side-by-side business comparisons (new capability beyond single-business QA).
  • Multimedia inputs (images/voice).
  • Advanced caching — includes short-term caching of suggested-question answers to accelerate that UX.
  • More advanced dynamic prompt composition.
  • Search capability inside Yelp Assistant.
