
Yelp LLM-Assisted Customer Success Chatbot¶

The LLM-Assisted Customer Success (CS) Chatbot is Yelp's production replacement for its legacy support chatbot, which combined a two-step menu tree with fixed-phrase matching. It was disclosed in the 2026-05-27 Yelp Engineering post by the Customer & Sales Intelligence Team (Source: sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot).

Architecture — two structurally distinct halves¶

(1) Specialized-workflow router¶

An LLM classifies the inbound query into one of five named workflows. The buckets are picked along two axes: frequency of inbound requests and risk class (churn / legal / financial).

Workflow	Mechanism	Risk class
Question/Answering (QA)	Default; RAG-driven LLM generation	Standard
Billing	Deterministic UI showing subscribed services + promotional balances	Low
Refund	Guides user through refund-form submission	Standard
Cancel	Templated response (no LLM generation)	High financial / legal
Review	Templated response (no LLM generation)	High financial / legal

The structural design choice: the LLM is the router; specialised handlers do the work. The LLM generates free-form text only in the QA workflow: Cancel and Review return canned text, Billing returns deterministic UI, and Refund guides the user through a form. This is a canonical instance of LLM workflow router / specialized-workflow router with LLM intent detection.
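A minimal sketch of the router shape described above, in which the classifier only picks a label and a dedicated handler produces the response. Everything named here is hypothetical: `classify_intent` is a keyword stub standing in for the LLM call, and the handler names, response fields, and canned strings are illustrative, not taken from the post.

```python
from typing import Callable, Dict

Response = Dict[str, str]


def classify_intent(query: str) -> str:
    """Stand-in for the LLM classifier (assumption: keyword stub, not a model call)."""
    lowered = query.lower()
    for label in ("billing", "refund", "cancel", "review"):
        if label in lowered:
            return label
    return "qa"  # QA is the default workflow


def handle_qa(query: str) -> Response:
    # Only this handler would call the LLM to generate free-form text.
    return {"kind": "llm_text", "body": f"[RAG answer for: {query}]"}


def handle_billing(query: str) -> Response:
    return {"kind": "ui", "body": "subscribed services + promotional balances"}


def handle_refund(query: str) -> Response:
    return {"kind": "form", "body": "refund-form walkthrough"}


def handle_cancel(query: str) -> Response:
    return {"kind": "template", "body": "canned cancellation text"}


def handle_review(query: str) -> Response:
    return {"kind": "template", "body": "canned review-policy text"}


HANDLERS: Dict[str, Callable[[str], Response]] = {
    "qa": handle_qa,
    "billing": handle_billing,
    "refund": handle_refund,
    "cancel": handle_cancel,
    "review": handle_review,
}


def route(query: str) -> Response:
    """Classify first, then dispatch; unknown labels fall back to QA."""
    label = classify_intent(query)
    return HANDLERS.get(label, handle_qa)(query)
```

In this arrangement a misbehaving generator cannot affect the high-risk workflows, because the Cancel and Review handlers never invoke it.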

(2) RAG pipeline (the QA workflow)¶

Four-step inference flow:

Embed the query with text-embedding-ada-002 (1,536 dim).
Similarity-search the in-memory FAISS vectorstore.
Inject top-5 unique articles into the LLM prompt with instructions "to generate an answer only based on the provided context".
Validate the LLM output: trust & safety checks, valid URL checks, character-limit checks. Three-axis output gate.
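The four steps above can be sketched as a small pipeline with injected dependencies. This is an illustration under stated assumptions: `embed`, `search`, `generate`, and the `checks` are caller-supplied stand-ins for the embedding call, the vectorstore lookup, the LLM call, and the three output checks; `FALLBACK` and returning it when a check fails are assumptions, since the post does not say what happens to a rejected answer.

```python
from typing import Callable, List, Sequence

FALLBACK = "Sorry, I could not find an answer. Please contact support."  # hypothetical


def build_prompt(query: str, articles: Sequence[str]) -> str:
    """Inject retrieved articles and restrict the model to that context."""
    context = "\n\n".join(articles)
    return (
        "Generate an answer only based on the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def answer(
    query: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float]], List[str]],
    generate: Callable[[str], str],
    checks: Sequence[Callable[[str], bool]],
) -> str:
    vector = embed(query)                            # step 1: embed the query
    articles = search(vector)[:5]                    # step 2: top-5 unique articles
    draft = generate(build_prompt(query, articles))  # step 3: grounded generation
    if all(check(draft) for check in checks):        # step 4: output gate
        return draft
    return FALLBACK
```

Passing the checks as a list keeps the gate extensible: a trust and safety check, a URL check, and a character-limit check are each one callable returning `True` when the draft passes.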

Vectorstore construction — metadata-only embedding¶

The post's load-bearing architectural disclosure. Yelp embeds the article's metadata as separate segments rather than chunking the article body. Per article, the segments are:

Title — one segment.
Summary — one segment.
Each top header — one segment per header.
Plus: a subset of historical intents/responses from the legacy chatbot (mined from menu-tree paths).

Each segment is embedded independently by ada-002 into a 1,536-dim unit vector. The whole article remains the retrieval unit: a lookup resolves to one of ~370 articles, not to one of many body chunks per article. Only the embedding signal is metadata-derived.
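A minimal sketch of the segment construction described above, assuming a hypothetical article dictionary with `id`, `title`, `summary`, `headers`, and `body` keys (the post does not disclose the actual schema). The legacy-chatbot intents are omitted for brevity. The point to notice is that the body is never turned into a segment, while every segment carries the ID of the article it resolves to.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    article_id: str  # retrieval resolves back to the whole article
    kind: str        # "title", "summary", or "header"
    text: str        # the only text that gets embedded


def build_segments(article: Dict) -> List[Segment]:
    """One segment for the title, one for the summary, one per top header."""
    article_id = article["id"]
    segments = [
        Segment(article_id, "title", article["title"]),
        Segment(article_id, "summary", article["summary"]),
    ]
    segments += [Segment(article_id, "header", h) for h in article.get("headers", [])]
    return segments


example = {
    "id": "a-1",
    "title": "Reset your password",
    "summary": "How to regain access to your account.",
    "headers": ["From the website", "From the mobile app"],
    "body": "A long article body that is deliberately not embedded.",
}
segments = build_segments(example)
```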

Why this works — the post's verbatim explanation:

"embedding large texts, such as entire articles or long paragraphs, can dilute the information signal, often leading to less accurate semantic matching. Concatenating too much text was observed to cause the semantic distances between vectors to get 'farther apart' in the embedding space because the key phrase we wanted to detect was mixed with too many unrelated words."

Worked example (verbatim):

"For instance, comparing a user's query 'reset password' against a vector representing an entire 500-word article on account management (which only mentions 'reset password' once) yields a poor match score because the signal is diluted."

The complementary failure mode of chunk-too-small is also disclosed:

"splitting the text into smaller chunks such as short paragraphs or sentences […] didn't perform well for our use case either since it may have resulted in too many false candidates."

Metadata-only embedding sits between these two extremes. Canonical instances of concepts/embedding-signal-dilution / concepts/metadata-only-embedding / concepts/whole-article-retrieval / patterns/whole-article-retrieval-via-metadata-segments.
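The dilution effect can be illustrated with a toy model. The sketch below uses bag-of-words cosine similarity, which is an assumption made purely for illustration and is not how ada-002 computes embeddings; the article text is synthetic filler. It shows the same qualitative behaviour the post describes: the query matches a short title far better than a long body that mentions the phrase once.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


query = "reset password"
title = "reset your password"
# Synthetic long article: the key phrase once, plus 200 unrelated tokens.
body = title + " " + " ".join(f"filler{i}" for i in range(200))

title_score = cosine(query, title)
body_score = cosine(query, body)
```

Here `title_score` is roughly 0.82 and `body_score` roughly 0.10: the matching terms are identical in both cases, but the long text's norm grows with every unrelated word.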

Operational substrate¶

Property	Value
Knowledge base	~370 Yelp Support Center articles
Embedding model	OpenAI text-embedding-ada-002
Embedding dim	1,536
Total vectorstore	~8 MB on disk
Residency	In-memory inside the chatbot container
ANN engine	FAISS (smart indexing + quantization)
Update cadence	Daily batch → CSV → AWS S3
Index build	Container start (during health check)
Retrieval target	Top 5 unique articles
Over-fetch factor	`k = max_items_per_article × 5`
Quality	~94% recall@5 on evaluation dataset
A/B outcome	Doubled resolution rate vs legacy chatbot

Retrieval algorithm¶

The multi-segment-to-unique-article aggregation primitive:

Embed query.
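Find k = max_items_per_article × 5 nearest vectors. Assuming max_items_per_article is the largest number of segments any single article contributes, any k results must span at least five distinct articles, so the × 5 over-fetch factor ensures top-5 unique articles can be assembled even when several segments from the same article rank highly.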
Apply an empirically-determined similarity threshold to filter low-confidence matches (prevents garbage from polluting LLM context).
Dedupe to unique articles (each article appears at most once in the result list).
Take top-5 unique articles.
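The steps above can be sketched as follows. This is a brute-force illustration, not Yelp's code: it scores every segment with a plain dot product instead of querying FAISS, and the function name, the `(article_id, vector)` pair layout, and the `top_n` parameter are hypothetical. With unit vectors, as ada-002 produces, the dot product equals cosine similarity.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_unique_articles(
    query_vec: Sequence[float],
    segments: Sequence[Tuple[str, Sequence[float]]],  # (article_id, vector)
    max_items_per_article: int,
    threshold: float,
    top_n: int = 5,
) -> List[str]:
    # Over-fetch so duplicates from one article cannot crowd out the others.
    k = max_items_per_article * top_n
    scored = sorted(
        ((dot(query_vec, vec), article_id) for article_id, vec in segments),
        reverse=True,
    )[:k]

    result: List[str] = []
    for score, article_id in scored:
        if score < threshold:
            break  # sorted descending, so all remaining scores are lower
        if article_id not in result:  # dedupe: each article at most once
            result.append(article_id)
        if len(result) == top_n:
            break
    return result
```

Note that the threshold is applied after over-fetching, so the result can contain fewer than `top_n` articles when few segments are confident matches.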

Daily update pipeline¶

Verbatim from the post:

"1. Fetching and Processing: The job fetches updated articles from an internal endpoint. It converts the articles to markdown format, extracts the necessary headers, and constructs a CSV file containing all the candidate text and metadata. 2. Storage: This generated CSV file is uploaded to AWS S3 daily. 3. Loading: When the chatbot's container starts, it downloads the latest CSV from S3. The vectorstore, including the construction of the index and calculation of fresh embeddings for the newly fetched articles and metadata, is then dynamically loaded into memory during the health check process, ensuring the system operates with the freshest information."

Three structural choices:

Batch-not-stream — daily, not minute-level. Acceptable because Support Center articles change on the order of days, not seconds.
Data-not-index in S3 — the durable artifact in S3 is the CSV (text + metadata), not the FAISS index. The index is rebuilt every container start. This avoids the index-format-versioning problem.
Load-at-health-check — not lazy on first request. Container does not advertise as healthy until the vectorstore is loaded; ensures zero cold-start latency in the request path.

Canonical instance of daily S3 vectorstore update + in-memory vectorstore loaded at container start.
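A minimal sketch of the load-at-health-check choice, assuming a hypothetical `VectorstoreService` whose `fetch_csv` callable stands in for the S3 download. The CSV column names and the HTTP-style status codes are illustrative assumptions; a real service would also compute embeddings and build the FAISS index inside `load`.

```python
import csv
import io
from typing import Callable, Dict, List, Optional


class VectorstoreService:
    def __init__(self, fetch_csv: Callable[[], str]) -> None:
        self._fetch_csv = fetch_csv  # hypothetical: returns the latest CSV text
        self._rows: Optional[List[Dict[str, str]]] = None

    def load(self) -> None:
        """Parse the CSV artifact; embedding and index build would happen here."""
        text = self._fetch_csv()
        self._rows = list(csv.DictReader(io.StringIO(text)))

    @property
    def size(self) -> int:
        return len(self._rows or [])

    def health(self) -> int:
        """Report 503 until the store is loaded, then 200."""
        if self._rows is None:
            try:
                self.load()
            except Exception:
                return 503  # not ready: keep the container out of rotation
        return 200


service = VectorstoreService(lambda: "article_id,kind,text\na-1,title,Reset your password\n")
```

Because the orchestrator only routes traffic to containers reporting healthy, the first user request never pays the load cost.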

Hyperlink hallucination + validation¶

The post's most notable unexpected disclosure. Verbatim:

"One of the most notable unexpected challenges was the tendency of Large Language Models (LLMs) to hallucinate hyperlinks frequently. Since our knowledge base articles contain numerous hyperlinks, and we intended for the LLM-generated responses to include accurate links, this required a dedicated solution. To counteract this, we developed a process to reliably retrieve valid hyperlinks from the source articles and integrated specific validation checks. This verification process ensures that any link included in the final response genuinely originates from one of the retrieved Support Center articles and is not invented by the LLM."

The mitigation is structural, not prompt-engineering: extract URLs from the retrieved-context articles into an allowlist, validate every URL in the LLM output against the allowlist, strip/reject anything else. Canonical instance of hyperlink allowlist validation on LLM output and the LLM hyperlink hallucination sub-class of concepts/llm-hallucination.
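The allowlist check described above can be sketched as follows. The regular expression, function names, and the choice to strip (rather than reject the whole answer) are assumptions for illustration; the post does not disclose how URLs are extracted or what happens to an answer containing an invented link.

```python
import re
from typing import Iterable, List, Set, Tuple

# Simplified URL pattern (assumption): stops at whitespace and brackets.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")
TRAILING = ".,;:!?"


def extract_urls(text: str) -> List[str]:
    """Find URLs and drop sentence punctuation stuck to their end."""
    return [u.rstrip(TRAILING) for u in URL_RE.findall(text)]


def build_allowlist(articles: Iterable[str]) -> Set[str]:
    """Every URL that literally appears in a retrieved article."""
    return {url for article in articles for url in extract_urls(article)}


def strip_unlisted_links(answer: str, allowlist: Set[str]) -> Tuple[str, List[str]]:
    """Remove any URL not in the allowlist; return cleaned text and rejects."""
    rejected: List[str] = []

    def _check(match: "re.Match[str]") -> str:
        raw = match.group(0)
        url = raw.rstrip(TRAILING)
        if url in allowlist:
            return raw
        rejected.append(url)
        return raw[len(url):]  # keep only the trailing punctuation

    return URL_RE.sub(_check, answer), rejected


allow = build_allowlist(["Reset here: https://example.com/reset."])
cleaned, rejected = strip_unlisted_links(
    "Go to https://example.com/reset or https://example.com/made-up.", allow
)
```

Exact string membership is deliberately strict: a link that differs from the source by one character is treated as invented.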

Outcome¶

"By transforming a rigid, rule-based system into a dynamic, conversational agent, we doubled the chatbot resolution rate based on the A/B test result."

The 2× resolution-rate number is the only quantitative business outcome disclosed. The post gives no latency, cost, error-rate, or per-workflow breakdown.

Future-work flags¶

The post enumerates four ongoing improvement axes:

Dataset expansion — more query examples in the evaluation dataset.
Keyword experimentation — "using article keywords — a feature recently made available through the internal endpoint — to improve the accuracy of the article retrieval process."
Adding more documents — internal business-owner support articles, a business glossary, and chat transcripts with human support agents (architecturally the most interesting expansion: turning conversation logs into RAG substrate).
Optimizing context size — "increasing the number of retrieved articles may lead to a slight performance improvement. We can further search for the 'sweet spot' — the optimal number of articles that provides rich context without confusing the LLM or diluting the signal." The retrieval-context-size sweet spot as an open optimisation question.

Sibling Yelp LLM systems¶

systems/yelp-query-understanding — Yelp's 2025-02-04 search-query-understanding LLM serving architecture. Architectural-shape sibling: specialised-handler-per-task also organises that system (segmentation / spell correction / review-highlight expansion are the search-altitude analogues of QA / Billing / Refund / Cancel / Review). The CS Chatbot generalises the specialised-handler shape from search-query subtypes to customer-support intent classes.

Seen in¶

sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot