PATTERN Cited by 1 source
Split source selection from keyword generation¶
Intent¶
When a single LLM pass is making two distinct decisions ("which sources to consult?" and "what search terms to use against them?"), split it into two small fine-tuned models. Each model can be tuned, debugged, and evaluated against the specific failure mode it's responsible for, and the two can run in parallel (patterns/parallel-pre-retrieval-classifier-pipeline) rather than serialising through one prompt.
The pattern is a Single-Responsibility decomposition applied at the LLM-prompt level: one model per decision, not one mega-prompt per request.
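A minimal sketch of the shape of that decomposition, assuming a generic inference client; `call_model` and the model names are stand-ins, not Yelp's API:

```python
# Sketch only: `call_model` stands in for whatever inference client you
# use; the model names are hypothetical.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("swap in your inference client here")

def plan_retrieval_combined(question: str, history: str) -> str:
    # Before: one mega-prompt emits both decisions in a single JSON response.
    return call_model("combined-planner", f"{history}\n{question}")

def plan_retrieval_split(question: str, history: str) -> tuple[str, str]:
    # After: one small fine-tuned model per decision. Each has its own
    # prompt, fine-tune dataset, and eval set, and the two calls can be
    # issued concurrently.
    sources = call_model("source-selector", f"{history}\n{question}")
    keywords = call_model("keyword-generator", f"{history}\n{question}")
    return sources, keywords
```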
When to apply¶
- The combined task is actually two tasks that happen to be emitted in one JSON response (source list + keyword list).
- Failure modes differ between the two sub-tasks: source-selection errors vs. keyword-quality errors look different in evals and need different fixes.
- Combined-model output is hard to debug because you can't tell which sub-decision was wrong without re-reading the whole prompt trace.
- You're running fine-tuned small models where the cost of running two is comparable to running one (both are cheap).
Mechanism¶
Decision A: Content Source Selection¶
Given the question, conversation history, and the catalog of available sources for the entity, emit the subset of sources the answerer should consult (a sketch follows this list). The selection balances:
- Topical relevance (ambiance questions want reviews, not menus).
- Subjectivity / objectivity of the source (prices want menus; "is it good for kids?" wants reviews / photos).
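As an illustration of Decision A's contract: the catalog entries and the rule-based heuristics below only stand in for what is, in production, a fine-tuned model's learned behaviour.

```python
# Illustrative contract for Decision A; the catalog and the if/else
# heuristics are assumptions, not Yelp's actual taxonomy or model.

SOURCE_CATALOG = ["reviews", "photos", "menus", "business_attributes"]

def select_sources(question: str, history: list[str]) -> list[str]:
    """Emit the subset of SOURCE_CATALOG the answerer should consult."""
    q = question.lower()
    if "price" in q or "cost" in q:
        return ["menus", "business_attributes"]  # objective -> structured sources
    if "ambiance" in q or "kids" in q:
        return ["reviews", "photos"]             # subjective -> experiential sources
    return SOURCE_CATALOG                        # default: consult everything
```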
Decision B: Keyword Generation¶
Given the question, conversation history, and business context (name, categories, location), emit a compact list of search terms to pull the right chunks from the chosen sources. Business context steers synonyms toward vertical-specific vocabulary.
A key sub-decision: emit no keywords for generic prompts. For questions like "What should I know about this place?" keywording adds noise — the downstream retrieval should fetch recent reviews/photos without term constraints.
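A sketch of Decision B's contract, including the empty-keyword behaviour for generic prompts; the generic-prompt set and the synonym choices are invented for illustration:

```python
# Illustrative contract for Decision B; in production this is a
# fine-tuned model, not a lookup table.

GENERIC_PROMPTS = {
    "what should i know about this place?",
    "tell me about this place",
}

def generate_keywords(question: str, business_category: str) -> list[str]:
    """Emit compact search terms, steered by business context.

    Returns an empty list for generic prompts so downstream retrieval
    fetches recent reviews/photos without term constraints.
    """
    if question.strip().lower() in GENERIC_PROMPTS:
        return []
    if "vegan" in question.lower():
        # Business context steers synonyms toward vertical vocabulary.
        if business_category == "hair_salon":
            return ["cruelty-free", "plant-based", "vegan products"]
        if business_category == "mexican_restaurant":
            return ["vegan", "bean burrito", "dairy-free", "nopales"]
    return [w for w in question.lower().split() if len(w) > 3]
```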
Independent tuning¶
Each model gets its own fine-tune dataset, its own eval set, and its own rollback path. A regression in keyword generation doesn't force a re-tune of source selection.
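One way to make that independence concrete is to give each model its own artifacts in a registry; the structure below is a hypothetical sketch, not Yelp's tooling:

```python
# Hypothetical per-model registry: each decision owns its dataset, eval
# set, and rollback target, so a keyword regression never forces a
# source-selection re-tune.

MODEL_REGISTRY = {
    "source-selector": {
        "finetune_dataset": "data/source_selection_v3.jsonl",
        "eval_set": "evals/source_selection.jsonl",
        "live_version": "v3",
        "rollback_version": "v2",
    },
    "keyword-generator": {
        "finetune_dataset": "data/keyword_gen_v5.jsonl",
        "eval_set": "evals/keyword_quality.jsonl",
        "live_version": "v5",
        "rollback_version": "v4",
    },
}

def rollback(model: str) -> str:
    """Roll one model back without touching the other."""
    entry = MODEL_REGISTRY[model]
    entry["live_version"] = entry["rollback_version"]
    return entry["live_version"]
```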
Canonical wiki instance — Yelp BAA (2026-03-27)¶
Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product
The prototype used a single LLM pass for (a) source selection + (b) keyword generation. In production, Yelp split them into two lightweight fine-tuned models:
"Initially we used a single LLM pass to (a) pick sources and (b) generate search terms but eventually split into two more lightweight fine-tuned models. Decoupling the two tasks lets us tune failure modes independently and debug decisions in isolation while optimizing for low latency."
"Split and parallelize tasks that look similar: 'Pick sources + generate keywords' in one pass underperformed. Splitting into two small models improved precision and made failures easier to debug and lowered latency."
The split also enabled both models to run in parallel with the Trust & Safety and Inquiry Type classifiers — see patterns/parallel-pre-retrieval-classifier-pipeline for the host pattern.
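A sketch of the resulting four-way fan-out, assuming async wrappers around each fine-tuned model; the function bodies here are stubs:

```python
import asyncio

async def trust_and_safety(q: str) -> str:
    return "safe"                   # stub for the real classifier

async def inquiry_type(q: str) -> str:
    return "objective"              # stub

async def select_sources(q: str) -> list[str]:
    return ["reviews", "menus"]     # stub for Decision A

async def generate_keywords(q: str) -> list[str]:
    return ["vegan", "dairy-free"]  # stub for Decision B

async def pre_retrieval(question: str) -> list:
    # All four run concurrently: stage latency is the slowest single
    # call, not the sum of four sequential calls.
    return await asyncio.gather(
        trust_and_safety(question),
        inquiry_type(question),
        select_sources(question),
        generate_keywords(question),
    )

# asyncio.run(pre_retrieval("Do they have vegan options?"))
```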
Vertical-specific keyword example from the Yelp post¶
The question "Do they have vegan options?" produces radically different keywords depending on the business category:
- Hair salon: animal-free, cruelty-free, plant-based, vegan shampoo, vegan conditioner, vegan styling, vegan products, vegan hair dye.
- Mexican restaurant: vegan, bean burrito, vegan tacos, dairy-free, cauliflower, beans, cactus, nopales, enchiladas verdes, tacos de papa, vegetarian.
Yelp's framing of the keyword-generation failure mode:
"The core challenge was avoiding keywords that are too vague (match everything) or too specific (match nothing). We iterated on prompts and fine-tunes to hit a sweet spot that yields diverse but tight evidence sets."
Why it works¶
- Failure-mode isolation. A bad keyword list is not a bad source list. Keeping them in one model conflates the gradient signals during fine-tuning.
- Parallelism unlock. Two independent small models can run in parallel; one combined model serialises the two decisions inside the prompt.
- Debuggability. When a bad retrieval happens, the operator can see which of the two models produced the upstream error — source list or keyword list — without re-reading the combined prompt's reasoning.
Failure modes¶
- Coupling leaks — the source-selection model's output subtly influences the keyword generator via shared conversation context. Keep the two models' inputs as independent as possible.
- Inconsistent signals between the two models: source selection picks only menus, but keyword generation emits ambiance keywords. Mitigation: the eval dataset should include ambiance-on-menu counter-examples, and a cheap cross-check like the sketch after this list can flag such mismatches.
- Over-split: if a task is genuinely one decision (e.g. classify + explain), splitting it just doubles the infra without a debuggability win. Split only when the failure modes are actually different.
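A minimal cross-check for the inconsistent-signals failure mode, assuming a hand-maintained keyword-to-source topic map; the map itself is an assumption, used only to flag mismatches:

```python
# Illustrative consistency check between the two models' outputs.
KEYWORD_TOPICS = {
    "ambiance": "reviews",
    "cozy": "reviews",
    "price": "menus",
    "vegan": "menus",
}

def inconsistent(sources: list[str], keywords: list[str]) -> list[str]:
    """Return keywords whose implied source was not selected."""
    return [kw for kw in keywords
            if kw in KEYWORD_TOPICS and KEYWORD_TOPICS[kw] not in sources]

# e.g. inconsistent(["menus"], ["ambiance"]) -> ["ambiance"]
```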
Relation to sibling patterns¶
- patterns/parallel-pre-retrieval-classifier-pipeline — the host pattern; splitting source+keyword is what turns the retrieval pre-stage from 3 agents into 4 parallel agents.
- patterns/dynamic-prompt-composition-via-semantic-retrieval — downstream pattern that consumes both split models' outputs to assemble the answer-generation prompt.
Seen in¶
- sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product — canonical wiki instance. Yelp split a single source + keyword LLM pass into two fine-tuned small models after the combined version underperformed; decoupling enabled independent failure-mode tuning and reduced latency.