
Parallel pre-retrieval classifier pipeline

Intent

Before any retrieval or generation runs, fan out the user question to multiple small classifiers in parallel — safety gate, scope gate, source selector, keyword generator. The request waits on the longest classifier, not the sum, and rejection by a gate cancels downstream work.

The pattern turns what would be a four-step serial pipeline into a one-step wide fan-out, materially reducing time-to-first-token (concepts/time-to-first-token) and saving tokens on rejected traffic.

Shape

              user question + conversation context
          ┌──────────┬──────────┬──────────┬──────────┐
          ▼          ▼          ▼          ▼
     ┌────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐
     │ T&S    │ │ Inquiry │ │ Content  │ │ Keyword  │
     │ (gate) │ │ Type    │ │ Source   │ │ Gen      │
     │        │ │ (gate)  │ │ Select   │ │          │
     └───┬────┘ └────┬────┘ └─────┬────┘ └─────┬────┘
         │           │            │            │
     unsafe?     out-of-scope?    │            │
         │           │            │            │
         └──────┬────┘            │            │
                │                 │            │
                ▼                 │            │
    templated redirect            (cancel downstream)
    (user never sees LLM)
                                  │            │
                                  └─────┬──────┘
                                    retrieval
                                  answer generation

When to apply

  • You operate an internet-facing LLM product with untrusted users. Production traffic will include prompt-injection attempts, out-of-scope prompts, and unsafe requests. See concepts/trust-and-safety-classifier and concepts/inquiry-type-classifier.
  • You have multiple independent pre-retrieval decisions that don't depend on each other's output (safety, scope, source pick, keywords). If one depends on another, use a sequential pipeline instead.
  • Latency matters. Time-to-first-token is a user-visible product metric, and a serial classifier chain adds every classifier's latency to it, hurting perceived responsiveness even when tokens-per-second is healthy.

Mechanism

Fan-out via async orchestration

Build each classifier as an independent async task. Execute them concurrently (e.g. async langchain chains, asyncio.gather, Promise.all). The longest-running classifier sets the pipeline's latency floor.
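
A minimal sketch of the fan-out in plain asyncio, with stubbed classifier calls standing in for the fine-tuned models (the names and stubs are illustrative, not Yelp's API; Yelp's version uses async langchain chains):

    import asyncio
    import random

    async def run_classifier(name: str, question: str) -> str:
        # Stand-in for a fine-tuned small-model call (a few hundred ms in practice).
        await asyncio.sleep(random.uniform(0.1, 0.3))
        return f"{name}:ok"

    async def analyze(question: str) -> dict[str, str]:
        names = ["trust_and_safety", "inquiry_type",
                 "content_source_selection", "keyword_generation"]
        # All four start at once; the await resolves when the slowest one
        # finishes, so total wait is max(latencies), not sum(latencies).
        results = await asyncio.gather(*(run_classifier(n, question) for n in names))
        return dict(zip(names, results))

    print(asyncio.run(analyze("How do I respond to a negative review?")))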

Early-cancellation on gate rejection

When a gate classifier (Trust & Safety, Inquiry Type) rejects the question, cancel in-flight work from the non-gate classifiers (source selection, keyword generation) so you don't spend tokens/CPU on work that will be thrown away. Cancellation is where async runtimes pay for themselves.
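
A sketch of the same fan-out with the gates wired for early cancellation, again plain asyncio with illustrative stubs and a toy rejection rule:

    import asyncio

    async def gate(name: str, question: str) -> bool:
        # Stubbed gate classifier; a real one calls the fine-tuned nano model.
        await asyncio.sleep(0.1)
        return "forbidden" not in question   # toy accept/reject rule

    async def classifier(name: str, question: str) -> str:
        # Stubbed non-gate classifier (source selection / keyword generation).
        await asyncio.sleep(0.3)
        return f"{name}:output"

    async def analyze(question: str):
        workers = [asyncio.create_task(classifier(n, question))
                   for n in ("content_source_selection", "keyword_generation")]
        gates = [asyncio.create_task(gate(n, question))
                 for n in ("trust_and_safety", "inquiry_type")]
        for fut in asyncio.as_completed(gates):
            if not await fut:
                # A gate rejected: cancel in-flight non-gate work so no more
                # tokens/CPU go to a response that will never ship.
                for w in workers:
                    w.cancel()
                await asyncio.gather(*workers, return_exceptions=True)
                return None   # caller serves a templated fallback instead
        return await asyncio.gather(*workers)

    print(asyncio.run(analyze("a forbidden request")))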

Fallback templates for rejected traffic

Gate rejections route the user to pre-written templated responses — polite decline + redirect to the appropriate channel ("How do I change my password?" → support page link; "Show me good plumbers" → Yelp search). Users never see an LLM-generated response for rejected paths.
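
One way to wire the rejection paths, with hypothetical reason keys and placeholder copy (not Yelp's actual templates):

    # Rejection reason -> pre-written response; keys and wording are illustrative.
    FALLBACK_TEMPLATES = {
        "unsafe": "Sorry, I can't help with that request.",
        "out_of_scope": ("I can only answer questions about your business. "
                         "For help with your account, see the support page."),
    }

    def rejected_response(reason: str) -> str:
        # No LLM call on this path: the user always gets the canned copy.
        return FALLBACK_TEMPLATES[reason]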

Canonical wiki instance — Yelp BAA question-analysis (2026-03-27)

Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product

Yelp runs four classifiers in parallel as the BAA question-analysis stage:

Classifier                                             Role                                 Model
Trust & Safety (concepts/trust-and-safety-classifier)  Gate — drop unsafe                   fine-tuned GPT-4.1-nano
Inquiry Type (concepts/inquiry-type-classifier)        Gate — drop out-of-scope             fine-tuned GPT-4.1-nano
Content Source Selection                               Pick subset of sources to consult    fine-tuned small
Keyword Generation                                     Emit vertical-specific search terms  fine-tuned small

Implementation substrate: async langchain chains. The post:

"We built the question analysis agents as asynchronous chains invoked through langchain, which meant that they can run in parallel and we just need to wait for the longest agent to complete. We also added early stopping to the components in the pipeline when the trust and safety classifier rejects the question to avoid waiting longer before responding."

Latency-budget impact: directly contributes to Yelp's p75 drop from 10-20 s to <3 s. Source selection and keyword generation were split into two small models rather than one combined model — see patterns/split-source-selection-from-keyword-generation.

Why it works

  • Classifier-model latency (a few hundred ms on a fine-tuned small model) << retrieval + generation latency. Running four of them in parallel costs roughly the same wall-clock time as running one.
  • Fine-tuned small models are cheap enough to fan out. A frontier-model fan-out would have different economics.
  • Gate rejection is a common-enough case in internet-facing traffic that cancelling saves real tokens — it's not an edge-case optimization.

Failure modes

  • Gate classifier false positives reject legitimate questions. Mitigation: log rejected questions, re-label, and re-fine-tune periodically. Yelp's fine-tuning datasets included "~50% questions that should be considered legitimate" as negatives specifically to control the false-positive rate.
  • Source-selection model emits the wrong subset — the question is answerable but the chosen sources don't contain the evidence. Mitigation: evidence-relevance grader catches this offline.
  • Keyword generator returns keywords too vague or too specific. Yelp names this explicitly: "the core challenge was avoiding keywords that are too vague (match everything) or too specific (match nothing)." Mitigation: iterative prompt tuning against a labelled set.
  • Cancellation race: a classifier finishes emitting output just as the cancellation arrives. The orchestration layer must tolerate either outcome cleanly; see the sketch after this list.
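
In asyncio terms, one way to tolerate the race (a sketch): cancelling an already-finished task is a no-op, so the drain step just has to accept either a result or CancelledError.

    import asyncio

    async def drain(task: asyncio.Task):
        # The task either already finished (cancel() is a no-op and the result
        # comes back) or gets cancelled mid-flight; both are discarded cleanly.
        task.cancel()
        try:
            return await task
        except asyncio.CancelledError:
            return None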

Relation to sibling patterns

patterns/split-source-selection-from-keyword-generation — the two non-gate classifiers in this fan-out (source selection, keyword generation) are themselves the result of splitting one combined model in two.

Seen in

sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product