
Parallel pre-retrieval classifier pipeline

Intent

Before any retrieval or generation runs, fan out the user question to multiple small classifiers in parallel — safety gate, scope gate, source selector, keyword generator. The request waits on the longest classifier, not the sum, and rejection by a gate cancels downstream work.

The pattern turns what would be a four-step serial pipeline into a one-step wide fan-out, materially reducing time-to-first-token (concepts/time-to-first-token) and saving tokens on rejected traffic.

Shape

              user question + conversation context
          ┌──────────┬──────────┬──────────┬──────────┐
          ▼          ▼          ▼          ▼
     ┌────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐
     │ T&S    │ │ Inquiry │ │ Content  │ │ Keyword  │
     │ (gate) │ │ Type    │ │ Source   │ │ Gen      │
     │        │ │ (gate)  │ │ Select   │ │          │
     └───┬────┘ └────┬────┘ └─────┬────┘ └─────┬────┘
         │           │            │            │
     unsafe?     out-of-scope?    │            │
         │           │            │            │
         └──────┬────┘            │            │
                │                 │            │
                ▼                 │            │
    templated redirect            (cancel downstream)
    (user never sees LLM)
                                  │            │
                                  └─────┬──────┘
                                    retrieval
                                  answer generation

When to apply

  • You operate an internet-facing LLM product with untrusted users. Production traffic will include prompt-injection attempts, out-of-scope prompts, and unsafe requests. See concepts/trust-and-safety-classifier and concepts/inquiry-type-classifier.
  • You have multiple independent pre-retrieval decisions that don't depend on each other's output (safety, scope, source pick, keywords). If one depends on another, use a sequential pipeline instead.
  • Latency matters. Time-to-first-token is a user-visible product metric, and a serial classifier chain adds every classifier's latency to it, hurting perceived responsiveness even when tokens-per-second is healthy.

Mechanism

Fan-out via async orchestration

Build each classifier as an independent async task. Execute them concurrently (e.g. async langchain chains, asyncio.gather, Promise.all). The longest-running classifier sets the pipeline's latency floor.
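
A minimal sketch of the fan-out in plain asyncio, with stubbed classifier calls standing in for the fine-tuned models (the names and stubs are illustrative, not Yelp's API; Yelp's version uses async langchain chains):

    import asyncio
    import random

    async def run_classifier(name: str, question: str) -> str:
        # Stand-in for a fine-tuned small-model call (a few hundred ms in practice).
        await asyncio.sleep(random.uniform(0.1, 0.3))
        return f"{name}:ok"

    async def analyze(question: str) -> dict[str, str]:
        names = ["trust_and_safety", "inquiry_type",
                 "content_source_selection", "keyword_generation"]
        # All four start at once; the await resolves when the slowest one
        # finishes, so total wait is max(latencies), not sum(latencies).
        results = await asyncio.gather(*(run_classifier(n, question) for n in names))
        return dict(zip(names, results))

    print(asyncio.run(analyze("How do I respond to a negative review?")))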

Early-cancellation on gate rejection

When a gate classifier (Trust & Safety, Inquiry Type) rejects the question, cancel in-flight work from the non-gate classifiers (source selection, keyword generation) so you don't spend tokens/CPU on work that will be thrown away. Cancellation is where async runtimes pay for themselves.
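
A sketch of the same fan-out with the gates wired for early cancellation, again plain asyncio with illustrative stubs and a toy rejection rule:

    import asyncio

    async def gate(name: str, question: str) -> bool:
        # Stubbed gate classifier; a real one calls the fine-tuned nano model.
        await asyncio.sleep(0.1)
        return "forbidden" not in question   # toy accept/reject rule

    async def classifier(name: str, question: str) -> str:
        # Stubbed non-gate classifier (source selection / keyword generation).
        await asyncio.sleep(0.3)
        return f"{name}:output"

    async def analyze(question: str):
        workers = [asyncio.create_task(classifier(n, question))
                   for n in ("content_source_selection", "keyword_generation")]
        gates = [asyncio.create_task(gate(n, question))
                 for n in ("trust_and_safety", "inquiry_type")]
        for fut in asyncio.as_completed(gates):
            if not await fut:
                # A gate rejected: cancel in-flight non-gate work so no more
                # tokens/CPU go to a response that will never ship.
                for w in workers:
                    w.cancel()
                await asyncio.gather(*workers, return_exceptions=True)
                return None   # caller serves a templated fallback instead
        return await asyncio.gather(*workers)

    print(asyncio.run(analyze("a forbidden request")))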

Fallback templates for rejected traffic

Gate rejections route the user to pre-written templated responses — polite decline + redirect to the appropriate channel ("How do I change my password?" → support page link; "Show me good plumbers" → Yelp search). Users never see an LLM-generated response for rejected paths.
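
One way to wire the rejection paths, with hypothetical reason keys and placeholder copy (not Yelp's actual templates):

    # Rejection reason -> pre-written response; keys and wording are illustrative.
    FALLBACK_TEMPLATES = {
        "unsafe": "Sorry, I can't help with that request.",
        "out_of_scope": ("I can only answer questions about your business. "
                         "For help with your account, see the support page."),
    }

    def rejected_response(reason: str) -> str:
        # No LLM call on this path: the user always gets the canned copy.
        return FALLBACK_TEMPLATES[reason]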

Canonical wiki instance — Yelp BAA question-analysis (2026-03-27)

Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product

Yelp runs four classifiers in parallel as the BAA question-analysis stage:

Classifier                                             Role                                 Model
Trust & Safety (concepts/trust-and-safety-classifier)  Gate — drop unsafe                   fine-tuned GPT-4.1-nano
Inquiry Type (concepts/inquiry-type-classifier)        Gate — drop out-of-scope             fine-tuned GPT-4.1-nano
Content Source Selection                               Pick subset of sources to consult    fine-tuned small
Keyword Generation                                     Emit vertical-specific search terms  fine-tuned small

Implementation substrate: async langchain chains. The post:

"We built the question analysis agents as asynchronous chains invoked through langchain, which meant that they can run in parallel and we just need to wait for the longest agent to complete. We also added early stopping to the components in the pipeline when the trust and safety classifier rejects the question to avoid waiting longer before responding."

Latency-budget impact: directly contributes to Yelp's p75 drop from 10-20 s to <3 s. Source selection and keyword generation were split into two small models rather than one combined model — see patterns/split-source-selection-from-keyword-generation.

Why it works

  • Classifier-model latency (a few hundred ms on a fine-tuned small model) << retrieval + generation latency. Running four of them in parallel costs roughly the same wall-clock time as running one.
  • Fine-tuned small models are cheap enough to fan out. A frontier-model fan-out would have different economics.
  • Gate rejection is a common-enough case in internet-facing traffic that cancelling saves real tokens — it's not an edge-case optimization.

Failure modes

  • Gate classifier false positives reject legitimate questions. Mitigation: log rejected questions, re-label, and re-fine-tune periodically. Yelp's fine-tuning datasets included "~50% questions that should be considered legitimate" as negatives specifically to control the false-positive rate.
  • Source-selection model emits the wrong subset — the question is answerable but the chosen sources don't contain the evidence. Mitigation: evidence-relevance grader catches this offline.
  • Keyword generator returns keywords too vague or too specific. Yelp names this explicitly: "the core challenge was avoiding keywords that are too vague (match everything) or too specific (match nothing)." Mitigation: iterative prompt tuning against a labelled set.
  • Cancellation race: a classifier finishes emitting output just as the cancellation arrives. The orchestration layer must tolerate either outcome cleanly; see the sketch after this list.
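
In asyncio terms, one way to tolerate the race (a sketch): cancelling an already-finished task is a no-op, so the drain step just has to accept either a result or CancelledError.

    import asyncio

    async def drain(task: asyncio.Task):
        # The task either already finished (cancel() is a no-op and the result
        # comes back) or gets cancelled mid-flight; both are discarded cleanly.
        task.cancel()
        try:
            return await task
        except asyncio.CancelledError:
            return None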

Relation to sibling patterns

patterns/split-source-selection-from-keyword-generation — the two non-gate classifiers in this fan-out (source selection, keyword generation) are themselves the result of splitting one combined model in two.

Seen in

sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product