CONCEPT
LLM cascade¶
Definition¶
An LLM cascade is a cost-vs-quality routing pattern in which the cheapest adequate LLM runs first, and a progressively more expensive, more capable LLM runs only if the cheap one's output doesn't clear a confidence threshold.
Structurally: cheap_LLM(x) → confidence(x) → if confidence ≥ τ return output else strong_LLM(x). The cascade exploits the fact that most inputs are easy (the cheap LLM suffices) and only a tail of hard inputs needs the expensive model.
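The routing logic above can be sketched in a few lines. This is a minimal illustration, not PARSE's implementation; `cheap_llm` and `strong_llm` are assumed callables that return an output (the cheap one also returning a self-reported confidence):

```python
def cascade(x, cheap_llm, strong_llm, tau=0.8):
    """Two-tier LLM cascade: run the cheap model first and escalate
    to the strong model only when confidence falls below tau."""
    output, confidence = cheap_llm(x)  # cheap model + confidence signal
    if confidence >= tau:
        return output                  # easy input: cheap answer suffices
    return strong_llm(x)               # hard tail: pay for the strong model
```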
The post's references [3] ("LARGE LANGUAGE MODEL CASCADES WITH MIXTURE OF THOUGHT REPRESENTATIONS FOR COST-EFFICIENT REASONING") and [1] ("FrugalGPT") are the canonical academic formulations.
Why it works¶
Two empirical observations compound:
- Most inputs are structurally easy. For LLM extraction, classification, or generation tasks, the distribution of difficulty is heavy-tailed: ~70-90% of inputs can be handled by a model ~10× cheaper than the "best" model, and only the hard tail actually needs the expensive one.
- A confidence signal correlates with difficulty. If the cheap LLM outputs a confidence score (via self-verification or a separate classifier head), low-confidence outputs are a good proxy for "expensive LLM would do better" — so thresholding gives you a cheap router.
The result: most inputs cost O(cheap-LLM) tokens, and the tail costs O(cheap + expensive). Aggregate cost is closer to cheap-LLM-only than to expensive-LLM-only, at quality close to expensive-LLM-only.
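Concretely: with escalation rate p, the expected per-input cost is c_cheap + p·c_strong, since every input pays the cheap model and only escalated inputs also pay the strong one. A quick sketch with illustrative numbers (the 10× cost ratio echoes the text; the 20% escalation rate is an assumption):

```python
def expected_cost(c_cheap, c_strong, escalation_rate):
    """Expected per-input cost of a two-tier cascade: every input pays the
    cheap model; the escalated fraction also pays the strong model."""
    return c_cheap + escalation_rate * c_strong

# Illustrative: strong model 10x the cheap model's cost, 20% escalation.
cascade_cost = expected_cost(c_cheap=1.0, c_strong=10.0, escalation_rate=0.2)
# 1.0 + 0.2 * 10.0 = 3.0, i.e. 70% cheaper than strong-only (10.0)
```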
Per-attribute / per-task cascade selection is load-bearing¶
From PARSE: the same cheap LLM that gives equivalent quality at a 70% cost reduction for a simple attribute (organic claim) suffers a 60% accuracy drop on a hard attribute (low_sugar claim). A platform that ships one cascade configuration globally leaves money on the table for the easy tasks AND quality on the table for the hard ones. PARSE's answer: expose cascade config per attribute — a per-task (not per-platform) model-size decision.
From the post:
"For simpler attributes, a cheaper but less powerful LLM delivered similar quality to more powerful ones at a 70% cost reduction. However, for difficult attributes, the less powerful LLMs suffered from a 60% accuracy drop. This emphasizes the importance of selecting the right extraction model to balance cost and quality effectively."
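A per-attribute cascade config might look like the following sketch. The attribute names come from the post; the model names, thresholds, and config shape are hypothetical:

```python
# Per-attribute cascade config: the model-size decision is per-task,
# not platform-wide. Model names and thresholds are illustrative.
CASCADE_CONFIG = {
    # Simple attribute: cheap-only delivered similar quality at 70% cost cut.
    "organic":   {"cheap_model": "small-llm", "strong_model": None,        "tau": 0.0},
    # Hard attribute: cheap-only dropped 60% accuracy, so escalate aggressively.
    "low_sugar": {"cheap_model": "small-llm", "strong_model": "large-llm", "tau": 0.9},
}

def models_for(attribute):
    """Look up the cascade stages and threshold for one attribute."""
    cfg = CASCADE_CONFIG[attribute]
    return cfg["cheap_model"], cfg["strong_model"], cfg["tau"]
```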
Typical cascade topologies¶
- Two-tier: cheap → expensive (the canonical form).
- Three-tier: tiny on-device / fine-tuned → hosted cheap → hosted frontier.
- Modality cascade: text-only LLM first; escalate to multi-modal LLM only if text alone is ambiguous — the extraction-focused form used for multi-modal attribute extraction.
- Self-verification gate: one-tier cascade where the "escalation" is to a [[patterns/low-confidence-to-human-review|human reviewer]] rather than another LLM.
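The two- and three-tier forms generalise to a chain of (model, threshold) stages, run cheapest-first. A minimal sketch, assuming each stage callable returns (output, confidence):

```python
def multi_tier_cascade(x, stages):
    """Run stages cheapest-first; return the first output whose confidence
    clears that stage's threshold. The final stage always answers."""
    for model, tau in stages[:-1]:
        output, confidence = model(x)
        if confidence >= tau:
            return output          # confident enough: stop escalating
    final_model, _ = stages[-1]
    return final_model(x)[0]       # last tier is the backstop
```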
Tradeoffs / gotchas¶
- Cascade only pays off if the cheap model's failures are concentrated on hard inputs. If the cheap model randomly fails on 15% of inputs regardless of difficulty, the cascade has to escalate 15% for no correlated quality gain — the expensive LLM runs on easy inputs.
- Threshold tuning is a per-task exercise. Too high → escalation on most inputs (loses the cost win). Too low → ship bad outputs from the cheap model (loses the quality win).
- Latency in the escalation path. Inputs that escalate pay cheap + expensive latency, not just cheap — the escalation fraction and escalation latency both matter for p95.
- Confidence-score calibration is critical. A well-calibrated cheap model makes the cascade work; a poorly-calibrated one effectively randomises the router. See concepts/llm-self-verification calibration caveat.
- Observability: track escalation rate per task. If it drifts upward, either the cheap model degraded, the input distribution changed, or the threshold is wrong.
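The threshold-tuning exercise above can be framed as a sweep over τ on a labelled validation set, trading escalation rate against cascade accuracy. A sketch, assuming each validation record carries the cheap model's confidence plus correctness flags for both models (the data shape is an assumption):

```python
def sweep_thresholds(val, taus):
    """For each threshold tau, compute (tau, escalation_rate, accuracy) over a
    validation set of (cheap_confidence, cheap_correct, strong_correct) triples."""
    results = []
    for tau in taus:
        escalated = [v for v in val if v[0] < tau]   # routed to strong model
        kept = [v for v in val if v[0] >= tau]       # cheap answer shipped
        correct = sum(v[1] for v in kept) + sum(v[2] for v in escalated)
        results.append((tau, len(escalated) / len(val), correct / len(val)))
    return results
```

Plotting escalation rate against accuracy across the sweep makes the "too high / too low" failure modes visible as the two ends of the curve.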
Seen in¶
- sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — canonical wiki instance for structured-extraction cost reduction. Instacart's PARSE exposes the LLM choice (and an "LLM cascade algorithm" per the post's own language) as a per-attribute config so that simple attributes run on cheap LLMs and hard attributes reserve the expensive model. 70% cost reduction on organic; 60% accuracy drop on low_sugar with cheap-only, motivating the cascade.
Related¶
- concepts/llm-self-verification — produces the confidence score the cascade router thresholds on.
- concepts/model-agnostic-ml-platform — the platform stance that makes model-swap cheap enough for per-task cascade config.
- patterns/llm-attribute-extraction-platform — the production pattern in which cascade is a pipeline primitive.
- patterns/low-confidence-to-human-review — the human-escalation cascade variant.
- systems/instacart-parse — canonical production instance.