Cascaded LLM generation¶
Definition¶
Cascaded LLM generation is the architectural pattern of decomposing a single generative task into multiple sequential LLM phases, each with a narrower input, a targeted prompt, and potentially a different model — rather than handing the whole task to one all-in-one LLM call.
The name contrasts with two adjacent wiki concepts:
- concepts/llm-cascade — a cost-vs-quality routing cascade where a cheap LLM runs first and an expensive one runs only on low-confidence outputs. Same "cascade" word, different axis.
- Monolithic generation — one LLM call with the full task context.
Cascaded LLM generation is about decomposing the task itself, not about routing between models by confidence. Each phase has a distinct responsibility and a distinct context.
Shape¶
[raw input] ──► LLM phase 1 ──► intermediate artefact 1
                                        │
                                        ▼
                                  LLM phase 2 ──► intermediate artefact 2
                                                          │
                                                          ▼
                                                    LLM phase 3 ──► final output
Each phase can use a different model, a different prompt, a different retrieval strategy, and a different evaluation gate. The intermediate artefacts are the seams.
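This shape can be sketched as runnable pseudocode. Everything here is illustrative, not from the source: the model names, prompts, and the `call_llm` helper are hypothetical stand-ins. The point is that each phase is a separate call with its own prompt and its own model, and each intermediate artefact is a seam where caching, inspection, or a quality gate can plug in.

```python
def call_llm(model: str, prompt: str) -> str:
    # Stand-in for a real LLM client; echoes a tagged string so the
    # control flow runs without any model behind it.
    return f"<{model}>{prompt}</{model}>"

def phase1_structure(raw_input: str) -> str:
    # Narrow responsibility: structure only (e.g. ordered themes).
    return call_llm("frontier-model", f"structure for: {raw_input}")

def phase2_content(artefact1: str) -> str:
    # Different model, narrower context: only the phase-1 artefact.
    return call_llm("distilled-model", f"content for: {artefact1}")

def phase3_filter(artefact2: str) -> str:
    # A third model with a third responsibility: filtering.
    return call_llm("judge-model", f"filter: {artefact2}")

def cascade(raw_input: str) -> str:
    artefact1 = phase1_structure(raw_input)  # seam 1
    artefact2 = phase2_content(artefact1)    # seam 2
    return phase3_filter(artefact2)
```

A monolithic design collapses all three functions into one `call_llm` invocation, which is exactly what removes the seams.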
Why decomposition is a cost + quality move¶
The most reusable articulation from the 2026-02-26 Instacart post:
"We ultimately found great value in decomposing generation into multiple targeted tasks. This opened the door to using retrieval-augmented generation (RAG) and other techniques that aren't feasible in a single-step model, enabling us to achieve higher quality while improving cost efficiency."
Three specific opportunities cascade-decomposition unlocks:
- RAG candidate pruning between phases — Phase 1 emits freeform concepts, Phase 2 uses embedding similarity to prune a large candidate corpus, Phase 2's LLM sees only the pruned subset. Not feasible in a single-step design because the concept emission and the candidate selection happen in the same forward pass.
- Teacher-student distillation per-phase — different phases have different quality + cost profiles. Phase 1 can use a frontier model (smaller per-user cost, structural output); Phase 2 can be a distilled smaller model optimized for the narrower task. Single-step design forces one model choice for everything.
- Per-phase evaluation + filtering — quality gates can be inserted between phases at their natural decision boundaries. LLM-as-judge at Phase 1 output; cross-encoder filtering at Phase 3 output. A monolithic generator has no intermediate seams where filters can plug in.
The Instacart Shopping Hub instance¶
Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms
Instacart's 4-phase Shopping Hub cascade is the canonical wiki instance:
- Phase 1: page design + theme generation (LLM + constrained decoding) → ordered themes + derived signals.
- Phase 2: retrieval keyword generation (fine-tuned student model from teacher-student distillation + RAG candidate pruning) → retrieval-compatible descriptors.
- Phase 3: quality + diversity filtering (embedding dedup + LLM-as-judge + DeBERTa cross-encoder + policy guardrails).
- Phase 4: existing ranking stack (unchanged).
Phases 1-3 form the generative content pipeline; Phase 4 is pre-existing infrastructure. See patterns/top-down-cascaded-page-generation.
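The Phase 3 gate stack can be sketched as a chain of composable filters. Hedged heavily: the gate functions below are trivial stand-ins (a set-based dedup key, a length heuristic for the judge, a keyword blocklist for policy), not Instacart's embedding dedup, LLM-as-judge, or DeBERTa cross-encoder. The structural point is that decomposition exposes a seam where each gate plugs in and can be tested independently.

```python
def dedup_gate(items):
    # Crude near-duplicate key; a real system would use embeddings.
    seen, out = set(), []
    for item in items:
        key = frozenset(item.lower().split())
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out

def judge_gate(items, min_words=3):
    # Stand-in for LLM-as-judge: here, a trivial length heuristic.
    return [i for i in items if len(i.split()) >= min_words]

def policy_gate(items, banned=("alcohol",)):
    # Stand-in for policy guardrails.
    return [i for i in items if not any(b in i.lower() for b in banned)]

def run_gates(items, gates):
    for gate in gates:
        items = gate(items)
    return items

themes = ["Summer Fruit Picks", "summer fruit picks", "Snacks",
          "Craft Alcohol Deals", "Back to School Lunches"]
kept = run_gates(themes, [dedup_gate, judge_gate, policy_gate])
```

Each gate can also be A/B-tested or swapped for a stronger model without touching the generation phases upstream.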
The explicit economic number: the cascade-specific RAG-pruning step cuts all-in generation cost by 15-20% per generation versus a single-step design, which would have to pass the full 300K-term keyword corpus into the prompt to maintain precision.
When cascaded generation fits¶
- The task decomposes cleanly along sub-task boundaries. Structure generation → content generation → filtering is clean; some tasks don't have clean seams.
- Different sub-tasks have different cost + quality profiles. Some sub-tasks benefit from frontier models; others run fine on distilled small models. Decomposition lets you match the model to the sub-task.
- Cost is a first-order concern. Decomposition's overhead pays off when RAG / distillation / per-phase filtering reduce cost substantially.
- Intermediate artefacts have independent value. Phase 1's themes are cacheable, reusable, inspectable independently from the final output — worth having as first-class entities.
When monolithic generation is better¶
- Sub-task boundaries are artificial. Forced decomposition where the model benefits from seeing everything jointly.
- Latency dominates cost. Sequential LLM phases sum latency; a single call is one forward pass.
- Operational simplicity dominates. One model, one prompt, one deploy is easier to maintain than N phases each with its own lifecycle.
Failure modes¶
- Inter-phase context loss. Phase 2 doesn't have all the context Phase 1 had; quality drops on edge cases that needed the full joint view.
- Inter-phase cascading errors. Phase 1's mistake propagates into Phase 2; no way to recover downstream.
- Latency summation. Each phase adds its own latency; the cascade is slower than a single call by 2-4×.
- Operational complexity. Each phase is a separate model lifecycle; deploys, A/B tests, and incidents multiply.
Relation to sibling concepts¶
- concepts/llm-cascade — the cost-routing sibling. Same word "cascade," different axis. LLM cascade routes the same task between cheap + expensive models by confidence; cascaded LLM generation routes different sub-tasks of a decomposed task to different phases. The two can compose (each phase of a cascaded-generation pipeline can internally be an LLM cascade).
- concepts/retrieval-augmented-generation — RAG is usually described at the single-call level; cascaded generation extends RAG to the inter-phase level, where retrieval happens between phases using intermediate phase-1 outputs as the query.
- concepts/generative-recommendations — the most common production domain for cascaded generation today.
- concepts/cascades-llm-inference — latency-optimization cascade at the inference layer (drafter + expert). Different axis from task decomposition but closely related name.
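The composition point — each phase of a cascaded-generation pipeline can internally be an LLM cascade — can be sketched as below. The models and the confidence heuristic are hypothetical: a real system would use calibrated model confidence, not prompt length.

```python
def cheap_model(prompt: str):
    # Stand-in for a small model that returns (answer, confidence).
    # Toy heuristic: pretend short prompts are high-confidence.
    return f"cheap:{prompt}", (0.9 if len(prompt) < 20 else 0.3)

def expensive_model(prompt: str) -> str:
    # Stand-in for a frontier model.
    return f"frontier:{prompt}"

def phase_with_internal_cascade(prompt: str, threshold: float = 0.7) -> str:
    # This whole function is ONE phase of a cascaded-generation pipeline;
    # inside it, an llm-cascade routes by confidence.
    answer, conf = cheap_model(prompt)
    if conf >= threshold:
        return answer                 # cheap output accepted
    return expensive_model(prompt)    # escalate only on low confidence
```

The two cascade axes stay orthogonal: the outer pipeline routes *sub-tasks* to phases; the inner cascade routes the *same sub-task* between models.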
Seen in¶
- sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — canonical wiki instance at Instacart Shopping Hub. Four-phase cascade; RAG + teacher-student + cross-encoder filtering are the cost-shape moves the decomposition enables.
Related¶
- concepts/top-down-vs-bottoms-up-generation — the design choice that determines what the cascade looks like.
- concepts/llm-cascade — the cost-routing sibling.
- concepts/generative-recommendations — the domain.
- concepts/retrieval-augmented-generation — inter-phase RAG.
- patterns/top-down-cascaded-page-generation — canonical production pattern.
- patterns/rag-candidate-pruning-cascade — the specific inter-phase cost-shape move.
- patterns/teacher-student-model-compression — per-phase model-size move.
- systems/instacart-generative-recommendations-platform — canonical production consumer.
- companies/instacart