PATTERN
Multi-stage LLM pipeline over large context¶
Intent¶
When reasoning about a large document corpus with LLMs, chain multiple narrow single-objective LLM stages rather than stuffing all documents into a single large-context prompt. The multi-stage shape:
- Bounds each stage's input size — avoids the lost-in-the-middle failure mode.
- Bounds each stage's output size — keeps outputs human-inspectable for curation.
- Bounds each stage's objective — each prompt does one thing, so negative example discipline targets one failure mode at a time.
- Enables deliberate model tiering — cheap / fast models for early stages, frontier tier for the synthesis stage.
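The tiering intent can be captured as a simple stage-to-model routing table. A minimal sketch — the model names below are placeholders, not the specific models named elsewhere on this page:

```python
# Illustrative stage-to-model tiering (model names are placeholders).
# Cheap models serve the bounded per-document map stages; only the
# cross-document synthesis fold gets frontier-tier capability.
STAGE_MODELS = {
    "extract":    "cheap-small-model",   # 1 document -> fielded summary
    "classify":   "cheap-small-model",   # summary + taxonomy -> tags
    "interpret":  "mid-tier-model",      # summary -> bounded digest
    "synthesize": "frontier-model",      # all digests -> one-pager
}

def model_for(stage: str) -> str:
    """Look up which model tier serves a given pipeline stage."""
    return STAGE_MODELS[stage]
```

Because the routing lives in one table, swapping a stage's model (as Zalando did across three model generations) touches configuration, not stage boundaries.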
Canonicalised by Zalando's datastore-team postmortem analysis pipeline (2025-09-24), which made the architectural trade-off explicit:
"We designed a multi-stage LLM pipeline instead of using high-end LLMs with large context windows. It is a deliberate design trade-off aimed at simplicity and reliability." (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)
When to use¶
- Corpus-scale workload (hundreds to tens of thousands of documents).
- Strategic / offline altitude — the output feeds leadership decision-making, not a low-latency on-call path.
- Reliability matters more than maximum capability — human curators must be able to inspect intermediate outputs.
- Each document has internal structure you can extract independently, and only the cross-document synthesis requires aggregation.
When NOT to use¶
- Genuinely unstructured reasoning where the model needs to hold the full corpus in context to reason across it. Pipeline staging forces a specific decomposition upfront.
- Real-time / low-latency — pipeline stages add latency; a single large-context prompt may be faster end-to-end even if per-token cost is higher.
- Small N (single-digit documents) — the pipeline scaffolding isn't justified by the payoff.
Structure¶
Five stages in the Zalando pipeline, generalisable:
| Stage | Objective | Input scale | Output scale |
|---|---|---|---|
| Extract (map) | Per-document structured extraction | 1 document | Compact fielded summary |
| Classify | Per-document tagging against a taxonomy | 1 summary + taxonomy | Tag set or None |
| Interpret | Per-document causal digest (bounded length) | 1 classified summary | ≤ 5-sentence digest |
| Synthesize (fold) | Cross-document pattern discovery | All digests | One-pager pattern list |
| Act | Convert pattern list to investment / decision recommendations | Pattern list + auxiliary data | Human-authored proposal |
The first three stages are the map in a map-fold composition; the last two are the fold (first LLM, then human).
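The map-fold shape can be sketched in a few lines. Here `llm(stage, payload)` stands in for any LLM client call; the stage names follow the table above, but the function signatures and drop-out behaviour are illustrative, not Zalando's actual code:

```python
def run_pipeline(documents, llm):
    """Map three per-document stages, then fold all digests into one output."""
    digests = []
    for doc in documents:                  # --- map: per-document stages
        summary = llm("extract", doc)      # 1 doc -> fielded summary
        tags = llm("classify", summary)    # summary + taxonomy -> tags or None
        if tags is None:
            continue                       # out-of-taxonomy docs drop out early
        digest = llm("interpret", summary) # summary -> <=5-sentence digest
        digests.append(digest)
    patterns = llm("synthesize", digests)  # --- fold: cross-document synthesis
    return patterns                        # the final "Act" stage stays human
```

Note how each stage only ever sees its own bounded input — one document, one summary, or the digest list — which is what keeps the per-stage context small.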
Structural properties¶
- Each stage's prompt is a TELeR-maximal instance. See concepts/teler-prompt-framework. Specifically: single-turn, explicitly structured output, maximal level of details (task + schema + constraints + negative examples), and a role-implicit frame.
- Each stage's output is a defined artefact class. Not free-form text — a fielded summary, a tag, a bounded digest, a one-pager. Artefact shape drives the prompt structure.
- Each stage can be swapped independently. Zalando's pipeline went through at least three per-stage model generations (NotebookLM → open-source 3B/12B/27B → Claude Sonnet 4 on Bedrock) without changing the stage boundaries.
- Human curation attaches to stages, not end-to-end. Curation focuses on the digest stage (map output) during development and on the one-pager stage (fold output) at maturity. See patterns/human-in-the-loop-quality-sampling for the rate-over-time schedule.
Participants¶
- Stage LLMs. One per stage; can be same or different model. Zalando initially used 3B / 12B / 27B split across stages; current iteration uses Claude Sonnet 4 on Bedrock uniformly.
- Stage prompts. One per stage — TELeR-maximal with stage-specific negative examples.
- Intermediate artefact store. Typically a filesystem / object store holding the per-stage outputs. Lets human curators inspect before next stage, lets pipeline resume from any stage, lets outputs be reused across different fold prompts.
- Human curators. 100% sampling during development, 10–20% at maturity, with human proofreading of the final fold output throughout.
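The artefact store needs very little machinery: a filesystem layout such as `artefacts/<stage>/<doc_id>` is enough to get inspectability, resume-from-any-stage, and reuse across fold prompts. A minimal sketch (the layout and class are illustrative, not Zalando's):

```python
from pathlib import Path

class ArtefactStore:
    """Per-stage, per-document outputs on disk: <root>/<stage>/<doc_id>.txt."""

    def __init__(self, root):
        self.root = Path(root)

    def path(self, stage: str, doc_id: str) -> Path:
        return self.root / stage / f"{doc_id}.txt"

    def save(self, stage: str, doc_id: str, text: str) -> None:
        p = self.path(stage, doc_id)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(text)

    def load(self, stage: str, doc_id: str):
        p = self.path(stage, doc_id)
        return p.read_text() if p.exists() else None  # None => stage not yet run
```

Resume falls out of `load` returning `None`: a runner skips any stage whose output already exists and re-executes only the missing ones, and a curator can open the same files directly.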
Consequences¶
- Latency is additive across stages per document. Zalando reports ~30 s per document on Claude Sonnet 4 for the full pipeline, enabling "processing of annual data analysis in under 24 hours."
- Cost is multiplicative. Every document is processed N times (once per map stage), so per-document cost is ~N × the single-prompt cost. The multiplied cost buys two things: avoided lost-in-the-middle failures, and intermediate outputs bounded enough that humans need not review every document.
- Debuggability is high. Any bad fold-stage output can be traced to the specific digest that's wrong; any bad digest can be traced to the specific summary it was built on; any bad summary can be traced to the postmortem source. Contrast: a single large-context prompt's output error has no locality.
- Surface-attribution errors still compound. Each stage can commit a surface-attribution error at a ~10% rate at Claude Sonnet 4 tier; the pipeline does not eliminate this, but it surfaces the error in a form a human can catch.
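As a back-of-the-envelope check, assuming stage errors are independent (an assumption the source does not make; it only discloses the ~10% per-stage rate), a ~10% rate across the three map stages compounds as follows:

```python
# Hypothetical compounding under an independence assumption.
p_stage = 0.10                  # per-stage surface-attribution error rate
p_clean = (1 - p_stage) ** 3    # all three map stages clean
p_any_error = 1 - p_clean       # at least one map stage erred: ~0.271
```

Roughly a quarter of digests could carry at least one stage's error by fold time, which is why the curation schedule concentrates human attention on the digest and one-pager stages.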
Known uses¶
- systems/zalando-postmortem-analysis-pipeline — canonical. 5 stages (Summarization → Classification → Analyzer → Patterns → Opportunity), two-year operating horizon, disclosed outputs include 25% prevent-rate on follow-up S3 incidents. Pre-pipeline hallucination at up to 40% on small models; post-hardening on Claude Sonnet 4 rated "negligible", with the ~10% surface-attribution tail explicitly disclosed and mitigated by human curation.
- Adjacent instances already on the wiki (same shape, not same name):
- patterns/three-phase-llm-productionization — similar multi-phase ladder for productionising LLMs in search (Yelp).
- patterns/offline-teacher-online-student-distillation — map-style distillation pattern.
- patterns/rag-side-input-for-structured-extraction — structured-extraction pattern, single-stage not multi-stage.
Related¶
- concepts/map-fold-llm-pipeline — the functional composition primitive this pattern instantiates.
- concepts/lost-in-the-middle-effect — the failure mode this pattern was designed around.
- concepts/llm-hallucination — the cross-stage failure mode whose compounding the human-curation schedule addresses.
- concepts/surface-attribution-error — the residual failure mode human curators catch at the digest and one-pager stages.
- concepts/teler-prompt-framework — the per-stage prompt structure.
- patterns/negative-example-prompting — the per-stage prompt-hardening technique.
- patterns/human-in-the-loop-quality-sampling — the curation-rate schedule applied to the pipeline.
- systems/zalando-postmortem-analysis-pipeline — canonical production instance.