
PATTERN

Multi-stage LLM pipeline over large context

Intent

When reasoning about a large document corpus with LLMs, chain multiple narrow single-objective LLM stages rather than stuffing all documents into a single large-context prompt. The multi-stage shape:

  • Bounds each stage's input size — avoids the lost-in-the-middle failure mode.
  • Bounds each stage's output size — keeps outputs human-inspectable for curation.
  • Bounds each stage's objective — each prompt does one thing, so negative-example discipline can target one failure mode at a time.
  • Enables deliberate model tiering — cheap / fast models for early stages, frontier tier for the synthesis stage.

Canonicalised by Zalando's datastore-team postmortem analysis pipeline (2025-09-24), which made the architectural trade-off explicit:

"We designed a multi-stage LLM pipeline instead of using high-end LLMs with large context windows. It is a deliberate design trade-off aimed at simplicity and reliability." (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)

When to use

  • Corpus-scale workload (hundreds to tens of thousands of documents).
  • Strategic / offline altitude — the output feeds leadership decision-making, not a low-latency on-call path.
  • Reliability matters more than maximum capability — human curators must be able to inspect intermediate outputs.
  • Each document has internal structure you can extract independently, and only the cross-document synthesis requires aggregation.

When NOT to use

  • Genuinely unstructured reasoning where the model needs to hold the full corpus in context to reason across it. Pipeline staging forces a specific decomposition upfront.
  • Real-time / low-latency — pipeline stages add latency; a single large-context prompt may be faster end-to-end even if per-token cost is higher.
  • Small N (single-digit documents) — pipeline scaffolding isn't justified by payoff.

Structure

Five stages in the Zalando pipeline, generalisable:

| Stage | Objective | Input scale | Output scale |
| --- | --- | --- | --- |
| Extract (map) | Per-document structured extraction | 1 document | Compact fielded summary |
| Classify | Per-document tagging against a taxonomy | 1 summary + taxonomy | Tag set or None |
| Interpret | Per-document causal digest (≤ N sentences) | 1 classified summary | ≤ 5-sentence digest |
| Synthesize (fold) | Cross-document pattern discovery | All digests | One-pager pattern list |
| Act | Convert pattern list to investment / decision recommendations | Pattern list + auxiliary data | Human-authored proposal |

The first three stages are the map in a map-fold composition; the last two are the fold (first LLM, then human).
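The map-fold composition can be sketched as follows. This is a minimal illustration, not Zalando's implementation: the stage functions and the `llm` callable (any function from prompt string to completion string) are assumptions, with the real prompts elided.

```python
# Hypothetical stage functions -- each wraps one narrow, single-objective
# LLM prompt. `llm` stands in for whatever client the pipeline uses.
def extract(llm, document: str) -> dict:
    """Map stage 1: per-document structured extraction."""
    return {"summary": llm(f"Extract a fielded summary:\n{document}")}

def classify(llm, summary: dict, taxonomy: list) -> dict:
    """Map stage 2: tag the summary against a fixed taxonomy (or None)."""
    tags = llm(f"Tag against {taxonomy}:\n{summary['summary']}")
    return {**summary, "tags": tags}

def interpret(llm, classified: dict) -> str:
    """Map stage 3: bounded causal digest (<= 5 sentences)."""
    return llm(f"Write a <=5-sentence causal digest:\n{classified}")

def synthesize(llm, digests: list) -> str:
    """Fold stage: cross-document pattern discovery over all digests."""
    return llm("Find recurring patterns:\n" + "\n---\n".join(digests))

def run_pipeline(llm, documents: list, taxonomy: list) -> str:
    digests = [interpret(llm, classify(llm, extract(llm, d), taxonomy))
               for d in documents]           # map: bounded input per call
    return synthesize(llm, digests)          # fold: all digests, one call
```

Note that each map call sees exactly one document's worth of context, and only the fold sees the whole (already-compressed) corpus — which is the point of the pattern. The final Act stage is human-authored and so has no function here.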

Structural properties

  • Each stage's prompt is a TELeR-maximal instance. See concepts/teler-prompt-framework. Specifically: single-turn, explicitly structured output, maximal level of details (task + schema + constraints + negative examples), and a role-implicit frame.
  • Each stage's output is a defined artefact class. Not free-form text — a fielded summary, a tag, a bounded digest, a one-pager. Artefact shape drives the prompt structure.
  • Each stage can be swapped independently. Zalando's pipeline went through at least three per-stage model generations (NotebookLM → open-source 3B/12B/27B → Claude Sonnet 4 on Bedrock) without changing the stage boundaries.
  • Human curation attaches to stages, not end-to-end. Curation focuses on the digest stage (map output) during development and on the one-pager stage (fold output) at maturity. See patterns/human-in-the-loop-quality-sampling for the rate-over-time schedule.
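What a TELeR-maximal stage prompt might look like, sketched for the interpret stage. The schema, constraints, and negative example below are illustrative assumptions, not Zalando's actual prompt text; only the structural shape (single turn, explicit schema, constraints, negative example) follows the pattern.

```python
# Illustrative TELeR-maximal prompt for the interpret stage: one objective,
# an explicit output schema, constraints, and a stage-specific negative
# example. All wording here is an assumption for the sketch.
INTERPRET_PROMPT = """\
You are analysing one classified postmortem summary.

Task: write a causal digest of at most 5 sentences.

Output schema (JSON):
  {{"digest": "<digest text>", "confidence": "high|medium|low"}}

Constraints:
  - Name the proximate cause and the main contributing cause.
  - Do not propose remediations; that is a later stage's job.

Negative example (do NOT produce output like this):
  "The incident happened because of a bug."  (too vague: no causal chain)

Summary:
{summary}
"""

def render_interpret_prompt(summary: str) -> str:
    # Single-turn: the entire instruction set travels in one prompt.
    return INTERPRET_PROMPT.format(summary=summary)
```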

Participants

  • Stage LLMs. One per stage; can be same or different model. Zalando initially used 3B / 12B / 27B split across stages; current iteration uses Claude Sonnet 4 on Bedrock uniformly.
  • Stage prompts. One per stage — TELeR-maximal with stage-specific negative examples.
  • Intermediate artefact store. Typically a filesystem / object store holding the per-stage outputs. It lets human curators inspect outputs before the next stage runs, lets the pipeline resume from any stage, and lets outputs be reused across different fold prompts.
  • Human curators. 100% during development, 10–20% at maturity, human proofreading of final fold output always.
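A minimal sketch of the intermediate artefact store, assuming a plain-filesystem layout (one subdirectory per stage, one JSON file per document). The class and layout are assumptions for illustration; the properties it demonstrates — curator inspection between stages and resume-from-any-stage — are the ones named above.

```python
import json
from pathlib import Path

class ArtefactStore:
    """One subdirectory per stage, one file per document, so curators can
    open any stage's output directly and the pipeline can resume."""

    def __init__(self, root: str):
        self.root = Path(root)

    def save(self, stage: str, doc_id: str, artefact: dict) -> None:
        path = self.root / stage / f"{doc_id}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(artefact, indent=2))

    def load(self, stage: str, doc_id: str):
        path = self.root / stage / f"{doc_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def done(self, stage: str, doc_id: str) -> bool:
        # A stage is skippable on resume if its artefact already exists.
        return (self.root / stage / f"{doc_id}.json").exists()
```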

Consequences

  • Latency is additive across stages per document. Zalando reports ~30 s per document on Claude Sonnet 4 for the full pipeline, enabling "processing of annual data analysis in under 24 hours."
  • Cost is multiplicative in stage count. Every document is processed once per map stage, so per-document cost is roughly N × the single-prompt cost for N map stages. In exchange, the pipeline avoids lost-in-the-middle failures and avoids human review at per-document granularity.
  • Debuggability is high. Any bad fold-stage output can be traced to the specific digest that's wrong; any bad digest can be traced to the specific summary it was built on; any bad summary can be traced to the postmortem source. Contrast: a single large-context prompt's output error has no locality.
  • Surface-attribution errors still compound. Each stage can commit surface-attribution errors (~10% rate at Claude Sonnet 4 tier); the pipeline does not eliminate them, but it surfaces them in a form a human can catch.
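The latency figure can be sanity-checked with back-of-envelope arithmetic. The ~30 s/document number is from the source; the 2,000-document corpus size is an assumed example, since the actual corpus size is not stated.

```python
# Back-of-envelope throughput check for sequential processing.
SECONDS_PER_DOC = 30        # reported by Zalando for the full pipeline
corpus_size = 2_000         # assumed example corpus size

hours = corpus_size * SECONDS_PER_DOC / 3600
print(f"{hours:.1f} hours")  # prints "16.7 hours" -- under 24 h, as claimed
```

Parallelising the map stages would shrink this further, since per-document processing is independent until the fold.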

Known uses
