
PATTERN

Multi-stage LLM pipeline over large context

Intent

When reasoning about a large document corpus with LLMs, chain multiple narrow single-objective LLM stages rather than stuffing all documents into a single large-context prompt. The multi-stage shape:

  • Bounds each stage's input size — avoids the lost-in-the-middle failure mode.
  • Bounds each stage's output size — keeps outputs human-inspectable for curation.
  • Bounds each stage's objective — each prompt does one thing, so negative-example discipline can target one failure mode at a time.
  • Enables deliberate model tiering — cheap / fast models for early stages, frontier tier for the synthesis stage.

Canonicalised by Zalando's datastore-team postmortem analysis pipeline (2025-09-24), which made the architectural trade-off explicit:

"We designed a multi-stage LLM pipeline instead of using high-end LLMs with large context windows. It is a deliberate design trade-off aimed at simplicity and reliability." (Source: sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis)

When to use

  • Corpus-scale workload (hundreds to tens of thousands of documents).
  • Strategic / offline altitude — the output feeds leadership decision-making, not a low-latency on-call path.
  • Reliability matters more than maximum capability — human curators must be able to inspect intermediate outputs.
  • Each document has internal structure you can extract independently, and only the cross-document synthesis requires aggregation.

When NOT to use

  • Genuinely unstructured reasoning where the model needs to hold the full corpus in context to reason across it. Pipeline staging forces a specific decomposition upfront.
  • Real-time / low-latency — pipeline stages add latency; a single large-context prompt may be faster end-to-end even if per-token cost is higher.
  • Small N (single-digit documents) — pipeline scaffolding isn't justified by payoff.

Structure

Five stages in the Zalando pipeline, generalisable:

| Stage | Objective | Input scale | Output scale |
| --- | --- | --- | --- |
| Extract (map) | Per-document structured extraction | 1 document | Compact fielded summary |
| Classify | Per-document tagging against a taxonomy | 1 summary + taxonomy | Tag set or None |
| Interpret | Per-document causal digest (≤ N sentences) | 1 classified summary | ≤ 5-sentence digest |
| Synthesize (fold) | Cross-document pattern discovery | All digests | One-pager pattern list |
| Act | Convert pattern list to investment / decision recommendations | Pattern list + auxiliary data | Human-authored proposal |

The first three stages are the map in a map-fold composition; the last two are the fold (first LLM, then human).
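The map-fold composition can be sketched as follows. This is a minimal illustration, not Zalando's implementation: the stage functions and the `llm` callable (any function from prompt string to completion string) are assumptions, with the real prompts elided.

```python
# Hypothetical stage functions -- each wraps one narrow, single-objective
# LLM prompt. `llm` stands in for whatever client the pipeline uses.
def extract(llm, document: str) -> dict:
    """Map stage 1: per-document structured extraction."""
    return {"summary": llm(f"Extract a fielded summary:\n{document}")}

def classify(llm, summary: dict, taxonomy: list) -> dict:
    """Map stage 2: tag the summary against a fixed taxonomy (or None)."""
    tags = llm(f"Tag against {taxonomy}:\n{summary['summary']}")
    return {**summary, "tags": tags}

def interpret(llm, classified: dict) -> str:
    """Map stage 3: bounded causal digest (<= 5 sentences)."""
    return llm(f"Write a <=5-sentence causal digest:\n{classified}")

def synthesize(llm, digests: list) -> str:
    """Fold stage: cross-document pattern discovery over all digests."""
    return llm("Find recurring patterns:\n" + "\n---\n".join(digests))

def run_pipeline(llm, documents: list, taxonomy: list) -> str:
    digests = [interpret(llm, classify(llm, extract(llm, d), taxonomy))
               for d in documents]           # map: bounded input per call
    return synthesize(llm, digests)          # fold: all digests, one call
```

Note that each map call sees exactly one document's worth of context, and only the fold sees the whole (already-compressed) corpus — which is the point of the pattern. The final Act stage is human-authored and so has no function here.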

Structural properties

  • Each stage's prompt is a TELeR-maximal instance. See concepts/teler-prompt-framework. Specifically: single-turn, explicitly structured output, maximal level of details (task + schema + constraints + negative examples), and a role-implicit frame.
  • Each stage's output is a defined artefact class. Not free-form text — a fielded summary, a tag, a bounded digest, a one-pager. Artefact shape drives the prompt structure.
  • Each stage can be swapped independently. Zalando's pipeline went through at least three per-stage model generations (NotebookLM → open-source 3B/12B/27B → Claude Sonnet 4 on Bedrock) without changing the stage boundaries.
  • Human curation attaches to stages, not end-to-end. Curation focuses on the digest stage (map output) during development and on the one-pager stage (fold output) at maturity. See patterns/human-in-the-loop-quality-sampling for the rate-over-time schedule.
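What a TELeR-maximal stage prompt might look like, sketched for the interpret stage. The schema, constraints, and negative example below are illustrative assumptions, not Zalando's actual prompt text; only the structural shape (single turn, explicit schema, constraints, negative example) follows the pattern.

```python
# Illustrative TELeR-maximal prompt for the interpret stage: one objective,
# an explicit output schema, constraints, and a stage-specific negative
# example. All wording here is an assumption for the sketch.
INTERPRET_PROMPT = """\
You are analysing one classified postmortem summary.

Task: write a causal digest of at most 5 sentences.

Output schema (JSON):
  {{"digest": "<digest text>", "confidence": "high|medium|low"}}

Constraints:
  - Name the proximate cause and the main contributing cause.
  - Do not propose remediations; that is a later stage's job.

Negative example (do NOT produce output like this):
  "The incident happened because of a bug."  (too vague: no causal chain)

Summary:
{summary}
"""

def render_interpret_prompt(summary: str) -> str:
    # Single-turn: the entire instruction set travels in one prompt.
    return INTERPRET_PROMPT.format(summary=summary)
```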

Participants

  • Stage LLMs. One per stage; can be same or different model. Zalando initially used 3B / 12B / 27B split across stages; current iteration uses Claude Sonnet 4 on Bedrock uniformly.
  • Stage prompts. One per stage — TELeR-maximal with stage-specific negative examples.
  • Intermediate artefact store. Typically a filesystem / object store holding the per-stage outputs. It lets human curators inspect outputs before the next stage runs, lets the pipeline resume from any stage, and lets outputs be reused across different fold prompts.
  • Human curators. 100% during development, 10–20% at maturity, human proofreading of final fold output always.
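A minimal sketch of the intermediate artefact store, assuming a plain-filesystem layout (one subdirectory per stage, one JSON file per document). The class and layout are assumptions for illustration; the properties it demonstrates — curator inspection between stages and resume-from-any-stage — are the ones named above.

```python
import json
from pathlib import Path

class ArtefactStore:
    """One subdirectory per stage, one file per document, so curators can
    open any stage's output directly and the pipeline can resume."""

    def __init__(self, root: str):
        self.root = Path(root)

    def save(self, stage: str, doc_id: str, artefact: dict) -> None:
        path = self.root / stage / f"{doc_id}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(artefact, indent=2))

    def load(self, stage: str, doc_id: str):
        path = self.root / stage / f"{doc_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def done(self, stage: str, doc_id: str) -> bool:
        # A stage is skippable on resume if its artefact already exists.
        return (self.root / stage / f"{doc_id}.json").exists()
```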

Consequences

  • Latency is additive across stages per document. Zalando reports ~30 s per document on Claude Sonnet 4 for the full pipeline, enabling "processing of annual data analysis in under 24 hours."
  • Cost is multiplicative in stage count. Every document is processed once per map stage, so per-document cost is roughly N × the single-prompt cost for N map stages. In exchange, the pipeline avoids lost-in-the-middle failures and avoids human review at per-document granularity.
  • Debuggability is high. Any bad fold-stage output can be traced to the specific digest that's wrong; any bad digest can be traced to the specific summary it was built on; any bad summary can be traced to the postmortem source. Contrast: a single large-context prompt's output error has no locality.
  • Surface-attribution errors still compound. Each stage can commit surface-attribution errors (~10% rate at Claude Sonnet 4 tier); the pipeline does not eliminate them, but it surfaces them in a form a human can catch.
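The latency figure can be sanity-checked with back-of-envelope arithmetic. The ~30 s/document number is from the source; the 2,000-document corpus size is an assumed example, since the actual corpus size is not stated.

```python
# Back-of-envelope throughput check for sequential processing.
SECONDS_PER_DOC = 30        # reported by Zalando for the full pipeline
corpus_size = 2_000         # assumed example corpus size

hours = corpus_size * SECONDS_PER_DOC / 3600
print(f"{hours:.1f} hours")  # prints "16.7 hours" -- under 24 h, as claimed
```

Parallelising the map stages would shrink this further, since per-document processing is independent until the fold.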

Known uses
