PATTERN Cited by 2 sources
Multi-step LLM extraction pipeline¶
The pattern¶
Decompose a large-scale LLM-driven information-extraction job into a sequence of narrow LLM invocations, each with a focused prompt over a focused input slice, instead of a single one-shot LLM call that tries to extract everything at once. Compose the steps with:
- Status-based checkpointing at per-record granularity, so re-runs resume from the failure point without re-paying the LLM cost on already-processed rows.
- A configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), so adding / changing extractions is configuration, not code.
- A star schema state model — central fact table holds per-record extraction state; dimension tables hold reference data. Queryable, evolvable, restartable.
The canonical wiki instance is the VF Match Foundational Data Refresh pipeline, which processes 25M+ web pages through OpenAI GPT models to build a global healthcare-facility / NGO catalog.
The pipeline shape¶
Raw web pages (Bright Data + Overture Maps)
│
▼
┌──────────────────────────────────┐
│ Step 1: Classify medical-relevance│
│ (cheap LLM call, narrow prompt) │
└──────────────────────────────────┘
│ relevant │ not-relevant
▼ ▼
┌────────────┐ ┌────────────┐
│ Step 2: │ │ DONE │
│ org-type │ │ (no further│
│ classifier │ │ cost) │
└────────────┘ └────────────┘
│
▼
┌──────────────────────────────────┐
│ Step 3: Extract specialties / │
│ equipment / procedures │
│ (expensive LLM call, schema- │
│ constrained output) │
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Star-schema fact table: │
│ per-record state + │
│ per-step status columns + │
│ extracted-output FKs │
└──────────────────────────────────┘
(Source: sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries)
Why this beats one-shot¶
- Cheap-step gating expensive-step. A medical-relevance flag on every web page is cheap; full specialty / equipment / procedure extraction is expensive. Gating the expensive step on the cheap step's output collapses the dominant cost on the negative class (most pages aren't medically relevant).
- Narrow-prompt precision. A prompt asking "is this medically relevant" alone outperforms the same model in a multi-decision prompt. Splitting prompts splits attention budget across decisions; you don't dilute one decision's accuracy with the other's.
- Independent prompt iteration. When the org-type classifier's precision drops, you tune that step's prompt without touching the relevance step.
- Inspectable failure modes. When the output is wrong, the step at which it went wrong is identifiable by status column.
- Dramatic token reduction. Verbatim from VF Match: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
Required substrate¶
- Per-record state column. Each step's status (NOT_STARTED / RUNNING / DONE / FAILED) is a column on the fact table. See concepts/status-based-llm-pipeline-checkpointing.
- Idempotent step execution. Re-running a row through a step must produce the same output (or fail safely). Schema-constrained output (concepts/schema-constrained-llm-output) is the standard discipline.
- Configurable extraction registry. A map from step name → (system prompt, schema, model endpoint, retry policy). Adding a new extraction ("extract NGO funding sources") is a config row, not a code change.
- Orchestration substrate. Lakeflow Jobs / Airflow / similar — to run the steps in sequence with conditional branching, parallel execution where possible, and intelligent retry.
- Star-schema state model. concepts/star-schema — the fact-table-with-dimensions shape that makes queries against pipeline state trivial.
Composes with¶
- patterns/two-pass-classify-then-deep-extract — sibling pattern at scanned-document altitude (MapAid groundwater pipeline). A specialisation that adds intelligent sampling on the cheap pass.
- patterns/visual-first-document-extraction — orthogonal: what modality the steps process. Multi-step is how many calls; visual-first is what input per call.
- patterns/llm-judge-as-inline-pipeline-stage — composes cleanly: one of the steps in a multi-step pipeline can be an LLM judge that scores the previous step's output.
- patterns/sql-native-multimodal-llm-inference —
composes via Databricks AI
Functions (
ai_query) — each step expressed as a SQL function call inside a DataFrame transformation. - patterns/structural-deterministic-logical-llm-split — cousin pattern at code-migration altitude (Deutsche Börse Zeppelin Converter). Both decompose a heterogeneous task into a sequence; difference is the structural-vs-logical split is a binary stage decision while multi-step LLM extraction has N homogeneous steps with conditional gating.
When applies / doesn't fit¶
Applies when¶
- Workload is web-scale or document-scale (millions of records).
- LLM-call cost is the dominant pipeline cost.
- Extraction has a natural decomposition with at least one cheap classification step that gates an expensive extraction step.
- Pipeline must be resumable and incremental.
- Operators want to add new extraction targets without code deploys.
Doesn't fit when¶
- Workload is small (fits in one prompt's context window).
- Per-record cost is dominated by I/O, not LLM call.
- The decomposition is artificial — every input requires every step.
- Real-time / single-record use case (status-checkpointing is noise overhead).
Failure modes¶
- Step-boundary leaks. A too-aggressive cheap classifier filters out valid records before the expensive step sees them. Mitigation: tune classifier recall conservatively.
- State enum sprawl. Every new step adds new status values; without discipline the state model becomes a debugging hell. Mitigation: keep status enums small and reused across steps.
- Configurable-registry drift. When the extraction registry is changed, historical fact-table outputs may no longer match current schema. Mitigation: registry versioning + per-record schema-version FK.
- No closed quality-feedback loop. Without monitoring per-step precision / recall, regressions in upstream classifier silently poison downstream extraction. Mitigation: an inline LLM judge (patterns/llm-judge-as-inline-pipeline-stage) on each step.
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. VF Match Foundational Data Refresh: three-step decomposition (classify medical relevance → identify org type → extract specialties / equipment / procedures), 25M+ web pages, status-based checkpointing, configurable extraction registry, star-schema state, Lakeflow Jobs orchestration of 15+ tasks with conditional branching + retry. Verbatim claim: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
- sources/2026-05-11-databricks-unlocking-the-archives —
sibling instance at scanned-document altitude. MapAid
groundwater archive: two-pass classify-then-deep-extract is
a specialisation of multi-step extraction with intelligent
sampling on the cheap pass, multimodal
ai_querycalls per step, inline LLM judge per stage.
Related¶
- concepts/multi-step-llm-extraction — the core principle this pattern operationalises.
- concepts/status-based-llm-pipeline-checkpointing — the resumability sub-property.
- concepts/star-schema — the state-model substrate.
- concepts/schema-constrained-llm-output — the per-step output discipline.
- patterns/two-pass-classify-then-deep-extract — sibling pattern at document-extraction altitude.
- patterns/llm-judge-as-inline-pipeline-stage — composes for per-step quality monitoring.
- systems/databricks-ai-functions / systems/lakeflow-jobs — Databricks-stack substrate for this pattern.