PATTERN Cited by 2 sources

Multi-step LLM extraction pipeline¶

The pattern¶

Decompose a large-scale LLM-driven information-extraction job into a sequence of narrow LLM invocations, each with a focused prompt over a focused input slice, instead of a single one-shot LLM call that tries to extract everything at once. Compose the steps with:

Status-based checkpointing at per-record granularity, so re-runs resume from the failure point without re-paying the LLM cost on already-processed rows.
A configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), so adding / changing extractions is configuration, not code.
A star schema state model — central fact table holds per-record extraction state; dimension tables hold reference data. Queryable, evolvable, restartable.
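The registry idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ExtractionMethod` class, the `gated_by` field, the step names, the prompts and the schemas are hypothetical and are not taken from the VF Match pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionMethod:
    """One registry entry: everything a step needs, expressed as data."""
    name: str
    system_prompt: str
    output_schema: dict              # JSON-Schema-style shape of the expected output
    gated_by: Optional[str] = None   # hypothetical: step that must pass first


# Hypothetical registry contents; prompts and schemas are illustrative only.
REGISTRY = {
    "medical_relevance": ExtractionMethod(
        name="medical_relevance",
        system_prompt="Decide whether the page describes a medical organisation.",
        output_schema={"type": "object", "required": ["relevant"]},
    ),
    "org_type": ExtractionMethod(
        name="org_type",
        system_prompt="Classify the organisation type.",
        output_schema={"type": "object", "required": ["org_type"]},
        gated_by="medical_relevance",
    ),
    "specialties": ExtractionMethod(
        name="specialties",
        system_prompt="List specialties, equipment and procedures.",
        output_schema={"type": "object", "required": ["specialties"]},
        gated_by="org_type",
    ),
}


def execution_order(registry):
    """Order steps so that each one runs after the step that gates it."""
    ordered, remaining = [], dict(registry)
    while remaining:
        ready = [n for n, m in remaining.items()
                 if m.gated_by is None or m.gated_by in ordered]
        if not ready:
            raise ValueError("cycle or unknown gate in registry")
        for n in ready:
            ordered.append(n)
            del remaining[n]
    return ordered
```

Under these assumptions, adding an extraction means adding one more entry to `REGISTRY`; `execution_order` picks it up without any change to the runner code.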

The canonical wiki instance is the VF Match Foundational Data Refresh pipeline, which processes 25M+ web pages through OpenAI GPT models to build a global healthcare-facility / NGO catalog.

The pipeline shape¶

            Raw web pages (Bright Data + Overture Maps)
                          │
                          ▼
            ┌──────────────────────────────────┐
            │ Step 1: Classify medical-relevance│
            │  (cheap LLM call, narrow prompt)  │
            └──────────────────────────────────┘
                  │ relevant         │ not-relevant
                  ▼                  ▼
            ┌────────────┐    ┌────────────┐
            │ Step 2:    │    │ DONE       │
            │ org-type   │    │ (no further│
            │ classifier │    │  cost)     │
            └────────────┘    └────────────┘
                  │
                  ▼
            ┌──────────────────────────────────┐
            │ Step 3: Extract specialties /     │
            │  equipment / procedures           │
            │  (expensive LLM call, schema-     │
            │   constrained output)             │
            └──────────────────────────────────┘
                  │
                  ▼
            ┌──────────────────────────────────┐
            │ Star-schema fact table:           │
            │  per-record state +               │
            │  per-step status columns +        │
            │  extracted-output FKs             │
            └──────────────────────────────────┘

(Source: sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries)
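The control flow in the diagram can be sketched as follows. This is a minimal illustration, not the VF Match implementation: `run_pipeline`, the step names and the keyword-based `fake_llm` stub are all hypothetical, and a real pipeline would replace the stub with a model call.

```python
from typing import Callable

# Stand-in for a real model call: (step_name, page_text) -> parsed output dict.
LLMCall = Callable[[str, str], dict]


def run_pipeline(page_text: str, call_llm: LLMCall) -> dict:
    """Run the three steps in order, stopping early for irrelevant pages."""
    result = {"relevant": False, "org_type": None, "extraction": None, "calls": 0}

    # Step 1: cheap relevance gate.
    relevance = call_llm("medical_relevance", page_text)
    result["calls"] += 1
    if not relevance["relevant"]:
        return result  # DONE branch: no further cost
    result["relevant"] = True

    # Step 2: org-type classifier.
    result["org_type"] = call_llm("org_type", page_text)["org_type"]
    result["calls"] += 1

    # Step 3: expensive, schema-constrained extraction.
    result["extraction"] = call_llm("specialties", page_text)
    result["calls"] += 1
    return result


def fake_llm(step: str, text: str) -> dict:
    """Keyword-based stub so the sketch runs without a model endpoint."""
    lowered = text.lower()
    if step == "medical_relevance":
        return {"relevant": "clinic" in lowered}
    if step == "org_type":
        return {"org_type": "facility"}
    return {"specialties": ["cardiology"] if "cardiology" in lowered else []}
```

An irrelevant page costs one call and a relevant page costs three, which is the asymmetry the gating relies on.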

Why this beats one-shot¶

Cheap step gates the expensive step. A medical-relevance flag on every web page is cheap; full specialty / equipment / procedure extraction is expensive. Gating the expensive step on the cheap step's output removes the dominant cost for the negative class, since most pages aren't medically relevant.
Narrow-prompt precision. A prompt asking only "is this medically relevant" outperforms the same model making that decision inside a multi-decision prompt. A combined prompt splits the model's attention budget across decisions; separate prompts keep one decision's accuracy from being diluted by another's.
Independent prompt iteration. When the org-type classifier's precision drops, you tune that step's prompt without touching the relevance step.
Inspectable failure modes. When the output is wrong, the step at which it went wrong is identifiable by status column.
Dramatic token reduction. Verbatim from VF Match: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
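The cost argument can be made concrete with a small worked calculation. All numbers below are invented for illustration (the source gives no per-call token counts or relevance rate), and the two helper functions are hypothetical.

```python
def tokens_one_shot(n_pages: int, tokens_full: int) -> int:
    """Every page pays for the full multi-decision prompt."""
    return n_pages * tokens_full


def tokens_gated(n_pages: int, relevant_pages: int,
                 tokens_gate: int, tokens_org: int, tokens_extract: int) -> int:
    """Every page pays for the gate; only relevant pages pay for later steps."""
    return n_pages * tokens_gate + relevant_pages * (tokens_org + tokens_extract)


# Assumed workload: 1M pages, 10% relevant, made-up token counts per call.
one_shot = tokens_one_shot(1_000_000, tokens_full=4_500)
gated = tokens_gated(1_000_000, relevant_pages=100_000,
                     tokens_gate=500, tokens_org=800, tokens_extract=4_000)
```

With these made-up inputs the gated design consumes about 22% of the one-shot token volume. The saving shrinks as the relevant fraction grows, which is why the pattern depends on the negative class being large.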

Required substrate¶

Per-record state column. Each step's status (NOT_STARTED / RUNNING / DONE / FAILED) is a column on the fact table. See concepts/status-based-llm-pipeline-checkpointing.
Idempotent step execution. Re-running a row through a step must produce the same output (or fail safely). Schema-constrained output (concepts/schema-constrained-llm-output) is the standard discipline.
Configurable extraction registry. A map from step name → (system prompt, schema, model endpoint, retry policy). Adding a new extraction ("extract NGO funding sources") is a config row, not a code change.
Orchestration substrate. Lakeflow Jobs / Airflow / similar — to run the steps in sequence with conditional branching, parallel execution where possible, and intelligent retry.
Star-schema state model. concepts/star-schema — the fact-table-with-dimensions shape that makes queries against pipeline state trivial.
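The per-record status column and the resume behaviour can be sketched with an in-memory SQLite table. This is a simplified illustration under stated assumptions: the table name `fact_page`, its columns and the `run_step` helper are hypothetical, only one step is modelled, and the dimension tables of the star schema are omitted.

```python
import sqlite3


def init_state(conn, record_ids):
    """Create the (simplified) fact table and register records as NOT_STARTED."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_page ("
        " record_id INTEGER PRIMARY KEY,"
        " relevance_status TEXT NOT NULL DEFAULT 'NOT_STARTED',"
        " relevance_output TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO fact_page (record_id) VALUES (?)",
        [(r,) for r in record_ids],
    )
    conn.commit()


def run_step(conn, step_fn):
    """Process only rows that are not DONE; return how many rows were attempted."""
    pending = conn.execute(
        "SELECT record_id FROM fact_page WHERE relevance_status != 'DONE'"
    ).fetchall()
    for (record_id,) in pending:
        conn.execute(
            "UPDATE fact_page SET relevance_status = 'RUNNING' WHERE record_id = ?",
            (record_id,),
        )
        try:
            output = step_fn(record_id)  # stand-in for the LLM call
            conn.execute(
                "UPDATE fact_page SET relevance_status = 'DONE',"
                " relevance_output = ? WHERE record_id = ?",
                (output, record_id),
            )
        except Exception:
            conn.execute(
                "UPDATE fact_page SET relevance_status = 'FAILED' WHERE record_id = ?",
                (record_id,),
            )
        conn.commit()  # commit per record so a crash loses at most one row
    return len(pending)
```

Because the selection is "anything not DONE", rows left in RUNNING by a crash and rows marked FAILED are both retried on the next run, while DONE rows never trigger a second call.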

Composes with¶

patterns/two-pass-classify-then-deep-extract — sibling pattern at scanned-document altitude (MapAid groundwater pipeline). A specialisation that adds intelligent sampling on the cheap pass.
patterns/visual-first-document-extraction — orthogonal: what modality the steps process. Multi-step is how many calls; visual-first is what input per call.
patterns/llm-judge-as-inline-pipeline-stage — composes cleanly: one of the steps in a multi-step pipeline can be an LLM judge that scores the previous step's output.
patterns/sql-native-multimodal-llm-inference — composes via Databricks AI Functions (ai_query) — each step expressed as a SQL function call inside a DataFrame transformation.
patterns/structural-deterministic-logical-llm-split — cousin pattern at code-migration altitude (Deutsche Börse Zeppelin Converter). Both decompose a heterogeneous task into a sequence; the difference is that the structural-vs-logical split is a binary stage decision, while multi-step LLM extraction has N homogeneous steps with conditional gating.

When it applies / doesn't fit¶

Applies when¶

Workload is web-scale or document-scale (millions of records).
LLM-call cost is the dominant pipeline cost.
Extraction has a natural decomposition with at least one cheap classification step that gates an expensive extraction step.
Pipeline must be resumable and incremental.
Operators want to add new extraction targets without code deploys.

Doesn't fit when¶

Workload is small (fits in one prompt's context window).
Per-record cost is dominated by I/O, not LLM call.
The decomposition is artificial — every input requires every step.
Real-time / single-record use case (status checkpointing is pure overhead).

Failure modes¶

Step-boundary leaks. A too-aggressive cheap classifier filters out valid records before the expensive step sees them. Mitigation: tune classifier recall conservatively.
State enum sprawl. Every new step adds new status values; without discipline the state model becomes a debugging hell. Mitigation: keep status enums small and reused across steps.
Configurable-registry drift. When the extraction registry is changed, historical fact-table outputs may no longer match current schema. Mitigation: registry versioning + per-record schema-version FK.
No closed quality-feedback loop. Without monitoring per-step precision / recall, regressions in an upstream classifier silently poison downstream extraction. Mitigation: an inline LLM judge (patterns/llm-judge-as-inline-pipeline-stage) on each step.
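Per-step precision / recall monitoring can be sketched as a check against a labelled sample. Note that this is a plain labelled-sample audit, not the inline LLM judge named above; the labels could come from human review or from a judge step. The function names and the 0.95 recall floor are assumptions for illustration.

```python
def precision_recall(predicted, actual):
    """Precision and recall of boolean gate decisions against boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def gate_is_leaking(predicted, actual, min_recall=0.95):
    """Flag a step-boundary leak when gate recall falls below an assumed floor."""
    _, recall = precision_recall(predicted, actual)
    return recall is not None and recall < min_recall
```

Recall is the number to watch on the cheap gate: a false negative there removes a valid record before the expensive step ever sees it, whereas a false positive only costs one extra extraction call.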

Seen in¶

sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. VF Match Foundational Data Refresh: three-step decomposition (classify medical relevance → identify org type → extract specialties / equipment / procedures), 25M+ web pages, status-based checkpointing, configurable extraction registry, star-schema state, Lakeflow Jobs orchestration of 15+ tasks with conditional branching + retry. Verbatim claim: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
sources/2026-05-11-databricks-unlocking-the-archives — sibling instance at scanned-document altitude. MapAid groundwater archive: two-pass classify-then-deep-extract is a specialisation of multi-step extraction with intelligent sampling on the cheap pass, multimodal ai_query calls per step, inline LLM judge per stage.

Related¶

concepts/multi-step-llm-extraction — the core principle this pattern operationalises.
concepts/status-based-llm-pipeline-checkpointing — the resumability sub-property.
concepts/star-schema — the state-model substrate.
concepts/schema-constrained-llm-output — the per-step output discipline.
patterns/two-pass-classify-then-deep-extract — sibling pattern at document-extraction altitude.
patterns/llm-judge-as-inline-pipeline-stage — composes for per-step quality monitoring.
systems/databricks-ai-functions / systems/lakeflow-jobs — Databricks-stack substrate for this pattern.