PATTERN Cited by 2 sources

Multi-step LLM extraction pipeline

The pattern

Decompose a large-scale LLM-driven information-extraction job into a sequence of narrow LLM invocations, each with a focused prompt over a focused input slice, instead of a single one-shot LLM call that tries to extract everything at once. Compose the steps with:

  1. Status-based checkpointing at per-record granularity, so re-runs resume from the failure point without re-paying the LLM cost on already-processed rows.
  2. A configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), so adding / changing extractions is configuration, not code.
  3. A star schema state model — central fact table holds per-record extraction state; dimension tables hold reference data. Queryable, evolvable, restartable.
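
The checkpointing idea in item 1 can be sketched in a few lines. This is a minimal in-memory illustration, not the VF Match implementation: the `Record` class, the `run_step` function, and the `call_llm` callable are hypothetical names, and a real pipeline would persist the status in the fact table rather than in Python objects.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(str, Enum):
    NOT_STARTED = "NOT_STARTED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class Record:
    record_id: int
    text: str
    status: Status = Status.NOT_STARTED
    output: Optional[str] = None


def run_step(records: list[Record], call_llm: Callable[[str], str]) -> int:
    """Process every record that is not DONE; return the number of LLM calls made."""
    calls = 0
    for rec in records:
        if rec.status is Status.DONE:
            continue  # checkpoint hit: finished rows are never paid for again
        rec.status = Status.RUNNING
        calls += 1
        try:
            rec.output = call_llm(rec.text)
            rec.status = Status.DONE
        except Exception:
            rec.status = Status.FAILED  # picked up again on the next run
    return calls
```

Calling `run_step` a second time on the same records only touches rows left in NOT_STARTED, RUNNING, or FAILED, which is the resume-from-failure behaviour described above.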

The canonical wiki instance is the VF Match Foundational Data Refresh pipeline, which processes 25M+ web pages through OpenAI GPT models to build a global healthcare-facility / NGO catalog.

The pipeline shape

            Raw web pages (Bright Data + Overture Maps)
                  │
                  ▼
            ┌────────────────────────────────────┐
            │ Step 1: Classify medical-relevance │
            │  (cheap LLM call, narrow prompt)   │
            └────────────────────────────────────┘
                  │ relevant         │ not-relevant
                  ▼                  ▼
            ┌────────────┐    ┌────────────┐
            │ Step 2:    │    │ DONE       │
            │ org-type   │    │ (no further│
            │ classifier │    │  cost)     │
            └────────────┘    └────────────┘
                  │
                  ▼
            ┌────────────────────────────────────┐
            │ Step 3: Extract specialties /      │
            │  equipment / procedures            │
            │  (expensive LLM call, schema-      │
            │   constrained output)              │
            └────────────────────────────────────┘
                  │
                  ▼
            ┌────────────────────────────────────┐
            │ Star-schema fact table:            │
            │  per-record state +                │
            │  per-step status columns +         │
            │  extracted-output FKs              │
            └────────────────────────────────────┘

(Source: sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries)
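
The control flow in the diagram can be sketched as a single routing function. This is an illustration under assumptions: `process_page` and its three callables (`is_relevant`, `classify_org`, `extract_details`) are hypothetical stand-ins for the three LLM steps, and the result dictionary stands in for one fact-table row.

```python
from typing import Callable


def process_page(
    page: str,
    is_relevant: Callable[[str], bool],
    classify_org: Callable[[str], str],
    extract_details: Callable[[str, str], dict],
) -> dict:
    """Run the three steps in order, stopping early on not-relevant pages."""
    result = {"relevant": False, "org_type": None, "details": None}

    # Step 1: cheap gate. Not-relevant pages stop here at no further cost.
    if not is_relevant(page):
        return result
    result["relevant"] = True

    # Step 2: org-type classification, only for relevant pages.
    result["org_type"] = classify_org(page)

    # Step 3: expensive extraction, only reached after both earlier steps.
    result["details"] = extract_details(page, result["org_type"])
    return result
```

Passing the steps in as callables keeps each one independently replaceable, which matches the point that one step's prompt can be tuned without touching the others.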

Why this beats one-shot

  • Cheap step gates expensive step. Flagging medical relevance on every web page is cheap; full specialty / equipment / procedure extraction is expensive. Gating the expensive step on the cheap step's output removes the dominant cost for the negative class, since most pages aren't medically relevant.
  • Narrow-prompt precision. A prompt that asks only "is this medically relevant?" tends to outperform the same model answering that question inside a multi-decision prompt. A combined prompt splits the model's attention budget across decisions; separate prompts keep one decision's accuracy from being diluted by another's.
  • Independent prompt iteration. When the org-type classifier's precision drops, you tune that step's prompt without touching the relevance step.
  • Inspectable failure modes. When the output is wrong, the per-step status columns identify which step went wrong.
  • Dramatic token reduction. Verbatim from VF Match: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
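
The gating argument can be made concrete with a small worked calculation. All numbers below are invented for illustration and do not come from the source. The sketch also assumes a simplified two-step pipeline in which the cheap classifier reads a shorter input slice than the full extraction does.

```python
def one_shot_tokens(total_pages: int, extract_tokens: int) -> int:
    """Every page pays for the full extraction prompt."""
    return total_pages * extract_tokens


def gated_tokens(total_pages: int, relevant_pages: int,
                 classify_tokens: int, extract_tokens: int) -> int:
    """Every page pays for the cheap classifier; only relevant pages pay for extraction."""
    return total_pages * classify_tokens + relevant_pages * extract_tokens


# Hypothetical workload: 1M pages, 10% relevant,
# 300 tokens per classification, 3,000 tokens per extraction.
TOTAL, RELEVANT = 1_000_000, 100_000
CLASSIFY, EXTRACT = 300, 3_000

baseline = one_shot_tokens(TOTAL, EXTRACT)                # 3,000,000,000 tokens
gated = gated_tokens(TOTAL, RELEVANT, CLASSIFY, EXTRACT)  # 600,000,000 tokens
saved_pct = 100 * (baseline - gated) // baseline          # 80
```

Under these assumptions gating pays off whenever the relevant fraction stays below 1 - CLASSIFY / EXTRACT, which is 90% here. The closer the workload gets to "every input requires every step", the smaller the saving.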

Required substrate

  • Per-record state column. Each step's status (NOT_STARTED / RUNNING / DONE / FAILED) is a column on the fact table. See concepts/status-based-llm-pipeline-checkpointing.
  • Idempotent step execution. Re-running a row through a step must produce the same output (or fail safely). Schema-constrained output (concepts/schema-constrained-llm-output) is the standard discipline.
  • Configurable extraction registry. A map from step name → (system prompt, schema, model endpoint, retry policy). Adding a new extraction ("extract NGO funding sources") is a config row, not a code change.
  • Orchestration substrate. Lakeflow Jobs / Airflow / similar — to run the steps in sequence with conditional branching, parallel execution where possible, and intelligent retry.
  • Star-schema state model. concepts/star-schema — the fact-table-with-dimensions shape that makes queries against pipeline state trivial.
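
The registry and the schema-constrained output check can be sketched together. This is a minimal illustration, not the source's implementation: `ExtractionMethod`, `REGISTRY`, `parse_output`, the prompts, and the endpoint names are all hypothetical, and a tuple of required field names stands in for a full output schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionMethod:
    system_prompt: str
    required_fields: tuple[str, ...]  # stand-in for a full output schema
    model_endpoint: str               # hypothetical endpoint name
    max_retries: int = 2


# Adding an extraction target means adding an entry here, not new code.
REGISTRY: dict[str, ExtractionMethod] = {
    "medical_relevance": ExtractionMethod(
        system_prompt="Answer whether the page describes a medical facility.",
        required_fields=("is_relevant",),
        model_endpoint="small-model",
    ),
    "specialties": ExtractionMethod(
        system_prompt="List the medical specialties, equipment, and procedures.",
        required_fields=("specialties", "equipment", "procedures"),
        model_endpoint="large-model",
    ),
}


def parse_output(step: str, raw: str) -> dict:
    """Parse a model response; reject it unless it matches the step's schema."""
    method = REGISTRY[step]
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError(f"{step}: expected a JSON object")
    missing = [f for f in method.required_fields if f not in data]
    if missing:
        raise ValueError(f"{step}: missing fields {missing}")
    # Keep only declared fields so stored outputs always have the same shape.
    return {f: data[f] for f in method.required_fields}
```

A response that fails `parse_output` can be marked FAILED and retried, so a step either stores output of the declared shape or fails safely.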

When it applies / doesn't fit

Applies when

  • Workload is web-scale or document-scale (millions of records).
  • LLM-call cost is the dominant pipeline cost.
  • Extraction has a natural decomposition with at least one cheap classification step that gates an expensive extraction step.
  • Pipeline must be resumable and incremental.
  • Operators want to add new extraction targets without code deploys.

Doesn't fit when

  • Workload is small (fits in one prompt's context window).
  • Per-record cost is dominated by I/O, not LLM call.
  • The decomposition is artificial — every input requires every step.
  • Real-time / single-record use case (status checkpointing is pure overhead).

Failure modes

  • Step-boundary leaks. A too-aggressive cheap classifier filters out valid records before the expensive step sees them. Mitigation: tune classifier recall conservatively.
  • State enum sprawl. Every new step adds new status values; without discipline the state model becomes hard to query and debug. Mitigation: keep status enums small and reused across steps.
  • Configurable-registry drift. When the extraction registry is changed, historical fact-table outputs may no longer match current schema. Mitigation: registry versioning + per-record schema-version FK.
  • No closed quality-feedback loop. Without monitoring of per-step precision / recall, regressions in an upstream classifier silently poison downstream extraction. Mitigation: an inline LLM judge (patterns/llm-judge-as-inline-pipeline-stage) on each step.
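
The registry-drift mitigation (registry versioning plus a per-record schema-version FK) can be sketched with SQLite from the Python standard library. The table and column names (`dim_registry_version`, `fact_record`, `registry_version_id`) and both helper functions are assumptions made for this example, not the source's schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a tiny star schema: one dimension table and one fact table."""
    conn.executescript("""
        CREATE TABLE dim_registry_version (
            version_id INTEGER PRIMARY KEY,
            step_name  TEXT NOT NULL,
            is_current INTEGER NOT NULL          -- 1 for the live registry entry
        );
        CREATE TABLE fact_record (
            record_id           INTEGER PRIMARY KEY,
            extraction_status   TEXT NOT NULL,   -- NOT_STARTED / RUNNING / DONE / FAILED
            registry_version_id INTEGER REFERENCES dim_registry_version(version_id)
        );
    """)


def stale_record_ids(conn: sqlite3.Connection) -> list[int]:
    """Return DONE records whose output was produced by a superseded registry version."""
    rows = conn.execute("""
        SELECT f.record_id
        FROM fact_record AS f
        JOIN dim_registry_version AS v
          ON v.version_id = f.registry_version_id
        WHERE f.extraction_status = 'DONE' AND v.is_current = 0
        ORDER BY f.record_id
    """).fetchall()
    return [record_id for (record_id,) in rows]
```

Because each fact row records which registry version produced it, a registry change turns into a query for rows to reprocess instead of a silent mismatch between old outputs and the current schema.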

Seen in

  • sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries: canonical wiki source. VF Match Foundational Data Refresh: three-step decomposition (classify medical relevance → identify org type → extract specialties / equipment / procedures), 25M+ web pages, status-based checkpointing, configurable extraction registry, star-schema state, Lakeflow Jobs orchestration of 15+ tasks with conditional branching + retry. Verbatim claim: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
  • sources/2026-05-11-databricks-unlocking-the-archives: sibling instance at scanned-document altitude. MapAid groundwater archive: its two-pass classify-then-deep-extract flow is a specialisation of multi-step extraction, with intelligent sampling on the cheap pass, multimodal ai_query calls per step, and an inline LLM judge per stage.