CONCEPT Cited by 2 sources
Multi-step LLM extraction
Definition
Multi-step LLM extraction is the discipline of decomposing an information-extraction task into a sequence of narrow LLM invocations, each with a focused prompt over a focused input slice, instead of a single one-shot LLM call that tries to extract everything at once. The canonical statement of the decomposition comes from the Databricks + Virtue Foundation post on the FDR pipeline, quoted verbatim:
"Rather than attempting one-shot extraction, our pipeline breaks the task into targeted steps: classifying medical relevance, identifying organization type (either a medical facility or NGO), and extracting specialties, equipment, and procedures. This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
Why one-shot fails at scale
- Long prompts dilute the model's attention. A single prompt asking for "is this medically relevant; if so what kind; what specialties; what equipment; what procedures" splits the model's focus across five sub-tasks, and each is done worse than if it were asked alone.
- Token cost scales with the longest prompt. Even when 95% of pages don't need the full extraction (most are not medically relevant), a one-shot prompt pays the full token cost on every page.
- No early-exit on cheap negatives. A one-shot extraction has no way to skip the expensive specialty / equipment extraction for pages that aren't medically relevant — every page goes through the full pipeline.
- Failure modes get mixed. When one-shot output is wrong, you don't know whether the model misclassified relevance, mis-typed the org, or just hallucinated a specialty. Diagnosis is hard.
- Schema drift across calls. A monolithic prompt's output schema is the union of all sub-schemas; small format changes in any sub-task can break downstream parsing.
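The token-cost and early-exit points above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and every number (page count, tokens per call, relevant fraction) are assumptions chosen for the example, not figures from either source.

```python
def one_shot_tokens(pages: int, full_call_tokens: int) -> int:
    """Every page pays for the full extraction call."""
    return pages * full_call_tokens


def gated_tokens(pages: int, classify_tokens: int,
                 extract_tokens: int, relevant_fraction: float) -> int:
    """Every page pays for the cheap classifier; only relevant pages pay for extraction."""
    relevant_pages = round(pages * relevant_fraction)
    return pages * classify_tokens + relevant_pages * extract_tokens


# Hypothetical workload: 1,000 pages, 5% of them relevant.
one_shot = one_shot_tokens(pages=1000, full_call_tokens=2000)
gated = gated_tokens(pages=1000, classify_tokens=300,
                     extract_tokens=2000, relevant_fraction=0.05)
print(one_shot, gated)
```

Under these assumed numbers the one-shot design spends 1000 × 2000 = 2,000,000 tokens, while the gated design spends 1000 × 300 + 50 × 2000 = 400,000. The saving shrinks as the relevant fraction grows or as the classifier call gets more expensive.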
What the multi-step shape buys
- Narrow, high-precision invocations. Each step's prompt optimises for one decision; no attention-budget split.
- Cheap steps gate expensive steps. A cheap classification step (the medical-relevance flag) gates the expensive extraction step, so only qualifying records pay the full extraction cost. The MapAid groundwater pipeline uses the same pattern: a classification pass routes only the ~50% of pages flagged as water-related to the expensive multimodal extraction pass.
- Independent prompt iteration. When precision drops on the org-type step, you tune that step's prompt without breaking the relevance step's prompt. Specialisation supports targeted iteration.
- Dramatic token reduction. Verbatim from the post: "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
- Inspectable failure modes. When the output is wrong, the step at which it went wrong is identifiable.
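The properties listed above can be sketched in a few lines of Python. This is a minimal illustration, not the FDR implementation: the prompt wording, the `Extraction` record, the `trace` field, and the `fake_llm` stub are all hypothetical, and only the specialties extraction is shown (equipment and procedures would follow the same shape).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stand-in for a model client: (prompt, page text) -> raw reply.
LLM = Callable[[str, str], str]

# One prompt per step, so each can be tuned without touching the others.
RELEVANCE_PROMPT = "Is this page medically relevant? Answer yes or no."
ORG_TYPE_PROMPT = "Is this a medical facility or an NGO? Answer facility or ngo."
SPECIALTIES_PROMPT = "List the medical specialties mentioned, comma-separated."


@dataclass
class Extraction:
    relevant: bool = False
    org_type: Optional[str] = None
    specialties: list[str] = field(default_factory=list)
    # (step name, raw reply) pairs, kept so a wrong result can be traced to a step.
    trace: list[tuple[str, str]] = field(default_factory=list)


def extract(page: str, llm: LLM) -> Extraction:
    result = Extraction()

    # Step 1: cheap gate. Non-relevant pages exit here and never pay for steps 2 and 3.
    reply = llm(RELEVANCE_PROMPT, page).strip().lower()
    result.trace.append(("relevance", reply))
    if reply != "yes":
        return result
    result.relevant = True

    # Step 2: organisation type.
    reply = llm(ORG_TYPE_PROMPT, page).strip().lower()
    result.trace.append(("org_type", reply))
    result.org_type = reply

    # Step 3: narrow extraction.
    reply = llm(SPECIALTIES_PROMPT, page)
    result.trace.append(("specialties", reply))
    result.specialties = [s.strip() for s in reply.split(",") if s.strip()]
    return result


def fake_llm(prompt: str, text: str) -> str:
    """Toy keyword rules so the sketch runs without a real model."""
    if prompt == RELEVANCE_PROMPT:
        return "yes" if "clinic" in text.lower() else "no"
    if prompt == ORG_TYPE_PROMPT:
        return "facility"
    return "cardiology, radiology"
```

Calling `extract("Pizza menu", fake_llm)` makes a single model call and returns with a one-entry trace, whereas a page mentioning a clinic runs all three steps and records each raw reply.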
Three sub-properties of a robust multi-step pipeline
The VF Match FDR pipeline canonicalises three sub-properties that make multi-step LLM extraction production-grade:
- Status-based checkpointing. Each record's extraction state is tracked in a state column; re-runs resume from the failure point without re-paying LLM cost on already-processed rows.
- Configurable extraction registry. Each extraction method is a structured object (system prompt + extraction schema); adding a new extraction is configuration, not code.
- Star schema state model. Per-record extraction state lives in a central fact table with foreign keys to dimension tables — query-friendly, reproducible, restartable.
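As a rough sketch of the first two properties, the snippet below models an extraction method as a plain data object and skips rows already marked done. Everything here is an assumption made for illustration: the `ExtractionMethod` fields, the `REGISTRY` contents, the status strings, and the in-memory `status` dict standing in for a state column.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExtractionMethod:
    name: str
    system_prompt: str
    schema: dict  # hypothetical: field name -> expected Python type


# Adding a new extraction means adding an entry here, not writing new control flow.
REGISTRY = {
    m.name: m
    for m in [
        ExtractionMethod("relevance", "Is this page medically relevant?", {"relevant": bool}),
        ExtractionMethod("org_type", "Medical facility or NGO?", {"org_type": str}),
    ]
}


def run_pending(rows: list[dict], method: ExtractionMethod,
                call: Callable[[ExtractionMethod, str], dict]) -> int:
    """Run `method` on rows not yet marked done; return how many calls were attempted."""
    attempted = 0
    for row in rows:
        status = row.setdefault("status", {})
        if status.get(method.name) == "done":
            continue  # already processed: no repeated LLM cost on re-run
        attempted += 1
        try:
            row[method.name] = call(method, row["text"])
            status[method.name] = "done"
        except Exception:
            status[method.name] = "failed"  # picked up again on the next run
    return attempted
```

A second call to `run_pending` over the same rows only touches the ones whose status is missing or `failed`, which is the resumability the checkpointing bullet describes.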
These three properties compose into the patterns/multi-step-llm-extraction-pipeline pattern.
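One way to picture the star-schema state model is a small in-memory SQLite database. The table and column names below (`dim_page`, `dim_method`, `fact_extraction`) are invented for this sketch; the source does not specify which dimensions the real fact table references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_page   (page_id   INTEGER PRIMARY KEY, url  TEXT NOT NULL);
CREATE TABLE dim_method (method_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Central fact table: one row per (page, extraction method) with its status.
CREATE TABLE fact_extraction (
    page_id   INTEGER NOT NULL REFERENCES dim_page(page_id),
    method_id INTEGER NOT NULL REFERENCES dim_method(method_id),
    status    TEXT    NOT NULL,
    PRIMARY KEY (page_id, method_id)
);
""")
conn.executemany("INSERT INTO dim_page VALUES (?, ?)",
                 [(1, "https://example.org/a"), (2, "https://example.org/b")])
conn.executemany("INSERT INTO dim_method VALUES (?, ?)",
                 [(1, "relevance"), (2, "org_type")])
conn.executemany("INSERT INTO fact_extraction VALUES (?, ?, ?)",
                 [(1, 1, "done"), (1, 2, "failed"), (2, 1, "done")])


def pending(conn: sqlite3.Connection, method_name: str) -> list[str]:
    """URLs of pages whose fact row for this method is missing or not 'done'."""
    rows = conn.execute(
        """
        SELECT p.url
        FROM dim_page p
        JOIN dim_method m ON m.name = ?
        LEFT JOIN fact_extraction f
               ON f.page_id = p.page_id AND f.method_id = m.method_id
        WHERE f.status IS NULL OR f.status != 'done'
        ORDER BY p.page_id
        """,
        (method_name,),
    ).fetchall()
    return [url for (url,) in rows]
```

Because state lives in ordinary tables, "what still needs the org-type step?" is a single query (`pending(conn, "org_type")`) rather than a scan of logs.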
Relationship to other concepts
- concepts/schema-constrained-llm-output — each step in the multi-step pipeline emits structured output via a per-step schema. Multi-step is how many calls; schema-constrained is what shape per call.
- concepts/intelligent-document-sampling — a related upstream optimisation (sample expensive computation; full extraction only on the qualifying subset). The MapAid source pairs the two; FDR uses the multi-step decomposition without explicit sampling because the input space is web pages, not sampled multi-page PDFs.
- concepts/multimodal-document-understanding — orthogonal: multi-step is about how many calls; multimodal is about what modality per call. They compose freely.
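To illustrate how a per-step schema composes with multi-step extraction, the sketch below validates one step's raw JSON reply against that step's own small schema. The `parse_step_output` helper and the dict-of-types schema format are assumptions for this example; a production pipeline would more likely rely on the model provider's structured-output support or a validation library.

```python
import json


def parse_step_output(raw: str, schema: dict[str, type]) -> dict:
    """Parse a step's JSON reply and check it has exactly the expected fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("step output must be a JSON object")
    if set(data) != set(schema):
        raise ValueError(f"expected fields {sorted(schema)}, got {sorted(data)}")
    for key, expected_type in schema.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} should be {expected_type.__name__}")
    return data


# Each step carries only its own narrow schema.
ORG_TYPE_SCHEMA = {"org_type": str}
print(parse_step_output('{"org_type": "ngo"}', ORG_TYPE_SCHEMA))
```

Keeping schemas per step means a format change in one step's output can only break that step's parser.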
Failure modes
- Step boundaries leak. When a cheap classification step is too aggressive on the negative class, the downstream extraction step sees nothing useful and emits silent garbage. Mitigation: tune the classifier to favour recall, accepting some false positives so that borderline pages still reach extraction.
- State machine grows unbounded. Each new extraction step adds new status values; without discipline, the state model becomes a tangle of one-off codes that is hard to debug.
- No closed feedback loop. Without monitoring per-step quality, regressions in upstream classifier accuracy silently poison downstream extraction.
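A minimal form of per-step monitoring is to score the gating classifier against a small hand-labelled sample and flag it when recall drops. The sketch below assumes such a labelled sample exists; the function names and the 0.95 threshold are illustrative choices, not values from the source.

```python
def recall(predicted: list[bool], actual: list[bool]) -> float:
    """Share of truly relevant pages that the classifier let through."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_neg = sum((not p) and a for p, a in zip(predicted, actual))
    if true_pos + false_neg == 0:
        return 1.0  # no relevant pages in the sample, so nothing was missed
    return true_pos / (true_pos + false_neg)


def gate_is_healthy(predicted: list[bool], actual: list[bool],
                    min_recall: float = 0.95) -> bool:
    """False when the gate is dropping too many relevant pages (assumed threshold)."""
    return recall(predicted, actual) >= min_recall


# Hypothetical labelled sample: four relevant pages, one of which the gate rejected.
actual = [True, True, True, True, False]
predicted = [True, True, True, False, False]
print(recall(predicted, actual), gate_is_healthy(predicted, actual))
```

Here recall is 3 / 4 = 0.75, so the check fails and the regression is caught at the classifier rather than surfacing later as missing extractions.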
Seen in
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. VF Match FDR's three-step decomposition (classify medical relevance → identify org type → extract specialties / equipment / procedures) over 25M+ web pages, with status-based checkpointing, configurable extraction registry, and star-schema state. "Dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task."
- sources/2026-05-11-databricks-unlocking-the-archives — sibling source at scanned-document altitude. MapAid's two-pass classify-then-deep-extract is a specialisation of multi-step LLM extraction with intelligent sampling on the cheap pass.
Related
- patterns/multi-step-llm-extraction-pipeline — the named pattern this concept is the core principle of.
- patterns/two-pass-classify-then-deep-extract — sibling pattern at document-extraction altitude.
- concepts/status-based-llm-pipeline-checkpointing — the resumability primitive multi-step extraction depends on at scale.
- concepts/star-schema — the state-model substrate.
- concepts/schema-constrained-llm-output — the per-step output discipline.
- systems/databricks-ai-functions — the SQL-callable LLM inference primitive Databricks customers use to express multi-step extraction inline with table data.