PATTERN Cited by 1 source

Two-Pass Classify-Then-Deep-Extract

Two-Pass Classify-Then-Deep-Extract is the document-pipeline shape where the corpus is processed in two phases at different cost profiles:

Cheap classification pass over every document with sampled pages, producing tags + a relevance/routing flag.
Expensive extraction pass over only the documents the flag selected, processing every page for structured-record extraction.

The pattern is a budget-allocation move: pay full per-page extraction cost only on the subset of the corpus that justifies it.

Problem

LLM-driven document extraction is expensive per page. Naively running a single deep-extract pass over the entire corpus burns tokens on documents that turn out to be irrelevant. The MapAid groundwater pipeline illustrates the asymmetry: ~50% of the archive turned out to be water-relevant and ~50% did not. A naive one-pass pipeline would have spent roughly half its inference budget (assuming relevant and irrelevant documents are of similar length) extracting JSON records from documents nobody downstream would ever query.
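The budget argument can be made concrete with a back-of-the-envelope model. This is a minimal sketch, not a figure from the source: `relative_cost` and its parameters are hypothetical names, the fractions are treated as fractions of pages, and the per-page costs are placeholders because the source publishes no per-page cost numbers.

```python
def relative_cost(
    sample_fraction: float,
    relevant_fraction: float,
    classify_cost_per_page: float = 1.0,
    extract_cost_per_page: float = 1.0,
) -> float:
    """Two-pass cost as a fraction of one deep-extract pass over every page.

    Assumptions (not from the source): both fractions are fractions of pages,
    and per-page costs are constant within each pass.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in [0, 1]")
    if not 0.0 <= relevant_fraction <= 1.0:
        raise ValueError("relevant_fraction must be in [0, 1]")
    if classify_cost_per_page < 0 or extract_cost_per_page <= 0:
        raise ValueError("per-page costs must be positive")

    # Pass 1 touches only the sampled pages of every document.
    pass1 = sample_fraction * classify_cost_per_page
    # Pass 2 touches every page of the documents the flag selected.
    pass2 = relevant_fraction * extract_cost_per_page
    # Baseline: deep extraction over 100% of pages.
    return (pass1 + pass2) / extract_cost_per_page
```

With a 30% sample, a 50% relevant share, and equal per-page costs, the model gives roughly 0.8 of the one-pass cost. The saving grows as the relevant share shrinks or as classification becomes cheaper per page than extraction.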

You also can't skip the irrelevant documents — they still need to be discoverable (classified by topic, geography, presence of water data) so users can find them by category. So the irrelevant ones still need some processing; just not the same processing as the relevant ones.

Solution

Split the work into two passes with different per-page work profiles:

                ┌─── Pass 1 (cheap) ──────────────────┐
Every doc  ──>  │ Sampled pages: title, intro,        │ ─> Tags +
                │ conclusions only.                   │    routing flag
                │ Multimodal classification.          │    per doc
                │ <30% of pages, <30% of cost.        │
                └─────────────────────────────────────┘
                            │
                            │ filter on routing flag
                            ▼
                ┌─── Pass 2 (expensive) ──────────────┐
Selected docs > │ Every page: full multimodal OCR     │ ─> Structured
   (~50%)       │ + entity recognition.               │    JSON records
                │ Schema-constrained record extract.  │    per doc
                │ All pages, full per-page cost.      │
                └─────────────────────────────────────┘
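The control flow in the diagram can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not the source pipeline: `Document`, `RoutingRow`, `run_two_pass`, and the three stub functions are hypothetical names, and the stubs stand in for the multimodal model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Document:
    doc_id: str
    pages: list[str]


@dataclass
class RoutingRow:
    doc_id: str
    tags: list[str]
    water_relevant: bool  # the routing flag pass 2 filters on


def run_two_pass(
    docs: list[Document],
    sample: Callable[[Document], list[str]],
    classify: Callable[[str, list[str]], RoutingRow],
    extract: Callable[[str, list[str]], list[dict]],
) -> tuple[list[RoutingRow], dict[str, list[dict]]]:
    # Pass 1 (cheap): every document, sampled pages only.
    routing = [classify(d.doc_id, sample(d)) for d in docs]
    selected = {row.doc_id for row in routing if row.water_relevant}
    # Pass 2 (expensive): selected documents only, every page.
    records = {d.doc_id: extract(d.doc_id, d.pages) for d in docs if d.doc_id in selected}
    return routing, records


# Hypothetical stand-ins for the model calls, for illustration only.
def sample_first_and_last(doc: Document) -> list[str]:
    if len(doc.pages) <= 1:
        return list(doc.pages)
    return [doc.pages[0], doc.pages[-1]]


def keyword_classify(doc_id: str, pages: list[str]) -> RoutingRow:
    relevant = any("borehole" in page.lower() for page in pages)
    return RoutingRow(doc_id, ["water"] if relevant else ["other"], relevant)


def page_extract(doc_id: str, pages: list[str]) -> list[dict]:
    return [{"doc_id": doc_id, "page": i, "text": text} for i, text in enumerate(pages)]
```

Every document gets a routing row, so unselected documents remain discoverable by tag even though pass 2 never reads them.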

In the MapAid pipeline:

Pass 1: Sample title pages, introductions, conclusions on long documents (per intelligent sampling). Run multimodal ai_query to emit Dewey Decimal codes + Sudanese geographies + water-relevance flag. "Reduced AI processing volume by more than 70% while preserving classification quality."
Pass 2: For each water-flagged document, process every page through the Foundation Model API for OCR + well/borehole identifier entity recognition. Merge text from all pages into a unified document representation. Run schema-constrained ai_query to emit JSON records for site name / GPS / depth / water level / yield.
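One way to picture the pass 2 output schema is as a typed record plus a validator for the JSON the model returns. This is a sketch under assumptions: the field names and units are hypothetical (the source lists site name, GPS, depth, water level, and yield but does not give names or units), and checking after the fact is a stand-in for illustration, not the constraint mechanism the source's `ai_query` call uses.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WellRecord:
    # Field names and units are assumptions made for this sketch.
    site_name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    depth_m: Optional[float] = None
    water_level_m: Optional[float] = None
    yield_m3_per_h: Optional[float] = None


ALLOWED_FIELDS = {f.name for f in fields(WellRecord)}
NUMERIC_FIELDS = ALLOWED_FIELDS - {"site_name"}


def parse_record(raw_json: str) -> WellRecord:
    """Parse one model-emitted JSON object and reject anything off-schema."""
    data = json.loads(raw_json)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("record must be a JSON object")
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    site = data.get("site_name")
    if not isinstance(site, str) or not site.strip():
        raise ValueError("site_name must be a non-empty string")
    for name in NUMERIC_FIELDS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, (int, float))):
            raise ValueError(f"{name} must be a number or null")
    return WellRecord(**data)
```

Missing numeric fields default to `None`, which fits documents that report only some of the measurements for a site.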

"For each water-relevant document, the pipeline processes every page rather than just the sampled subset used for classification. OCR is performed page by page using a multimodal model… During OCR, the system also applies an entity recognition approach, identifying well and borehole identifiers as anchor entities so that records spanning multiple pages can be linked back to a single site." (Source: sources/2026-05-11-databricks-unlocking-the-archives)

Mechanics

Pass 1 sampling strategy is informed by document length, with informative sections preferred. See concepts/intelligent-document-sampling.
Routing flag in pass 1 is a typed boolean / categorical column on a Delta table. Pass 2 reads that table, filters, and runs per-page extraction.
Pass 2 anchor-entity linking lets multi-page records (coords on page 3, depth on page 7, yield on page 12) merge into a single output record. Without anchor entities, multi-page records would arrive as fragmented JSON objects.
Pass 1 quality is gated by an inline LLM-as-judge (see patterns/llm-judge-as-inline-pipeline-stage). Sub-threshold pass-1 classifications go to manual review before pass 2 wastes budget on a document whose flag may be wrong.
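The anchor-entity linking item above amounts to a group-and-merge over per-page fragments. A minimal sketch, assuming each page yields a dict carrying an `anchor_id` (the well or borehole identifier), a `page` number, and whatever fields that page contained; these key names, the `merge_by_anchor` function, and the first-value-wins rule are assumptions for illustration, not details from the source.

```python
def merge_by_anchor(fragments: list[dict]) -> list[dict]:
    """Merge per-page fragments that share an anchor entity into one record each.

    Conflict rule (an assumption of this sketch): the first non-null value
    seen in page order wins.
    """
    merged: dict[str, dict] = {}
    for fragment in sorted(fragments, key=lambda f: f["page"]):
        anchor = fragment["anchor_id"]
        record = merged.setdefault(anchor, {"anchor_id": anchor, "source_pages": []})
        record["source_pages"].append(fragment["page"])
        for key, value in fragment.items():
            if key in ("anchor_id", "page") or value is None:
                continue
            # Keep an existing value; only fill fields not yet populated.
            record.setdefault(key, value)
    return list(merged.values())
```

Fragments for the same identifier found on pages 3, 7, and 12 collapse into a single record whose `source_pages` list keeps the provenance, which helps when a merged value later needs checking against the scan.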

Why this beats a single uniform pass

Cost: budget concentrated on the ~50% that earns it.
Quality: pass 2 runs on every page (no sampling) precisely because pass 1 already filtered the corpus down to the high-value subset.
Iterability: prompt changes for classification (pass 1) don't re-trigger expensive extraction (pass 2). Schema changes for extraction don't re-trigger classification.
Observability: the routing-flag column is a natural debugging surface — "why didn't this document show up in extracted records" resolves at pass 1's flag value, not at the end of a single monolithic pipeline.
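The observability point can be illustrated with a small lookup that answers the "why didn't this document show up" question from the routing table alone. This is a sketch: `explain_missing`, the `water_relevant` key, and the in-memory dict standing in for the Delta table are all hypothetical.

```python
def explain_missing(
    doc_id: str,
    routing_table: dict[str, dict],
    extracted_doc_ids: set[str],
) -> str:
    """Say at which stage a document dropped out of the extracted records."""
    row = routing_table.get(doc_id)
    if row is None:
        return "not classified: no pass 1 row exists"
    if not row["water_relevant"]:
        return "filtered at pass 1: routing flag is false"
    if doc_id not in extracted_doc_ids:
        return "flagged by pass 1 but no pass 2 output found"
    return "present in extracted records"
```

In a single monolithic pass the first two cases are indistinguishable from an extraction that simply produced nothing.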

When to use

Corpora where only a fraction is the target of expensive processing.
Heterogeneous document corpora where classification is needed for discoverability and deep extraction is needed for the matching subset.
Pipelines where cost-per-page is high enough that filter-first pays for itself.

When not to use

Small corpora where the cost savings don't justify a two-stage pipeline.
Corpora where every document needs deep extraction (no routing/filtering signal exists).
Pipelines where the classification signal only exists in pages the sampler would skip (e.g. middle-of-document content).
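The last caveat is easier to see with a concrete sampler. The sketch below assumes a simple head-and-tail rule (title and introduction near the front, conclusions near the end); the source describes its sampler only as length-informed and preferring informative sections, so `sample_page_indices` and its default values are hypothetical.

```python
def sample_page_indices(
    n_pages: int,
    head: int = 3,
    tail: int = 2,
    short_doc_limit: int = 10,
) -> list[int]:
    """Return zero-based page indices to classify.

    Short documents are read in full; long ones contribute only their
    first `head` and last `tail` pages.
    """
    if n_pages < 0 or head < 0 or tail < 0:
        raise ValueError("page counts must be non-negative")
    if head + tail > short_doc_limit:
        raise ValueError("head + tail must not exceed short_doc_limit")
    if n_pages <= short_doc_limit:
        return list(range(n_pages))
    return list(range(head)) + list(range(n_pages - tail, n_pages))


def signal_is_sampled(signal_page: int, n_pages: int) -> bool:
    """True if the page carrying the classification signal would be read."""
    return signal_page in sample_page_indices(n_pages)
```

For a 40-page document the defaults return indices 0, 1, 2, 38, and 39, so a well-data table on page 20 never reaches the classifier and the document would be flagged from its front and back matter alone.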

Tradeoffs

Pipeline complexity. Two passes + a routing-flag table is more scaffolding than a single pass.
Pass 1 false negatives silently exclude documents from pass 2 forever. Mitigation: inline judge on pass 1; periodic spot-check of documents whose flag is false.
Two iteration cycles when the classification and extraction schemas both change.
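The two false-negative mitigations above can share one review queue: every sub-threshold classification, plus a random sample of the confidently negative documents. A minimal sketch, assuming each pass 1 row carries a `judge_score` in [0, 1] and a boolean `water_relevant`; those key names, the 0.8 threshold, and the sample size are hypothetical, not values from the source.

```python
import random


def build_review_queue(
    rows: list[dict],
    judge_threshold: float = 0.8,
    spot_check_size: int = 20,
    seed: int = 0,
) -> dict[str, list[str]]:
    """Pick pass 1 rows for manual review.

    low_confidence: every document the judge scored below the threshold.
    spot_check: a reproducible random sample of documents that passed the
    judge but were flagged not relevant, to catch silent false negatives.
    """
    low_confidence = [r["doc_id"] for r in rows if r["judge_score"] < judge_threshold]
    confident_negatives = sorted(
        r["doc_id"]
        for r in rows
        if r["judge_score"] >= judge_threshold and not r["water_relevant"]
    )
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    k = min(spot_check_size, len(confident_negatives))
    return {
        "low_confidence": low_confidence,
        "spot_check": rng.sample(confident_negatives, k),
    }
```

Changing the seed on each audit cycle rotates which negative documents get a second look.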

Seen in

sources/2026-05-11-databricks-unlocking-the-archives: canonical wiki instance. ~700-document archive; pass 1 classifies every document from sampled pages (a >70% reduction in processing volume versus reading all pages) and emits water-relevance flags; pass 2 runs every-page OCR + JSON extraction only on the ~50% water-flagged subset, producing 299 structured records.