PATTERN Cited by 1 source

Visual-First Document Extraction¶

Visual-First Document Extraction is the pipeline shape in which scanned-document pages are rendered as images and sent directly to a multimodal LLM as the first processing step, rather than applying the LLM as post-processing over OCR output. The model interprets the page visually and emits classifications, structured fields, or transcribed text in one inference call.

This is the operational flip of the conventional document pipeline: OCR is no longer a prerequisite stage; it is replaced by a multimodal inference call.

Problem¶

Classical document-processing pipelines look like:

Raw scan
  → deskew / rotate / enhance
  → OCR engine emits text + bounding boxes + confidence scores
  → language detection / script router
  → layout segmentation (paragraphs, tables, lists)
  → post-processing (NLP) over OCR'd text
  → classification / extraction model on text
  → output

Each stage has its own failure modes. Skewed or low-quality scans break OCR. Mixed-script documents (English + Arabic) require a language router. Handwritten field notes break OCR entirely. Tabular layouts confuse most layout segmenters. Each failure propagates downstream.

Solution¶

Render the page as an image. Send the image to a multimodal LLM with a prompt describing the desired output (classification tags / OCR'd text + entities / extracted JSON records). Consume the model's structured response.

Raw scan
  → render page as image
  → multimodal ai_query(prompt, page_image, output_schema)
  → output

The model handles deskew, language, layout, and content interpretation implicitly. Mixed scripts, handwritten notes, and tabular data all flow through the same inference call.
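The single-call shape above can be sketched in Python. This is a minimal illustration, not the source pipeline's code: `build_request`, `extract_page`, `fake_model`, and the fields in `PAGE_SCHEMA` are all hypothetical names, and the model call is a stub so the example runs offline.

```python
import base64
import json
from typing import Callable

# Hypothetical output schema; the field names are illustrative assumptions.
PAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}},
        "mentions_water": {"type": "boolean"},
    },
    "required": ["document_type", "languages", "mentions_water"],
}


def build_request(page_image: bytes, prompt: str, schema: dict) -> dict:
    """Package one rendered page, the prompt, and the desired output schema."""
    return {
        "prompt": prompt,
        "image_b64": base64.b64encode(page_image).decode("ascii"),
        "output_schema": schema,
    }


def extract_page(page_image: bytes, call_model: Callable[[dict], str]) -> dict:
    """One inference call per page; no OCR, deskew, or layout stage runs first."""
    request = build_request(
        page_image,
        "Classify this scanned page and list the languages present.",
        PAGE_SCHEMA,
    )
    raw = call_model(request)  # stand-in for the multimodal endpoint
    return json.loads(raw)


def fake_model(request: dict) -> str:
    """Offline stub; a real endpoint would interpret the image itself."""
    assert request["image_b64"], "request must carry the page image"
    return json.dumps(
        {"document_type": "borehole log", "languages": ["en", "ar"], "mentions_water": True}
    )


result = extract_page(b"\x89PNG placeholder page bytes", fake_model)
print(result["document_type"], result["languages"])
```

Passing the model call in as a function keeps the pipeline shape independent of any one vendor's client library, which also makes the re-validation step easier to test when the endpoint changes.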

In the MapAid groundwater pipeline:

"Rather than attempting OCR as a first step, the team reframed the problem as one of visual understanding: sending scanned page images directly to multimodal AI models that could interpret the content visually."

The pipeline classified 5,570 pages of mixed English/Arabic, decades-old, skewed, often handwritten geological documents — all through this pattern.

Mechanics¶

Image rendering. PDFs/TIFFs/JPGs are rendered to a canonical image format and stored in Unity Catalog Volumes as a versioned foundational dataset.
SQL-callable inference. ai_query takes the image column directly. No image-handling glue code in the pipeline. See patterns/sql-native-multimodal-llm-inference.
Schema-constrained output. The same inference call emits typed structured data — Dewey Decimal codes, geographies, water flag, well coordinates, depths, yields. See concepts/schema-constrained-llm-output.
Multimodal endpoint. Heavy extraction passes use the Foundation Model API for full-page OCR + entity recognition.
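Schema-constrained output still benefits from a typed check on the consuming side. The sketch below parses one model response into a typed record. The `WellRecord` class, its field names, and the units (`depth_m`, `yield_lps`) are assumptions made for illustration; the source does not state the actual schema, and the sample values are made up.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class WellRecord:
    """Hypothetical typed record for one extracted well."""
    latitude: float
    longitude: float
    depth_m: Optional[float]    # assumed unit: metres
    yield_lps: Optional[float]  # assumed unit: litres per second


def parse_well_record(raw: str) -> WellRecord:
    """Parse a model's JSON response and reject records that break the schema."""
    data = json.loads(raw)
    missing = {"latitude", "longitude"} - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    lat, lon = float(data["latitude"]), float(data["longitude"])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")

    def optional_float(name: str) -> Optional[float]:
        value = data.get(name)
        return None if value is None else float(value)

    return WellRecord(lat, lon, optional_float("depth_m"), optional_float("yield_lps"))


# Made-up response text, standing in for what the endpoint would return.
record = parse_well_record('{"latitude": 15.5, "longitude": 32.5, "depth_m": 48}')
print(record)
```

A range check like the one on coordinates catches a class of error that schema typing alone does not: a value of the right type that is physically impossible.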

When to use¶

Heterogeneous-quality scans. Skewed, multi-orientation, multi-script, mixed-format pages where classical OCR breaks.
Decades-old documents with no embedded text layer.
Handwritten content that classical OCR can't transcribe reliably.
Documents where the desired output is a classification or structured record (well coordinates, drug name + dosage, claim number + amount), not raw text.

When not to use¶

High-volume, uniform-quality scans where classical OCR achieves >99% accuracy at a fraction of multimodal cost. The multimodal call is overkill if Tesseract handles it.
Workflows that need OCR confidence scores per word. Multimodal models don't expose those reliably.
Real-time latency budgets. Multimodal inference takes on the order of 100 ms to several seconds per page, which is too slow for tight interactive or streaming budgets.

Tradeoffs¶

Cost. Multimodal inference is more expensive than classical OCR. This is addressable via intelligent sampling and a two-pass classify-then-extract design.
Hallucination. A model interpreting a blurry handwritten note can confidently fabricate text. Mitigation: inline LLM-as-judge scoring outputs against the source page.
Format / vendor lock-in. Rebuilding on a different model vendor means re-validating prompt + schema behaviour.
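The cost and hallucination mitigations can be combined in one loop: a cheap classification pass over every page, a heavy extraction pass over relevant pages only, and a judge score that gates what is accepted. The sketch below is one possible arrangement under stated assumptions. The function `two_pass`, the `min_score` threshold of 0.8, and the three stub callables are hypothetical, and the stubs read fake byte strings rather than real images.

```python
from typing import Callable, Iterable, List, Tuple

Page = Tuple[str, bytes]    # (page id, rendered image bytes)
Result = Tuple[str, dict]   # (page id, extracted record)


def two_pass(
    pages: Iterable[Page],
    classify: Callable[[bytes], bool],
    extract: Callable[[bytes], dict],
    judge: Callable[[bytes, dict], float],
    min_score: float = 0.8,  # assumed threshold, not taken from the source
) -> Tuple[List[Result], List[Result]]:
    """Return (accepted, flagged) records; irrelevant pages are skipped."""
    accepted: List[Result] = []
    flagged: List[Result] = []
    for page_id, image in pages:
        if not classify(image):       # pass 1: cheap relevance check on every page
            continue
        record = extract(image)       # pass 2: heavy extraction on relevant pages only
        score = judge(image, record)  # judge compares the output with the source page
        target = accepted if score >= min_score else flagged
        target.append((page_id, record))
    return accepted, flagged


# Offline stubs standing in for three model calls.
def is_relevant(image: bytes) -> bool:
    return b"water" in image


def extract_text(image: bytes) -> dict:
    return {"text": image.decode()}


def judge_score(image: bytes, record: dict) -> float:
    return 0.3 if b"blurry" in image else 0.95


pages = [("p1", b"water well log"), ("p2", b"cover page"), ("p3", b"water blurry note")]
accepted, flagged = two_pass(pages, is_relevant, extract_text, judge_score)
print([page_id for page_id, _ in accepted], [page_id for page_id, _ in flagged])
```

Flagged records are kept rather than dropped so that low-scoring pages can be routed to human review instead of silently disappearing from the dataset.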

Seen in¶

sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance. ~700 documents / 5,570 pages of decades-old, mixed-script, partially-handwritten Sudanese geological surveys classified + extracted via this pattern; 95% rated excellent/good by the inline judge.