PATTERN Cited by 1 source
Visual-First Document Extraction¶
Visual-First Document Extraction is the pipeline shape where scanned-document pages are rendered as images and sent directly to a multimodal LLM as the first processing step — not as post-processing-on-OCR-output. The model interprets the page visually and emits classifications, structured fields, or transcribed text in one inference call.
This is the operational flip of the conventional document pipeline: OCR is no longer a prerequisite stage; it is replaced by a multimodal inference call.
Problem¶
Classical document-processing pipelines look like:
Raw scan
→ deskew / rotate / enhance
→ OCR engine emits text + bounding boxes + confidence scores
→ language detection / script router
→ layout segmentation (paragraphs, tables, lists)
→ post-processing (NLP) over OCR'd text
→ classification / extraction model on text
→ output
Each stage has its own failure modes. Skewed or low-quality scans break OCR. Mixed-script documents (English + Arabic) require a language router. Handwritten field notes break OCR entirely. Tabular layouts confuse most layout segmenters. Each failure propagates downstream.
Solution¶
Render the page as an image. Send the image to a multimodal LLM with a prompt describing the desired output (classification tags / OCR'd text + entities / extracted JSON records). Consume the model's structured response.
The model handles deskew, language, layout, and content interpretation implicitly. Mixed scripts, handwritten notes, tabular data all flow through the same inference call.
In the MapAid groundwater pipeline:
"Rather than attempting OCR as a first step, the team reframed the problem as one of visual understanding: sending scanned page images directly to multimodal AI models that could interpret the content visually."
The pipeline classified 5,570 pages of mixed English/Arabic, decades-old, skewed, often handwritten geological documents — all through this pattern.
Mechanics¶
- Image rendering. PDFs/TIFFs/JPGs are rendered to a canonical image format and stored in Unity Catalog Volumes as a versioned foundational dataset.
- SQL-callable inference.
ai_querytakes the image column directly. No image-handling glue code in the pipeline. See patterns/sql-native-multimodal-llm-inference. - Schema-constrained output. The same inference call emits typed structured data — Dewey Decimal codes, geographies, water flag, well coordinates, depths, yields. See concepts/schema-constrained-llm-output.
- Multimodal endpoint. Heavy extraction passes use the Foundation Model API for full-page OCR + entity recognition.
When to use¶
- Heterogeneous-quality scans. Skewed, multi-orientation, multi-script, mixed-format pages where classical OCR breaks.
- Decades-old documents with no embedded text layer.
- Handwritten content that classical OCR can't transcribe reliably.
- Documents where the desired output is a classification or structured record (well coordinates, drug name + dosage, claim number + amount), not raw text.
When not to use¶
- High-volume, uniform-quality scans where classical OCR achieves >99% accuracy at a fraction of multimodal cost. The multimodal call is overkill if Tesseract handles it.
- Workflows that need OCR confidence scores per word. Multimodal models don't expose those reliably.
- Real-time latency budgets. Multimodal inference per page is 100ms–seconds, not microseconds.
Tradeoffs¶
- Cost. Multimodal inference is more expensive than classical OCR — addressable via intelligent sampling + two-pass classify- then-extract.
- Hallucination. A model interpreting a blurry handwritten note can confidently fabricate text. Mitigation: inline LLM-as-judge scoring outputs against the source page.
- Format / vendor lock-in. Rebuilding on a different model vendor means re-validating prompt + schema behaviour.
Seen in¶
- sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance. ~700 documents / 5,570 pages of decades-old, mixed-script, partially-handwritten Sudanese geological surveys classified + extracted via this pattern; 95% rated excellent/good by the inline judge.
Related¶
- concepts/multimodal-document-understanding
- concepts/schema-constrained-llm-output
- concepts/intelligent-document-sampling
- systems/databricks-ai-functions
- systems/databricks-foundation-model-api
- patterns/two-pass-classify-then-deep-extract
- patterns/sql-native-multimodal-llm-inference
- patterns/llm-judge-as-inline-pipeline-stage