Databricks — Unlocking the Archives: Turning Unstructured Documents into a Searchable Database for Groundwater Discovery
Summary
Databricks for Good partnered with MapAid
(a Stanford-founded nonprofit) and the Sudan Association for Archiving
Knowledge (SUDAAK) to turn ~700 scanned PDFs/TIFFs/JPGs (>5,000 pages) of
Sudanese geological surveys, well-drilling reports, and field studies into
a structured, searchable catalog that feeds MapAid's WellMapr groundwater
prediction models. The pipeline reframes OCR as a visual-understanding
problem rather than a text-extraction problem: pages are rendered as
images and sent directly to a multimodal model via Databricks AI
Functions (ai_query) for classification (Dewey Decimal
codes + Sudanese geographies + water-relevance flag), then water-flagged
documents are processed page-by-page with the
Foundation Model API for OCR
+ entity recognition, and finally re-summarised into JSON well/borehole
records via schema-constrained ai_query. A second AI model acts as an
inline judge, scoring every
classification on accuracy / completeness / consistency. The whole pipeline
ships as a Databricks Asset Bundle
runnable with one command, orchestrated as a Lakeflow Job on serverless
compute. First production run: 654 documents / 5,570 pages classified
in <3 hours, 95% rated excellent/good by the automated judge,
~50% identified as water-relevant, 299 structured well/borehole
records extracted.
Key takeaways
- Reframe OCR as visual understanding. Decades-old scans, mixed English/Arabic, skewed pages, and handwritten field notes ruled out traditional text extraction. The team rendered every page as an image and sent it directly to a multimodal model — sidestepping OCR brittleness by treating the page as a picture the model can interpret rather than text it must extract (Source: sources/2026-05-11-databricks-unlocking-the-archives). Canonicalised in patterns/visual-first-document-extraction.
- Intelligent sampling cuts cost ~70% without quality loss. Short documents are analysed in full; long documents are sampled from their most informative sections (title pages, introductions, conclusions). Page-level results aggregate up to document-level classifications. This reduced AI processing volume by >70% while preserving classification quality (Source: sources/2026-05-11-databricks-unlocking-the-archives). Canonicalised in concepts/intelligent-document-sampling.
- ai_query + schema-constrained output as SQL-native LLM inference. Databricks AI Functions (ai_query) natively support multimodal inputs and structured JSON output from inside SQL. The team iterates on prompts and output schemas without building separate model-serving infrastructure. Schema-constrained responses enforce consistent capture of site name / GPS / depth / water level / pump-test yield even when those fields appear in different formats across documents (Source: sources/2026-05-11-databricks-unlocking-the-archives). See concepts/schema-constrained-llm-output + patterns/sql-native-multimodal-llm-inference.
- Two-pass pipeline: classify cheaply, extract expensively only where it matters. First pass uses sampled pages + multimodal classification to tag every document. Second pass triggers full-page OCR + entity-anchored linking + JSON extraction only on water-flagged documents. This is a budget-allocation pattern: pay the expensive token cost only on the ~50% of the corpus that earns it (Source: sources/2026-05-11-databricks-unlocking-the-archives). Canonicalised in patterns/two-pass-classify-then-deep-extract.
- Entity recognition as cross-page anchor for record assembly. During full-page OCR the system extracts well/borehole identifiers as anchor entities. Records spanning multiple pages (coordinates on page 3, depth on page 7, yield in a summary table on page 12) are linked back to a single site by anchor. Extracted text from all pages is merged into a unified document representation, then re-processed in a second pass to emit JSON records (Source: sources/2026-05-11-databricks-unlocking-the-archives).
- LLM-as-judge baked into the pipeline as a first-class stage, not a post-hoc audit. A separate AI model (also via AI Functions) acts as judge: it scores every classification on a structured rubric (accuracy / completeness / consistency), producing both a categorical rating (excellent / good / fair / poor) and a written justification — "creating an auditable trail for every decision the pipeline makes." Documents below threshold are flagged for manual review. "In the first full run, only a small fraction of classifications required human attention." (Source: sources/2026-05-11-databricks-unlocking-the-archives). See concepts/llm-as-judge + patterns/llm-judge-as-inline-pipeline-stage.
- Single-command deployment via Databricks Asset Bundles. The whole pipeline (Unity Catalog Volumes for raw files, Delta Lake for outputs, Lakeflow Job orchestration on serverless compute, AI Functions for inference, judge model) is packaged as a Databricks Asset Bundle — deployable, updatable, and runnable with one command. "MapAid received a self-contained solution that can be maintained without expertise across multiple cloud services. Because the pipeline logic is decoupled from the specific archive it processes, the same system could be adapted to other water archives, other regions, or other domains." (Source: sources/2026-05-11-databricks-unlocking-the-archives).
- Operating envelope (first full run). 654 documents / 5,570 pages classified in <3 hours on serverless compute; 95% of classifications rated excellent/good by the inline judge; ~50% of the archive flagged as water-relevant; 299 structured well/borehole records extracted with location, depth, and yield. Replaces what "would have taken domain experts weeks or months." (Source: sources/2026-05-11-databricks-unlocking-the-archives).
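The cross-page record assembly described in the entity-recognition takeaway can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the team's implementation: the `PageEntity` shape, the field names, the first-value-wins merge rule, and the sample values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PageEntity:
    """One value found on one page, tagged with its anchor (hypothetical shape)."""
    anchor: str  # well/borehole identifier extracted during OCR
    page: int    # page number where the value was found
    name: str    # assumed field names, e.g. "coordinates", "depth_m"
    value: str


def assemble_records(entities):
    """Group page-level values by anchor; the earliest page wins on conflicts."""
    records = {}
    for ent in sorted(entities, key=lambda e: e.page):
        record = records.setdefault(ent.anchor, {"anchor": ent.anchor, "pages": []})
        record.setdefault(ent.name, ent.value)  # keep the first value seen
        if ent.page not in record["pages"]:
            record["pages"].append(ent.page)
    return records


# Illustrative, made-up values: one site spread over pages 3, 7, and 12.
entities = [
    PageEntity("BH-12", 3, "coordinates", "15.60N 32.53E"),
    PageEntity("BH-12", 7, "depth_m", "84"),
    PageEntity("BH-12", 12, "yield_lps", "2.5"),
    PageEntity("BH-07", 5, "depth_m", "40"),
]
records = assemble_records(entities)
```

Keeping the contributing page numbers on each record is a design choice for the sketch: it lets a reviewer trace any field back to the scan it came from.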
Systems extracted
- systems/databricks-ai-functions — ai_query: SQL-native LLM inference with multimodal input + structured JSON output. The primary inference primitive in this pipeline.
- systems/databricks-foundation-model-api — Multimodal model endpoint serving the OCR pass on water-flagged documents.
- systems/databricks-asset-bundles — Single-command deployment + packaging unit for the entire pipeline.
- systems/lakeflow-jobs — Orchestrator for the multi-stage pipeline on serverless compute.
- systems/unity-catalog-volumes — Object-storage substrate for the raw scanned PDFs / TIFFs / JPGs and their rendered page images. Versioned + governed.
- systems/unity-catalog — Governance layer over the volumes + Delta tables.
- systems/delta-lake — Storage substrate for pipeline output tables (page-level classifications, document-level aggregates, judge scores, extracted JSON records).
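The extracted JSON records that land in the output tables follow a fixed set of fields (site name, GPS, depth, water level, pump-test yield). The sketch below shows the kind of contract such a schema pins down, as a downstream check in plain Python. It does not reproduce how ai_query enforces schemas; the field names, types, and the rule that missing fields are allowed are assumptions for illustration.

```python
import json

# Hypothetical schema: field names and types are assumptions, not from the source.
WELL_SCHEMA = {
    "site_name": str,
    "latitude": float,
    "longitude": float,
    "depth_m": float,
    "water_level_m": float,
    "pump_test_yield": float,
}


def validate_record(raw_json, schema=WELL_SCHEMA):
    """Parse a model response and return (record, errors)."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for key, expected in schema.items():
        value = record.get(key)
        if value is None:
            continue  # assumed: not every report states every field
        # Accept whole numbers where a float is expected, but never booleans.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
            record[key] = value
        if not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    extra = sorted(set(record) - set(schema))
    errors.extend(f"unexpected field: {name}" for name in extra)
    return record, errors


record, errors = validate_record(
    '{"site_name": "Example site", "depth_m": 84, "yield": "high"}'
)
```

Here the integer depth is normalised to a float and the unknown `yield` key is reported, which is the consistency a schema-constrained response is meant to provide across format-variant documents.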
Concepts extracted
- concepts/multimodal-document-understanding — Treat scanned pages as images interpreted by a vision-language model rather than text extracted by OCR.
- concepts/intelligent-document-sampling — Sample title pages, introductions, and conclusions of long documents; process short documents in full. ~70% volume reduction at preserved quality.
- concepts/schema-constrained-llm-output — JSON-schema-enforced structured output from LLM calls; consistent field capture across format-variant documents.
- concepts/llm-as-judge — Inline-pipeline LLM evaluator scoring every model output against a rubric. (Existing concept; this article is a fresh instance specifically as a first-class pipeline stage rather than a post-hoc eval.)
- concepts/dewey-decimal-classification — Universal library classification system used as the categorical taxonomy for output tags.
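The sampling concept above (short documents in full, long documents from their most informative sections, page results rolled up to a document label) can be sketched as follows. The source does not state the cut-off for a "short" document, how informative sections are located, or how page labels are aggregated, so the `short_limit`, `head`, and `tail` values, the first-pages/last-pages heuristic, and the majority vote are all assumptions.

```python
from collections import Counter


def sample_pages(page_count, short_limit=10, head=3, tail=2):
    """Return the 1-based page numbers to classify.

    Assumed heuristic: the first pages stand in for title page and
    introduction, the last pages for conclusions.
    """
    if page_count <= short_limit:
        return list(range(1, page_count + 1))
    first = list(range(1, head + 1))
    last = list(range(page_count - tail + 1, page_count + 1))
    return first + last


def aggregate(page_labels):
    """Roll page-level labels up to one document label by majority vote."""
    if not page_labels:
        raise ValueError("no page labels to aggregate")
    return Counter(page_labels).most_common(1)[0][0]


short_doc = sample_pages(4)
long_doc = sample_pages(40)
label = aggregate(["water", "water", "geology"])
```

With these assumed defaults a 40-page report contributes five pages to the classification pass, which shows where the volume saving comes from; the actual saving depends on the length distribution of the archive.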
Patterns extracted
- patterns/visual-first-document-extraction — Skip OCR-as-first-step; send page images directly to a multimodal model.
- patterns/two-pass-classify-then-deep-extract — Cheap classification pass over the whole corpus on sampled pages → expensive full-page extraction only on the matched subset.
- patterns/llm-judge-as-inline-pipeline-stage — Judge model embedded in the pipeline as a first-class stage with categorical rating + written justification, gating documents into a manual-review queue below threshold.
- patterns/sql-native-multimodal-llm-inference — Use SQL-callable ai_query with structured-output schemas to iterate on prompts + shapes without standing up separate serving infra.
- patterns/asset-bundle-single-command-deployment — Pipeline packaged + versioned as a deployable bundle so domain partners can operate it without multi-cloud expertise.
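The gating step of the inline-judge pattern can be sketched in a few lines. The four rating labels come from the source; the `Verdict` shape, the choice of "good" as the threshold, and the sample verdicts are assumptions, since the post does not state where the threshold sits.

```python
from dataclasses import dataclass

RATINGS = ["poor", "fair", "good", "excellent"]  # ordered worst to best


@dataclass
class Verdict:
    doc_id: str
    rating: str         # one of RATINGS
    justification: str  # written explanation kept for the audit trail


def needs_review(verdict, threshold="good"):
    """True when the rating falls below the (assumed) threshold."""
    return RATINGS.index(verdict.rating) < RATINGS.index(threshold)


def route(verdicts, threshold="good"):
    """Split verdicts into accepted documents and a manual-review queue."""
    accepted, review = [], []
    for verdict in verdicts:
        (review if needs_review(verdict, threshold) else accepted).append(verdict)
    return accepted, review


# Made-up verdicts for illustration only.
verdicts = [
    Verdict("doc-001", "excellent", "Labels match the title page and summary."),
    Verdict("doc-002", "fair", "Geography tag is not supported by the sampled pages."),
]
accepted, review = route(verdicts)
```

Carrying the justification alongside the rating keeps the audit trail attached to the routing decision, so a reviewer sees why a document was queued.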
Operational numbers
| Dimension | Value |
|---|---|
| Documents classified (first run) | 654 |
| Pages classified (first run) | 5,570 |
| Wall-clock pipeline time | <3 hours |
| Quality rated excellent/good (by judge) | 95% |
| Archive flagged water-relevant | ~50% |
| Structured well/borehole records extracted | 299 |
| AI processing volume reduction from sampling | >70% |
| Manual review burden | "small fraction of classifications" (not quantified) |
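A few derived figures follow from the table by simple arithmetic. They are rough bounds rather than reported results: the run time is given only as "<3 hours", the water-relevant share only as "~50%", and, because of sampling, the number of pages actually sent to a model was probably lower than the page count.

```python
documents = 654
pages = 5_570
max_hours = 3       # reported as "<3 hours", so an upper bound on run time
records = 299
water_share = 0.50  # reported as "~50%", so approximate

pages_per_document = pages / documents                   # about 8.5
min_pages_per_hour = pages / max_hours                   # at least about 1,857
min_pages_per_second = min_pages_per_hour / 3600         # at least about 0.52
water_documents = water_share * documents                # about 327
records_per_water_document = records / water_documents   # about 0.91
```

The last ratio is an average only; the source does not say how the 299 records are distributed across documents.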
Caveats
- Tier-3 customer story. This is a Databricks blog post about a Databricks-for-Good partnership; it foregrounds platform features (AI Functions, Foundation Model API, Asset Bundles, Lakeflow, Unity Catalog, Delta Lake). Architectural detail is genuine but framing is product-marketing-adjacent. Read for the patterns (visual-first extraction, two-pass classify+extract, inline LLM judge), not as a benchmark of Databricks vs alternatives.
- No latency/cost-per-page numbers disclosed. "<3 hours for 5,570 pages" is the only throughput-like data point; the post gives no per-page token counts and no $/page, and it names neither the judge model nor the primary model.
- Sampling-strategy quality claim is unaudited. The ">70% volume reduction while preserving classification quality" claim is the team's own assessment via the judge model — which is itself part of the pipeline. There's no external held-out evaluation.
- "Small fraction" of judge-flagged manual-review documents is not quantified. No absolute number, no false-negative analysis on the 95% rated excellent/good.
- Generalisation claim is forward-looking, not measured. The "could be adapted to other water archives, other regions, or other domains" framing is a structural argument from the Asset Bundle decoupling — not evidence of cross-domain transfer.
- No internal Databricks platform details. Unlike the Superhuman 200K-QPS post or the Serverless Compute architecture post, this article describes a customer-facing pipeline assembled from existing platform pieces; it does not disclose how AI Functions, the Foundation Model API, or Asset Bundles are implemented.
Source
- Original: https://www.databricks.com/blog/unlocking-archives-turning-unstructured-documents-searchable-database-groundwater-discovery
- Raw markdown: raw/databricks/2026-05-11-unlocking-the-archives-turning-unstructured-documents-into-a-62276af3.md
Related
- companies/databricks
- systems/databricks-ai-functions
- systems/databricks-foundation-model-api
- systems/databricks-asset-bundles
- systems/unity-catalog-volumes
- systems/unity-catalog
- systems/delta-lake
- concepts/multimodal-document-understanding
- concepts/intelligent-document-sampling
- concepts/schema-constrained-llm-output
- concepts/llm-as-judge
- concepts/dewey-decimal-classification
- patterns/visual-first-document-extraction
- patterns/two-pass-classify-then-deep-extract
- patterns/llm-judge-as-inline-pipeline-stage
- patterns/sql-native-multimodal-llm-inference
- patterns/asset-bundle-single-command-deployment