SYSTEM Cited by 2 sources
Unity Catalog Volumes¶
Unity Catalog Volumes are governed, versioned object-storage locations registered as first-class catalog assets inside Unity Catalog. They are the place non-tabular files (PDFs, images, audio, video, ML model artifacts) live when you want them inside the same governance surface as Delta tables.
Stub page. One ingested source so far.
Architectural role¶
Volumes are how the MapAid groundwater pipeline handles its raw input: the ~700 scanned PDFs / TIFFs / JPGs (>5,000 pages) plus the per-page rendered images all live in Unity Catalog Volumes. "Each document's pages are rendered as images and stored in Unity Catalog Volumes, creating a clean, versioned foundational dataset." (Source: sources/2026-05-11-databricks-unlocking-the-archives)
The pipeline then reads from those Volumes via
ai_query — multimodal AI
Functions can take Volume-stored image references as input columns.
This composition (governed object storage → SQL-callable multimodal
inference) is what lets the document-classification pipeline run as
SQL/DataFrame jobs without standing up a separate file-handling
service.
Why Volumes (not raw cloud-object-store)¶
- Governance parity with Delta tables. Permissions, lineage, audit on the raw files live in the same UC surface as the output Delta tables.
- Versioning. "Clean, versioned foundational dataset" — the raw archive is treated as a versioned snapshot, not a mutable bucket.
- Pipeline-portability. The Asset Bundle references a Volume by name; pointing the bundle at a different archive is a config change, not a code change.
Seen in¶
-
sources/2026-05-22-databricks-how-world-bank-group-uses-databricks-to-eradicate-poverty-through-shared-knowledge — RAG-corpus substrate for a multi-domain knowledge platform. World Bank Group indexes "tens of millions of documents" (project documents, publications) into UC Volumes paired with Vector Search to power a retrieval-augmented-generation capability — "using Databricks Volumes and vector search, they indexed project documents to create a retrieval-augmented generation capability that could respond to natural language queries and thus save manual search." Operational scale: 3M document downloads / month through the AI-powered search-and-synthesis layer, half from low- and middle-income countries. Caveat: chunking strategy, embedding model, indexing cadence, and per-document metadata schema not disclosed. The Volumes-as-RAG-corpus shape generalises the prior MapAid pattern (Volumes-as-multimodal-input-substrate) to a Volumes-as-document-corpus-fronting-Vector-Search composition.
-
sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance. Stores raw scanned PDFs + per-page rendered images as the input substrate for the multimodal classification pipeline.