SYSTEM Cited by 2 sources

Unity Catalog Volumes

Unity Catalog Volumes are governed object-storage locations registered as first-class catalog assets inside Unity Catalog. They are where non-tabular files (PDFs, images, audio, video, ML model artifacts) live when they need to sit inside the same governance surface as Delta tables.
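Files in a Volume are addressed under a path of the form /Volumes/<catalog>/<schema>/<volume>/<path>. A minimal sketch of building such paths follows; the catalog, schema, volume, and file names are hypothetical placeholders, not names taken from any ingested source.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/<path> string."""
    for name in (catalog, schema, volume):
        # Reject empty names and names that would break the three-level layout.
        if not name or "/" in name:
            raise ValueError(f"invalid Unity Catalog name: {name!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Hypothetical names, used only for illustration.
example = volume_path("main", "archive", "raw_scans", "report_001", "page_0001.png")
print(example)
```

Keeping path construction in one helper means the catalog, schema, and volume stay explicit arguments instead of being scattered through string literals.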

Stub page. Two ingested sources so far.

Architectural role

Volumes are how the MapAid groundwater pipeline handles its raw input: the ~700 scanned PDFs / TIFFs / JPGs (>5,000 pages) plus the per-page rendered images all live in Unity Catalog Volumes. "Each document's pages are rendered as images and stored in Unity Catalog Volumes, creating a clean, versioned foundational dataset." (Source: sources/2026-05-11-databricks-unlocking-the-archives)

The pipeline then reads from those Volumes via ai_query — multimodal AI Functions can take Volume-stored image references as input columns. This composition (governed object storage → SQL-callable multimodal inference) is what lets the document-classification pipeline run as SQL/DataFrame jobs without standing up a separate file-handling service.
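The composition can be sketched as a small query builder. This is an illustration, not the MapAid pipeline's actual code: the endpoint name, prompt, and Volume directory are hypothetical, and the READ_FILES plus `ai_query(..., files => content)` form is an assumption about how image bytes are passed to a multimodal endpoint, to be checked against the current Databricks documentation.

```python
def sql_literal(value: str) -> str:
    """Wrap value in single quotes, refusing characters that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in SQL literal: {value!r}")
    return f"'{value}'"


def build_classification_sql(volume_dir: str, endpoint: str, prompt: str) -> str:
    """Return SQL that runs a multimodal endpoint over the files in a Volume directory."""
    if not volume_dir.startswith("/Volumes/"):
        raise ValueError("expected a /Volumes/<catalog>/<schema>/<volume>/... path")
    return (
        "SELECT path,\n"
        # Assumed form: the binary `content` column is passed as the file input.
        f"       ai_query({sql_literal(endpoint)}, {sql_literal(prompt)}, files => content) AS label\n"
        f"FROM READ_FILES({sql_literal(volume_dir)}, format => 'binaryFile')"
    )


sql = build_classification_sql(
    "/Volumes/main/archive/page_images/",  # hypothetical Volume directory
    "my-multimodal-endpoint",              # hypothetical serving endpoint
    "Classify this scanned page as map, table, or text.",  # hypothetical prompt
)
print(sql)  # in a Databricks notebook this string could be passed to spark.sql(sql)
```

Because the result is an ordinary query over a governed path, the same statement can run as a scheduled SQL or DataFrame job with no separate file service in between.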

Why Volumes (not raw cloud-object-store)

  • Governance parity with Delta tables. Permissions, lineage, and audit records for the raw files live in the same UC surface as the output Delta tables.
  • Versioning. The source calls the stored page images a "clean, versioned foundational dataset": the raw archive is treated as a versioned snapshot, not a mutable bucket. The quoted passage does not say how versions are tracked.
  • Pipeline-portability. The Asset Bundle references a Volume by name; pointing the bundle at a different archive is a config change, not a code change.
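The portability point in the list above can be sketched as follows. Asset Bundles are configured in YAML; this Python sketch only mimics the idea that the Volume name arrives as configuration. The target names, config keys, and Volume names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeRef:
    """Three-level Unity Catalog name of a Volume."""
    catalog: str
    schema: str
    volume: str

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.volume}"

    @property
    def root(self) -> str:
        return f"/Volumes/{self.catalog}/{self.schema}/{self.volume}"


# Hypothetical per-target settings; in a real bundle these would be variables.
TARGETS = {
    "groundwater": {"input_volume": "main.archive.groundwater_scans"},
    "other_archive": {"input_volume": "main.archive.other_scans"},
}


def resolve_input(target: str) -> VolumeRef:
    """Look up the configured Volume for a target and parse its dotted name."""
    catalog, schema, volume = TARGETS[target]["input_volume"].split(".")
    return VolumeRef(catalog, schema, volume)


def raw_input_dir(target: str) -> str:
    # Pipeline code only sees the resolved root; it never hardcodes an archive.
    return resolve_input(target).root + "/raw/"


print(raw_input_dir("groundwater"))
print(raw_input_dir("other_archive"))
```

Switching archives here means editing one entry in TARGETS; the functions that consume the path are unchanged.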

Seen in

  • sources/2026-05-22-databricks-how-world-bank-group-uses-databricks-to-eradicate-poverty-through-shared-knowledge: RAG-corpus substrate for a multi-domain knowledge platform. World Bank Group stores "tens of millions of documents" (project documents, publications) in UC Volumes and pairs them with Vector Search to power a retrieval-augmented-generation capability: "using Databricks Volumes and vector search, they indexed project documents to create a retrieval-augmented generation capability that could respond to natural language queries and thus save manual search." Operational scale: 3M document downloads per month through the AI-powered search-and-synthesis layer, half of them from low- and middle-income countries. Caveat: chunking strategy, embedding model, indexing cadence, and per-document metadata schema are not disclosed. This generalises the earlier MapAid pattern (Volumes as multimodal input substrate) to Volumes as a document corpus fronting Vector Search.

  • sources/2026-05-11-databricks-unlocking-the-archives: canonical wiki instance. Stores raw scanned PDFs and per-page rendered images as the input substrate for the multimodal classification pipeline.
