PATTERN Cited by 1 source
Governed Delta Tables Per Modality
Shape
Every input modality (genomics, imaging features, clinical notes, wearables, …) lands in its own set of governed Delta tables under one Unity Catalog governance surface, with modality-specific tooling (Glow, Mosaic AI Vector Search, NLP entity-extraction pipelines, Lakeflow SDP) layered above the substrate rather than defining a separate stack per modality.
Failure mode this pattern names and replaces
"A common failure mode in cloud deployments is a 'specialty store per modality' approach (for example: a FHIR store, a separate omics store, a separate imaging store, and a separate feature or vector store). In practice, that often means duplicated governance and brittle cross-store pipelines — making lineage, reproducibility, and multimodal joins much harder to operationalize." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)
The specialty-store-per-modality anti-pattern:
- N governance systems (one per store), each with its own ACL, audit trail, and tagging vocabulary.
- Up to N(N-1)/2 pairwise bridging pipelines for cross-modal joins (growth on the order of N²), each with its own copy of sensitive data.
- No unified lineage — you can trace a feature back to its source inside one modality's store, but not across modalities.
- Compliance burden multiplies with each new store added.
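The pipeline-count bullet can be made concrete with a little arithmetic. The sketch below assumes that every pair of specialty stores needs one bridging pipeline (a simplifying assumption; real deployments may need fewer, or one per direction) and compares that with one landing pipeline per modality into a shared substrate. The function names are illustrative and do not come from the source.

```python
def bridging_pipelines(n_stores: int) -> int:
    """Pairwise bridges between n specialty stores: n choose 2, which grows as O(n^2)."""
    return n_stores * (n_stores - 1) // 2


def landing_pipelines(n_modalities: int) -> int:
    """One ingestion pipeline per modality into a shared substrate: grows as O(n)."""
    return n_modalities
```

Under these assumptions, four stores need 6 bridges against 4 landing pipelines, and eight stores need 28 bridges against 8.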
How the pattern responds
- One substrate — Delta Lake over object storage. Every modality writes here.
- One governance surface — Unity Catalog tags (PHI / PII / study ID / cohort / …), row- and column-level policies, audit, and controlled sharing apply uniformly across modalities.
- One lineage graph — features and model inputs trace back across modalities to source datasets.
- Modality-specific tooling layered above, not beside. The post's four worked examples:
- Genomics → Glow (Spark-based VCF / BGEN / PLINK processing) → Delta tables of variants, biomarkers, annotations.
- Imaging → radiomics / deep-model-derived feature embeddings stored as Delta tables; Mosaic AI Vector Search serves similarity queries over the vectors. "Find similar phenotypes within glioblastoma" is the illustrative query.
- Clinical notes → NLP entity-extraction into Delta tables (med changes, symptoms, procedures, family history, timelines); raw text under stricter UC access controls; note-derived features join to imaging and omics.
- Wearables → Lakeflow SDP streaming tables + materialised views (@dp.table + @dp.materialized_view decorators) handle schema evolution, late-arriving events, and continuous aggregation.
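To illustrate what "one lineage graph" buys, here is a minimal sketch in plain Python. It is not Unity Catalog's API: lineage is modelled as a dictionary from each table to its direct upstream tables, and a traversal collects every source dataset behind a feature table. All table names are hypothetical.

```python
# Hypothetical lineage edges: table -> tables it was derived from.
LINEAGE = {
    "gold.patient_features": [
        "silver.imaging_embeddings",
        "silver.variants",
        "silver.note_entities",
    ],
    "silver.imaging_embeddings": ["bronze.imaging_features"],
    "silver.variants": ["bronze.vcf_raw"],
    "silver.note_entities": ["bronze.clinical_notes"],
}


def source_datasets(table, lineage):
    """Return the set of root tables (those with no upstream) reachable from `table`."""
    roots, seen, stack = set(), set(), [table]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        parents = lineage.get(node, [])
        if not parents:
            roots.add(node)  # no upstream edges: this is a source dataset
        stack.extend(parents)
    return roots
```

Because all modalities share one graph, a single traversal from `gold.patient_features` reaches the imaging, genomics, and notes sources. With one lineage system per store, the same question would need one query per store plus manual stitching.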
Why it works as a pattern
- ACID + time travel (Delta Lake) gives audit-quality reproducibility: consistent training sets and re-analysis against an old snapshot.
- One governance vocabulary — a single PHI-tag or study-ID predicate can scope all modalities at once; there is no per-store policy drift.
- Cross-modal joins are first-class — joining imaging embeddings to genomics variants to notes-derived symptoms is a SQL join, not a cross-system ETL.
- Lineage-across-modalities is free — features surface with their full upstream graph regardless of which modality's tooling produced them.
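The "SQL join, not a cross-system ETL" point can be sketched with a toy example. The code below uses Python's built-in `sqlite3` as a stand-in for governed Delta tables (an assumption made purely so the example runs anywhere); the table names, column names, and rows are invented for illustration and carry no clinical meaning.

```python
import sqlite3


def cross_modal_rows():
    """Join three per-modality tables on patient_id inside one database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE imaging_features (patient_id TEXT, tumor_volume_ml REAL);
        CREATE TABLE variants (patient_id TEXT, gene TEXT);
        CREATE TABLE note_entities (patient_id TEXT, symptom TEXT);
        INSERT INTO imaging_features VALUES ('p1', 32.5), ('p2', 12.0);
        INSERT INTO variants VALUES ('p1', 'IDH1'), ('p2', 'EGFR');
        INSERT INTO note_entities VALUES ('p1', 'headache'), ('p2', 'seizure');
        """
    )
    rows = conn.execute(
        """
        SELECT i.patient_id, i.tumor_volume_ml, v.gene, n.symptom
        FROM imaging_features AS i
        JOIN variants AS v ON v.patient_id = i.patient_id
        JOIN note_entities AS n ON n.patient_id = i.patient_id
        ORDER BY i.patient_id
        """
    ).fetchall()
    conn.close()
    return rows
```

The design point is that all three tables live behind one engine and one catalog, so the join is a single query. In the specialty-store layout, each `JOIN` would first require exporting one store's data into another.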
Forces
- Lakehouse lock-in. The pattern assumes the substrate can hold every modality well enough; extremely specialised modality stores (sub-millisecond imaging stores, transactional OLTP for some workloads) may still warrant exceptions.
- Modality-specific tooling must accept Delta as its native table format — that's why Databricks calls out Glow (native Delta outputs) and Lakeflow SDP (decorator-driven Delta tables) explicitly. Tooling that requires its own storage tier reintroduces the specialty-store problem.
- Governance must be strong enough to replace per-store controls. The post's framing ("PHI / PII / 28 CFR Part 202 / StudyID / …" tags, row/column-level controls, audit, lineage, controlled sharing) is effectively the minimum UC contract for the pattern to displace per-store governance.
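As a rough sketch of why a single tag vocabulary matters, the plain-Python model below applies one rule, "a column is visible only if the user is cleared for all of its tags", to tables from three modalities. This is not how Unity Catalog is invoked (it enforces policies through its own mechanisms such as row filters and column masks); the catalog contents, tag names, and function name here are hypothetical.

```python
# Hypothetical catalog: table -> {column: set of governance tags}.
CATALOG_TAGS = {
    "genomics.variants": {"patient_id": {"PHI"}, "gene": set()},
    "notes.entities": {"patient_id": {"PHI"}, "raw_text": {"PHI"}, "symptom": set()},
    "wearables.heart_rate": {"patient_id": {"PHI"}, "bpm": set()},
}


def visible_columns(table, user_clearances, catalog=CATALOG_TAGS):
    """Columns whose tags are all covered by the user's clearances."""
    return [
        column
        for column, tags in catalog[table].items()
        if tags <= user_clearances  # subset test: every tag must be cleared
    ]
```

A user with no clearances sees only untagged columns in every modality, and granting `{"PHI"}` opens the tagged columns everywhere at once. The rule is written once, which is the property that per-store ACLs cannot offer.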
Consequences
- "Fewer data copies and fewer one-off pipelines." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)
- Cross-modal features become a query over one catalog rather than an integration project.
- Compliance posture is defined once at the catalog and inherited by every modality.
- New modalities are added by "land in Delta + tag under UC + expose via modality-specific tooling" — an additive move rather than a new stack.
Related patterns in the wiki
- concepts/data-lakehouse — the substrate class the pattern assumes.
- concepts/medallion-architecture — the bronze/silver/gold progression typically layered inside each modality's table set.
- patterns/telemetry-to-lakehouse — the analogous pattern for AI telemetry, with Unity AI Gateway + OpenTelemetry → UC-managed Delta tables.
- patterns/fusion-strategy-selection-by-deployment-reality — the modelling-side decision that assumes this lakehouse-substrate pattern holds underneath.
Seen in
- sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai — canonical instance. Databricks names the specialty-store-per-modality anti-pattern explicitly and positions governed Delta tables under Unity Catalog as the unifying substrate, with Glow (genomics), Mosaic AI Vector Search (imaging), NLP pipelines (notes), and Lakeflow SDP (wearables) as the four modality-specific tools layered above.