
PATTERN Cited by 1 source

Governed Delta Tables Per Modality

Shape

Every input modality (genomics, imaging features, clinical notes, wearables, …) lands in its own set of governed Delta tables under one Unity Catalog governance surface, with modality-specific tooling (Glow, Mosaic AI Vector Search, NLP entity-extraction pipelines, Lakeflow SDP) layered above the substrate rather than defining a separate stack per modality.

Failure mode this pattern names and replaces

"A common failure mode in cloud deployments is a 'specialty store per modality' approach (for example: a FHIR store, a separate omics store, a separate imaging store, and a separate feature or vector store). In practice, that often means duplicated governance and brittle cross-store pipelines — making lineage, reproducibility, and multimodal joins much harder to operationalize." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)

The specialty-store-per-modality anti-pattern:

  • N governance systems (one per store), each with its own ACL, audit trail, and tagging vocabulary.
  • N² bridging pipelines for cross-modal joins, each with its own copy of sensitive data.
  • No unified lineage — you can trace a feature back to its source inside one modality's store, but not across modalities.
  • Compliance burden multiplies with each new store added.
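
The N² growth in bridging pipelines can be checked by simple counting. The sketch below assumes one pipeline per ordered pair of stores (an illustrative convention; counting unordered pairs halves the number but the growth stays quadratic) and compares that with one ingestion path per modality into a shared substrate. The function names are hypothetical, not taken from the source.

```python
def bridging_pipelines(n_stores: int, directed: bool = True) -> int:
    """Pipelines needed to connect every pair of specialty stores.

    Assumption: each pair of stores needs its own bridge, one per
    direction when `directed` is True.
    """
    ordered_pairs = n_stores * (n_stores - 1)
    return ordered_pairs if directed else ordered_pairs // 2


def shared_substrate_pipelines(n_modalities: int) -> int:
    """One ingestion path per modality into a single shared substrate."""
    return n_modalities


# Compare the two approaches as the number of modalities grows.
for n in (2, 4, 8):
    print(n, bridging_pipelines(n), shared_substrate_pipelines(n))
```

With the four stores named in the quote (FHIR, omics, imaging, feature/vector), this counting gives 12 directed bridges against 4 ingestion paths.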

How the pattern responds

  • One substrate: Delta Lake over object storage. Every modality writes here.
  • One governance surface: Unity Catalog tags (PHI / PII / study ID / cohort / …), row- and column-level policies, audit, and controlled sharing apply uniformly across modalities.
  • One lineage graph — features and model inputs trace back across modalities to source datasets.
  • Modality-specific tooling layered above, not beside. The post's four worked examples:
      • Genomics → Glow (Spark-based VCF / BGEN / PLINK processing) → Delta tables of variants, biomarkers, annotations.
      • Imaging → radiomics / deep-model-derived feature embeddings stored as Delta tables; Mosaic AI Vector Search serves similarity queries over the vectors. "Find similar phenotypes within glioblastoma" is the illustrative query.
      • Clinical notes → NLP entity-extraction into Delta tables (med changes, symptoms, procedures, family history, timelines); raw text under stricter UC access controls; note-derived features join to imaging and omics.
      • Wearables → Lakeflow SDP streaming tables + materialised views (@dp.table + @dp.materialized_view decorators) handle schema evolution, late-arriving events, and continuous aggregation.
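
To make the imaging example concrete, here is a minimal sketch of what a "find similar phenotypes" query computes: a brute-force cosine-similarity ranking over stored embeddings. It is a stand-in written in plain Python, not the Mosaic AI Vector Search API; the patient IDs, embedding values, and function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, rows, k=2):
    """Rank (patient_id, embedding) rows by similarity to `query`."""
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical rows, as they might be read from an embeddings table.
imaging_embeddings = [
    ("p1", [1.0, 0.0, 0.0]),
    ("p2", [0.9, 0.1, 0.0]),
    ("p3", [0.0, 1.0, 0.0]),
]

print(top_k_similar([1.0, 0.0, 0.0], imaging_embeddings, k=2))
```

A production vector index would replace the linear scan with an approximate nearest-neighbour structure; the point here is only that the vectors live in ordinary governed tables and the query is a ranking over them.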

Why it works as a pattern

  • ACID transactions plus time travel (Delta Lake) give reproducibility: consistent training sets, re-analysis against an old snapshot, and results that can be re-derived for an audit.
  • One governance vocabulary — a single PHI-tag or study-ID predicate can scope all modalities at once; there is no per-store policy drift.
  • Cross-modal joins are first-class — joining imaging embeddings to genomics variants to notes-derived symptoms is a SQL join, not a cross-system ETL.
  • Lineage-across-modalities is free — features surface with their full upstream graph regardless of which modality's tooling produced them.
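
The "SQL join, not a cross-system ETL" point can be sketched with three per-modality tables that share a patient key. The example uses Python's built-in sqlite3 as a stand-in for Delta tables; the table names, columns, and values are hypothetical. The `WHERE study_id = ?` clause stands in for the single study-ID predicate, which under Unity Catalog would be enforced by policy rather than written into each query.

```python
import sqlite3

# In-memory database standing in for one shared, governed substrate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imaging_features (patient_id TEXT, study_id TEXT, tumor_volume_ml REAL);
CREATE TABLE genomic_variants (patient_id TEXT, study_id TEXT, gene TEXT);
CREATE TABLE note_symptoms    (patient_id TEXT, study_id TEXT, symptom TEXT);

INSERT INTO imaging_features VALUES ('p1', 'S1', 12.5), ('p2', 'S2', 30.1);
INSERT INTO genomic_variants VALUES ('p1', 'S1', 'IDH1'), ('p2', 'S2', 'EGFR');
INSERT INTO note_symptoms    VALUES ('p1', 'S1', 'headache'), ('p2', 'S2', 'seizure');
""")


def cohort_rows(study_id):
    """Join all three modalities for one study in a single query."""
    query = """
        SELECT i.patient_id, i.tumor_volume_ml, g.gene, n.symptom
        FROM imaging_features AS i
        JOIN genomic_variants AS g
          ON g.patient_id = i.patient_id AND g.study_id = i.study_id
        JOIN note_symptoms AS n
          ON n.patient_id = i.patient_id AND n.study_id = i.study_id
        WHERE i.study_id = ?
        ORDER BY i.patient_id
    """
    return conn.execute(query, (study_id,)).fetchall()


print(cohort_rows("S1"))
```

Because every modality sits in the same catalog, the same predicate scopes all three tables at once, and no data is copied between systems to answer the question.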

Forces

  • Lakehouse lock-in. The pattern assumes the substrate can hold every modality well enough; extremely specialised modality stores (sub-millisecond imaging stores, OLTP databases for some transactional workloads) may still warrant exceptions.
  • Modality-specific tooling must accept Delta as its native table format — that's why Databricks calls out Glow (native Delta outputs) and Lakeflow SDP (decorator-driven Delta tables) explicitly. Tooling that requires its own storage tier reintroduces the specialty-store problem.
  • Governance must be strong enough to replace per-store controls. The post's framing ("PHI / PII / 28 CFR Part 202 / StudyID / …" tags, row/column-level controls, audit, lineage, controlled sharing) is effectively the minimum UC contract for the pattern to displace per-store governance.

Consequences

Seen in
