Databricks — Clinical operations intelligence belongs on the Lakehouse

Databricks Blog post (2026-05-13) announcing the release of the Site Feasibility Workbench as a fully open-source Databricks App, pitched as a reference implementation of "clinical operations intelligence when the application, the models, and the data live on the same platform." Tier-3 vendor-blog source. Borderline-include: framed as a product launch for a clinical-trials use case, but the architecturally load-bearing sections ("The Architecture Argument" + "The Auditability Argument") make a specific design claim worth canonicalising: that running the application inside the data platform workspace eliminates three integration layers (operational-DB sync pipeline, credential rotation surface, semantic-harmonization drift) and unlocks a cleaner ML-audit substrate.

One-paragraph summary

The conventional decision-support architecture has an analytical warehouse, a separate operational application database, a synchronization pipeline between them, and a web tier on top. Each layer adds "integration overhead, credential surface area, and a synchronization lag that erodes trust in the data the application shows." The post argues a single-platform shape eliminates these layers: Databricks Apps runs the web app inside the workspace, authenticating as a first-class workspace service principal and querying Unity Catalog over internal connections via the SQL Statement API; systems/lakebase is the operational database layer (managed Postgres, scale-to-zero, provisioned and credentialed by the workspace identity system); AI/BI Genie gives natural-language access to the same governed Unity Catalog tables the ML models trained on. The result is "a clinical operations application that makes no external API calls, maintains no separate operational database infrastructure, and requires no synchronization pipeline between the analytical and operational layers." On the ML side, the post canonicalises a regulated-ML audit shape: every prediction emits a SHAP attribution stored as a governed Unity Catalog Delta table, versioned in MLflow, lineaged through Unity Catalog, and queryable in SQL, so that "the rationale behind a site selection is as auditable as the score itself." The reference implementation (Site Feasibility Workbench) is a six-step guided workflow with TA-segmented LightGBM models trained on the sponsor's own CTMS / EDC / IRT history (institutional knowledge as training data, not industry averages); FastAPI backend + React frontend; deploys into a Databricks workspace with Unity Catalog in "approximately 30 minutes of technical deployment time."
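To make the audit claim concrete, a per-prediction rationale lookup can be an ordinary SQL query over a long-format attribution table. The sketch below is a minimal illustration that uses an in-memory SQLite database as a stand-in for a governed Unity Catalog Delta table; the table layout, column names, model version string, and all values are hypothetical and not taken from the post. It also checks SHAP's additivity property (base value plus the sum of attributions equals the model output), which is what allows a stored attribution set to be reconciled against the stored score.

```python
import sqlite3

# Hypothetical long-format layout: one row per (prediction, feature).
SCHEMA = """
CREATE TABLE predictions (
    prediction_id TEXT PRIMARY KEY,
    site_id       TEXT,
    model_version TEXT,   -- which registered model version produced the score
    base_value    REAL,   -- SHAP expected value for that model
    score         REAL    -- model output, in the same units as the SHAP values
);
CREATE TABLE shap_attributions (
    prediction_id TEXT,
    feature       TEXT,
    shap_value    REAL
);
"""

# "Why did this site get this score?" ranked by size of contribution.
RATIONALE_SQL = """
SELECT feature, shap_value
FROM shap_attributions
WHERE prediction_id = ?
ORDER BY ABS(shap_value) DESC
"""

# Additivity check: base value + sum of attributions should equal the score.
RECONCILE_SQL = """
SELECT p.prediction_id,
       p.score,
       p.base_value + SUM(a.shap_value) AS reconstructed
FROM predictions p
JOIN shap_attributions a ON a.prediction_id = p.prediction_id
GROUP BY p.prediction_id, p.score, p.base_value
"""


def build_demo_db():
    """Create the stand-in tables and load one made-up prediction."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
        ("pred-1", "site-A", "v3", 0.5, 1.25),
    )
    conn.executemany(
        "INSERT INTO shap_attributions VALUES (?, ?, ?)",
        [
            ("pred-1", "prior_enrolment_rate", 1.0),
            ("pred-1", "startup_days", -0.5),
            ("pred-1", "investigator_experience", 0.25),
        ],
    )
    return conn


def explain(conn, prediction_id):
    """Return (feature, shap_value) rows for one prediction, largest first."""
    return conn.execute(RATIONALE_SQL, (prediction_id,)).fetchall()


def unreconciled(conn, tolerance=1e-9):
    """Return prediction ids whose attributions do not add up to the score."""
    rows = conn.execute(RECONCILE_SQL).fetchall()
    return [pid for pid, score, rebuilt in rows if abs(score - rebuilt) > tolerance]
```

Calling `explain(build_demo_db(), "pred-1")` lists the three made-up features ordered by absolute contribution, and `unreconciled(...)` returns an empty list when every stored score matches its attributions. In a real deployment the same two queries would presumably run against the governed tables through a SQL warehouse rather than SQLite.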

Key takeaways

  1. The thesis is "applications should live where their data and models live": a single-platform application architecture that collapses the standard analytical-DB + operational-DB + sync-pipeline + web-tier stack into one workspace-resident composition. "The conventional approach to clinical decision-support looks like this: analytical data lives in a data warehouse or Lakehouse. A separate application database holds operational state. A pipeline keeps them loosely synchronized. A web application sits in front of both… Every layer introduces integration overhead, credential surface area, and a synchronization lag that erodes trust in the data the application shows." "Databricks Apps, Lakebase, and AI/BI Genie eliminate each of those layers — not by abstracting them away but by making them unnecessary." Canonical wiki instance: the first source naming this architectural shape explicitly.

  2. Databricks Apps runs the web tier inside the workspace boundary. "Databricks Apps run the web application inside the workspace. The app authenticates as a first-class workspace service principal, queries Unity Catalog via the SQL Statement API, and calls AI/BI Genie over the workspace REST API — all on internal connections. Clinical operations data never crosses a workspace boundary. The app inherits Unity Catalog access controls without any additional configuration." Three structural properties: (a) service-principal auth as the app's identity (not a per-user OAuth surface), (b) SQL Statement API as the query path against UC tables (not a JDBC driver shipping credentials around), (c) REST API to Genie over internal connections (not a public Genie endpoint). The unifying claim is "the app inherits Unity Catalog access controls without any additional configuration": UC's row-filter / column-mask / tag-driven-policy machinery composes onto the app for free, so per-user PHI handling rides on the catalog's HIPAA Safe Harbor / Expert Determination posture configured at the catalog or schema level. First wiki disclosure of Databricks Apps as a deployment model.

  3. systems/lakebase is the operational-DB layer with no separate credential surface. "Lakebase is the operational database layer — managed PostgreSQL that scales to zero when idle, provisioned and credentialed entirely within the workspace identity system. Where a traditional application would require a separately managed RDS instance with its own schema drift, sync jobs, and credential rotation, Lakebase is in the same platform where the data and models live." The Site Feasibility Workbench uses Lakebase to persist saved shortlists for team sharing (the operational state that doesn't belong in UC's analytical tables). New face for Lakebase on the wiki: not just OLTP-companion-to-the-Lakehouse but app-tier-state-store-without-its-own-credential-surface, composing with Databricks Apps via the shared workspace identity system.

  4. AI/BI Genie is the embedded NL-query layer in the application workflow, not a separate UI. "AI/BI Genie closes the last gap: natural language access to governed data, embedded directly in the application workflow. Study managers ask questions in plain English against the same Unity Catalog tables the ML models trained on, with the same access controls applied." The architectural shift is from "Genie room as separate web product" (the Trinity Industries pattern documented in the 2026-04-29 source) to "Genie embedded in a domain-specific app via the workspace REST API", a new face for Genie. The composition property is "the same Unity Catalog tables the ML models trained on": Genie, the ML training pipeline, and the UC governance policies all evaluate against the same catalog, so a study manager asking "why was this site recommended?" gets answers grounded in the same data the model used.

  5. SHAP attributions stored as governed Unity Catalog Delta tables make ML rationale as auditable as the prediction. Canonical architectural contribution. "Every prediction carries a SHAP attribution stored as a governed Unity Catalog Delta table — versioned in MLflow, lineaged through Unity Catalog, queryable — the rationale behind a site selection is as auditable as the score itself. A clinical affairs team can answer a question from a data monitoring committee with a SQL query, not a black-box vendor report." Three properties the storage substrate inherits: (a) versioning via MLflow (which model version produced which attribution), (b) lineage via Unity Catalog (which training data informed which feature contribution), (c) SQL queryability (a regulator's question becomes a SELECT, not a vendor support ticket). Canonical instance of patterns/shap-attribution-as-governed-delta-table and concepts/governed-shap-attribution-table.

  6. Explainability is reframed as a fairness control, not just a regulatory artifact. "Because every prediction is decomposed into governed SHAP attributions, sponsors can audit recommendations for systematic under-weighting of community sites, minority-serving institutions, or first-time investigators — turning explainability into a fairness control." The post links this back to FDA's Diversity Action Plan expectations under FDORA 2022. The architectural enabler is the queryable-attribution substrate: systematic under-weighting can only be detected when attributions are stored as a queryable population, not as opaque per-prediction responses. The wiki's first source making this fairness-via-SQL claim.

  7. Institutional ML data flips industry-average baselines into sponsor-specific signal. "The standard industry approach to site feasibility relies on commercial scoring products from vendors or CRO-provided analytics platforms. Those tools are built on aggregated industry data — useful as a baseline, but blind to the specifics of your portfolio. A sponsor with a decade of CTMS, EDC, and IRT history carries significant signals about how their sites perform on their protocols." The compounding-return claim: "every new study makes the prediction better, and every new site relationship is reflected in the next training run." Models are TA-segmented LightGBM (therapeutic-area segmentation as a structural feature, not a generic random forest). Public-data layer: CMS Open Payments is added as a public signal that, "when used appropriately, correlates with research engagement and infrastructure and it is freely available."

  8. Regulated ML lineage is explicitly grounded in three regulatory artefacts. The post cites 21 CFR Part 11 (electronic records / signatures), ICH E6(R3) (good clinical practice guideline), and FDA's Good Machine Learning Practice (GMLP) guidance as the regulatory framing that makes "model explainability and data governance material considerations, not optional features." The architectural substrate (MLflow + UC + governed SHAP Delta tables) is positioned as the implementation of these guidances. Wiki's first source canonicalising the regulatory-driver-for-ML-audit-substrate framing.

  9. The medallion architecture composes the platform primitives end-to-end. Figure 1 in the post lays out the unified stack: external sources (CTMS / EDC / IRT) ingest via Lakeflow (Bronze → Silver → Gold); Mosaic AI trains AI/ML models and writes versioned predictions back to Unity Catalog; SQL Warehouse + Lakebase + AI/BI Genie all serve the Databricks App; the App runs inside the platform boundary "with all connections internal." Canonical wiki instance of the end-to-end medallion-architecture-into-app-tier composition where every layer (storage / governance / training / inference / NL-query / app) lives in one workspace.

  10. Operational numbers are sparse on the engineering side and industry-heavy on the clinical-trial side. Engineering: Site Feasibility Workbench "deploys into an existing Databricks workspace with Unity Catalog in approximately 30 minutes of technical deployment time, before sponsor-specific security review and validation." Clinical-industry numbers (cited in the framing, not the architecture): 37% of activated sites under-enrol targets, 11% enrol zero patients, 53% of trials exceed enrolment timelines, 1 in 6 trials take more than twice as long as planned, "up to $500,000 per day in unrealized drug sales and $40,000 per day in direct trial costs" per Tufts CSDD. The post frames the platform pitch as an architecture-level intervention: "That combined underperformance rate has remained essentially flat for at least two decades. The tools are not the problem. The architecture is."
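The fairness takeaway above depends on attributions being stored as a queryable population rather than returned one prediction at a time. As a rough illustration, the sketch below aggregates attributions by site category to surface features that, on average, push one category's scores down. It uses an in-memory SQLite database as a stand-in for governed Delta tables; the `sites` and `shap_attributions` layouts, the category labels, the feature names, and every number are hypothetical and do not come from the post.

```python
import sqlite3

# Population-level audit: mean contribution of each feature per site category,
# keeping only (category, feature) pairs whose mean falls below a threshold.
AUDIT_SQL = """
SELECT s.site_category,
       a.feature,
       AVG(a.shap_value) AS mean_contribution,
       COUNT(*)          AS n_predictions
FROM shap_attributions a
JOIN sites s ON s.site_id = a.site_id
WHERE a.model_version = ?
GROUP BY s.site_category, a.feature
HAVING AVG(a.shap_value) < ?
ORDER BY mean_contribution ASC
"""


def build_demo_db():
    """Create stand-in tables with made-up sites and attributions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sites (site_id TEXT PRIMARY KEY, site_category TEXT);
        CREATE TABLE shap_attributions (
            site_id TEXT, model_version TEXT, feature TEXT, shap_value REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO sites VALUES (?, ?)",
        [("s1", "community"), ("s2", "community"),
         ("s3", "academic"), ("s4", "academic")],
    )
    conn.executemany(
        "INSERT INTO shap_attributions VALUES (?, ?, ?, ?)",
        [
            ("s1", "v3", "prior_trial_count", -0.5),
            ("s2", "v3", "prior_trial_count", -0.25),
            ("s3", "v3", "prior_trial_count", 0.5),
            ("s4", "v3", "prior_trial_count", 0.25),
            ("s1", "v3", "startup_days", 0.25),
            ("s2", "v3", "startup_days", 0.25),
            ("s3", "v3", "startup_days", -0.125),
            ("s4", "v3", "startup_days", 0.125),
            ("s1", "v2", "prior_trial_count", -1.0),  # older model, filtered out
        ],
    )
    return conn


def flag_negative_drivers(conn, model_version, threshold=0.0):
    """Return (category, feature, mean, n) rows with mean below the threshold."""
    return conn.execute(AUDIT_SQL, (model_version, threshold)).fetchall()
```

With this toy data, `flag_negative_drivers(build_demo_db(), "v3")` returns a single row showing that `prior_trial_count` lowers scores for the hypothetical `community` category on average. A negative mean contribution is only a prompt for review, not evidence of unfairness on its own; a feature can legitimately differ between site categories.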

Architecture-shape summary

┌─────────────────────────────────────────────────────────────┐
│ Databricks Workspace (single boundary, all internal links)  │
│                                                             │
│  ┌───────────────┐                                          │
│  │ External      │   Lakeflow                               │
│  │ sources       │ ──────────► Bronze ── Silver ── Gold     │
│  │ (CTMS/EDC/IRT)│   (ingest)            ↓                  │
│  └───────────────┘                       │                  │
│                          ┌───────────────┴────────────┐     │
│                          │ Unity Catalog              │     │
│                          │ (governance + lineage      │     │
│                          │ + access controls)         │     │
│                          └────┬─────────────────┬─────┘     │
│                               │                 │           │
│                  ┌────────────┴────┐  ┌─────────┴────────┐  │
│                  │ Mosaic AI       │  │ SQL Warehouse    │  │
│                  │ + MLflow        │  │ AI/BI Genie      │  │
│                  │ (train/version) │  │ (NL query)       │  │
│                  └────────┬────────┘  └─────────┬────────┘  │
│                           │ predictions          │          │
│                           │ + SHAP-attribution   │          │
│                           ▼ Delta tables         │          │
│                  ┌──────────────────────────────┴────────┐  │
│                  │ Databricks App (FastAPI + React)      │  │
│                  │ - service-principal auth              │  │
│                  │ - SQL Statement API → UC              │  │
│                  │ - REST API → Genie                    │  │
│                  │ - Lakebase for app state              │  │
│                  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                  No external API calls.
                  No separate operational DB infrastructure.
                  No analytical↔operational sync pipeline.
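
The "SQL Statement API → UC" path in the diagram can be sketched as a plain HTTPS request. The example below only builds a request for the Databricks SQL Statement Execution endpoint (`POST /api/2.0/sql/statements`) and does not send it. The workspace host, warehouse ID, table name, and the way the service-principal token is obtained are assumptions for illustration; the post does not show the Workbench's actual client code, and `build_statement_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_statement_request(host, token, warehouse_id, statement, parameters=None):
    """Build (but do not send) a SQL Statement Execution API request.

    `parameters` maps named markers used in the statement (":name") to values;
    values are sent as strings.
    """
    body = {
        "warehouse_id": warehouse_id,
        "statement": statement,
        "wait_timeout": "30s",  # wait synchronously for up to 30 seconds
    }
    if parameters:
        body["parameters"] = [
            {"name": name, "value": str(value)}
            for name, value in parameters.items()
        ]
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values throughout; the token would come from the app's
# service-principal credentials, which are out of scope for this sketch.
request = build_statement_request(
    host="example.cloud.databricks.com",
    token="<service-principal-token>",
    warehouse_id="<warehouse-id>",
    statement=(
        "SELECT site_id, score FROM clinical.gold.site_scores "
        "WHERE therapeutic_area = :ta"
    ),
    parameters={"ta": "oncology"},
)
# urllib.request.urlopen(request) would submit the statement; omitted here.
```

Because the query runs under the app's identity, whatever Unity Catalog grants, row filters, or column masks apply to that principal would apply to the result, which is the property the post relies on when it says the app inherits access controls without extra configuration.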

Caveats

  • Architecture density is ~30% of body, with the remaining ~70% spent on clinical-trial industry context (Tufts CSDD numbers, CRO vendor critique, FDORA 2022 / Diversity Action Plan, four-module Clinical Operations Intelligence Hub roadmap). The wiki ingests the architectural slice; the clinical-industry framing is noted but not canonicalised.
  • No latency / throughput / scaling numbers are disclosed for any of the platform primitives in this post. "Approximately 30 minutes of technical deployment time" is the only quantified engineering metric; SQL Statement API latency, Genie REST API call latency, Lakebase scale-to-zero cold-start time, and SHAP-attribution Delta-table query performance at audit-population scale all go unreported.
  • Mosaic AI mentioned in passing as the model-training layer; not given mechanism-level disclosure sufficient to canonicalise as a wiki system page. Reserved for a future Mosaic-AI-internals source.
  • Site Feasibility Workbench is the first public release of a broader platform — Patient Cohort and Recruitment, Enrollment Velocity Optimizer, and Risk-Based Monitoring and Compliance are named as future modules but not detailed. "All four deploy as Databricks Apps. All four query Unity Catalog directly. None make external API calls."
  • The post is structurally a product launch for the open-source Site Feasibility Workbench, a sibling genre to the Tier-3 Databricks-Genie-per-vertical product-PR cluster the wiki has been skipping (healthcare-OR / life-sciences-RWE / wealth-management / telecom-churn / manufacturing-quality / retail-markdown). The distinguishing factor that flips this post into in-scope: the "Architecture Argument" + "Auditability Argument" sections name specific platform primitives and articulate a concrete architectural thesis (single-platform composition of app + UC + Lakebase + Genie + MLflow + SHAP Delta tables), where the skipped Genie-per-vertical cluster lacked any platform-internals disclosure.

Source
