
SYSTEM Cited by 1 source

Site Feasibility Workbench

The Site Feasibility Workbench is an open-source clinical-trial site-selection application released by Databricks in May 2026. It is pitched as the first public reference implementation of a Databricks App composed onto Lakebase, Unity Catalog, AI/BI Genie, and MLflow.

Stub page. The wiki tracks this system primarily as a reference implementation of the single-platform application architecture thesis, not for the clinical-trials domain content. Source repository: databricks-industry-solutions/site-feasibility-workbench-open.

What it does

The workbench walks users through a six-step guided workflow for clinical-trial site selection:

  1. Protocol selection — therapeutic area / phase / inclusion-and-exclusion criteria.
  2. Score constraints — diversity weighting, minimum-enrollment thresholds, geographic preferences.
  3. Geographic overview — site distribution map.
  4. Site ranking — composite feasibility scores from TA-segmented LightGBM models.
  5. SHAP-driven site deep dive — per-prediction feature attribution for each candidate site.
  6. Final shortlist — saved shortlists persisted to Lakebase for team sharing.

Cross-cutting: AI/BI Genie answers cross-domain natural-language questions against the same Unity Catalog tables the ML models trained on.
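
Steps 2 and 4 can be pictured as a filter followed by a sort. The sketch below is a minimal illustration under stated assumptions, not the workbench's code: the `Site` fields, the `diversity_weight` blend, and the sample values are all invented for the example, and `model_score` stands in for whatever a therapeutic-area model would output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    site_id: str
    region: str
    expected_enrollment: int   # hypothetical enrollment forecast
    model_score: float         # stand-in for a model's feasibility score, 0..1
    diversity_score: float     # hypothetical diversity indicator, 0..1


def rank_sites(sites, min_enrollment, regions, diversity_weight):
    """Apply hard constraints, then sort by a blended score (highest first)."""
    if not 0.0 <= diversity_weight <= 1.0:
        raise ValueError("diversity_weight must be between 0 and 1")
    # Hard constraints: minimum-enrollment threshold and geographic preference.
    eligible = [
        s for s in sites
        if s.expected_enrollment >= min_enrollment and s.region in regions
    ]

    def blended(s):
        # Assumed blend: linear mix of model score and diversity score.
        return (1 - diversity_weight) * s.model_score + diversity_weight * s.diversity_score

    return sorted(eligible, key=blended, reverse=True)


sites = [
    Site("S1", "US-East", 40, 0.80, 0.30),
    Site("S2", "US-East", 25, 0.70, 0.90),
    Site("S3", "EU", 60, 0.95, 0.50),
    Site("S4", "US-West", 10, 0.99, 0.99),
]
ranked = rank_sites(sites, min_enrollment=20,
                    regions={"US-East", "US-West"}, diversity_weight=0.5)
print([s.site_id for s in ranked])
```

With these made-up inputs, S3 drops out on geography and S4 on the enrollment threshold, and raising `diversity_weight` moves S2 ahead of S1. The linear blend is only one possible way to express diversity weighting; the source does not say how the workbench combines the two.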

Stack

| Layer | Component | Notes |
| --- | --- | --- |
| Frontend | React | Workflow UI + map / charts / shortlist tables. |
| Backend | FastAPI (Python) | Routes through the Databricks Apps runtime. |
| Auth | Workspace service principal | App identity is provisioned by the workspace identity system. |
| Analytical data | Unity Catalog via SQL Statement API | Site features, historical performance, predictions, SHAP attributions, audit log. |
| Operational state | Lakebase (Postgres) | Saved shortlists, team-sharing state. |
| NL query | AI/BI Genie via workspace REST API | Embedded in the workflow, not a separate UI. |
| ML | TA-segmented LightGBM models | Trained on sponsor's CTMS / EDC / IRT history. |
| Lineage | MLflow + Unity Catalog | Every model version + every SHAP attribution lineaged. |
| Deployment | Databricks workspace with Unity Catalog | "Approximately 30 minutes of technical deployment time, before sponsor-specific security review and validation." |

All connections are internal to the workspace: the app makes no external API calls and needs no separate operational-database infrastructure.
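
To make the "Unity Catalog via SQL Statement API" row concrete, here is a hedged sketch of the request body a backend might build for the Databricks SQL Statement Execution API (`POST /api/2.0/sql/statements`). The body fields `warehouse_id`, `statement`, `parameters`, and `wait_timeout` follow the public API; the table name, column names, and warehouse ID are hypothetical, and the function name `build_site_query` is an invention for this example. Nothing is sent over the network.

```python
import json

STATEMENTS_ENDPOINT = "/api/2.0/sql/statements"  # SQL Statement Execution API path


def build_site_query(warehouse_id, therapeutic_area):
    """Build the JSON body for a parameterized query; no request is made."""
    return {
        "warehouse_id": warehouse_id,
        # Hypothetical catalog.schema.table and column names.
        "statement": (
            "SELECT site_id, feasibility_score "
            "FROM clinical.feasibility.site_predictions "
            "WHERE therapeutic_area = :ta "
            "ORDER BY feasibility_score DESC"
        ),
        # Named parameter markers keep user input out of the SQL text.
        "parameters": [
            {"name": "ta", "value": therapeutic_area, "type": "STRING"},
        ],
        # Ask the API to wait up to 30 seconds for a synchronous result.
        "wait_timeout": "30s",
    }


body = build_site_query("hypothetical-warehouse-id", "oncology")
print(STATEMENTS_ENDPOINT)
print(json.dumps(body, indent=2))
```

In a real app the body would be posted to the workspace host with the app's service-principal credentials; how the workbench itself wraps this call is not described in the source.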

Composite feasibility score inputs

From the post: "Composite feasibility scores combine real-world evidence, patient access data, historical site performance, site qualification history, Open Payments KOL signal, and protocol execution factors — all driven by TA-segmented LightGBM models trained on the organization's own CTMS, EDC, and IRT history."

  • CTMS — Clinical Trial Management System (sponsor-owned trial metadata + site relationships).
  • EDC — Electronic Data Capture (per-trial subject-level data).
  • IRT — Interactive Response Technology (randomization + drug supply per trial).
  • CMS Open Payments — public dataset; "when used appropriately, correlates with research engagement and infrastructure and it is freely available."
  • Real-world evidence + patient access data — the post does not detail data-source specifics for this leg.

The post emphasises the architecture-level claim: institutional sponsor data is the training data, not industry-aggregate data from a CRO scoring product. "Every new study makes the prediction better, and every new site relationship is reflected in the next training run."
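
"TA-segmented" means one model per therapeutic area rather than a single global model. The sketch below shows only that routing idea, under loud assumptions: the two scoring functions are toy stand-ins with made-up weights (not LightGBM models), and the feature names and the `MODELS_BY_TA` registry are hypothetical.

```python
from typing import Callable, Dict

Features = Dict[str, float]
Model = Callable[[Features], float]


def oncology_model(f: Features) -> float:
    # Toy stand-in with invented weights, not a trained model.
    return 0.7 * f["historical_enrollment_rate"] + 0.3 * f["patient_access"]


def cardiology_model(f: Features) -> float:
    # A different segment weighs the same features differently.
    return 0.4 * f["historical_enrollment_rate"] + 0.6 * f["patient_access"]


# Hypothetical registry: one model per therapeutic area.
MODELS_BY_TA: Dict[str, Model] = {
    "oncology": oncology_model,
    "cardiology": cardiology_model,
}


def score_site(therapeutic_area: str, features: Features) -> float:
    """Route a site's features to the model for its therapeutic area."""
    try:
        model = MODELS_BY_TA[therapeutic_area]
    except KeyError:
        raise ValueError(f"no model registered for {therapeutic_area!r}") from None
    return model(features)


site = {"historical_enrollment_rate": 0.5, "patient_access": 1.0}
print(score_site("oncology", site), score_site("cardiology", site))
```

The same site can score differently depending on the protocol's therapeutic area, which is the practical effect of segmenting the models.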

Diversity as a first-class scoring dimension

Per the post: "Diversity considerations are a first-class scoring dimension, aligned with FDA's Diversity Action Plan expectations under FDORA 2022."

The architectural enabler is the governed SHAP attribution Delta table: "Sponsors can audit recommendations for systematic under-weighting of community sites, minority-serving institutions, or first-time investigators — turning explainability into a fairness control." Because every attribution is stored as a queryable row, the fairness audit is a single SQL query over the whole population of attributions, not a prediction-by-prediction inspection.
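
A population-level audit of this kind might look like the following sketch. It uses Python's built-in `sqlite3` as an in-memory stand-in for the Delta table so the example runs anywhere; the `shap_attributions` table, its columns, the site categories, and every value are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a governed Delta table
conn.execute(
    "CREATE TABLE shap_attributions ("
    "site_id TEXT, site_category TEXT, feature TEXT, shap_value REAL)"
)
rows = [
    ("S1", "academic_center", "historical_enrollment", 0.20),
    ("S2", "academic_center", "historical_enrollment", 0.10),
    ("S3", "community_site", "historical_enrollment", -0.15),
    ("S4", "community_site", "historical_enrollment", -0.25),
    ("S3", "community_site", "patient_access", 0.05),
]
conn.executemany("INSERT INTO shap_attributions VALUES (?, ?, ?, ?)", rows)

# Mean attribution per (site category, feature), most negative first:
# a persistently negative mean for one category is a signal to investigate.
audit = conn.execute(
    """
    SELECT site_category, feature,
           COUNT(*) AS n,
           ROUND(AVG(shap_value), 3) AS mean_shap
    FROM shap_attributions
    GROUP BY site_category, feature
    ORDER BY mean_shap ASC
    """
).fetchall()
for row in audit:
    print(row)
```

In this toy data the `historical_enrollment` feature pulls community sites down on average while lifting academic centers, which is the kind of systematic pattern the quoted audit is meant to surface. A negative mean is a prompt for review, not proof of unfairness on its own.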

What it is not

Per the post: "This is a decision-support layer, not a source-of-record system. The CTMS/EDC/IRT remain authoritative. The workbench produces predictions whose lineage is governed in Unity Catalog and MLflow."

Position in the broader Clinical Operations Intelligence Hub

The post names the Site Feasibility Workbench as "the first public release of a broader architecture — the Databricks Clinical Operations Intelligence Hub — covering the full trial lifecycle":

  • Site Feasibility and Selection — this Workbench.
  • Patient Cohort and Recruitment — protocol-aligned cohort building from EHR + RWE at Lakehouse scale.
  • Enrollment Velocity Optimizer — ML stall prediction per site per month with a 1–3 month forward horizon.
  • Risk-Based Monitoring and Compliance — continuous monitoring for enrollment anomalies, data lags, and protocol deviations.

"All four deploy as Databricks Apps. All four query Unity Catalog directly. None make external API calls."

Why this matters for system design

The Site Feasibility Workbench is the wiki's first canonical instance of a fully open-source Databricks App that can be inspected as a reference for the single-platform architecture pattern. It's the existence proof that the "Architecture Argument" in the source post can be implemented end-to-end in a deployable artifact, not just described as a thesis.

For practitioners it answers a concrete question: "if I want to build a regulated decision-support app inside a Databricks workspace, what does the actual code shape look like?" — FastAPI + React, SQL Statement API for data, Lakebase for app state, Genie REST API for NL query, MLflow for model versioning, governed Delta tables for SHAP attributions.

