PATTERN Cited by 1 source

SHAP attribution as governed Delta table¶

Pattern: when a regulated ML decision-support system makes a prediction, write the prediction's Shapley-value attribution to a governed Delta table in Unity Catalog — alongside the prediction, lineaged through UC to the training data, versioned in systems/mlflow by the model that produced it. The attribution becomes a first-class queryable artifact with the same governance posture as any other production table.

The architectural payoff: "the rationale behind a site selection is as auditable as the score itself" — every regulator's question becomes a SQL query, every fairness audit becomes a population-level aggregate, every model-version trace becomes a one-line MLflow lookup.

Canonical wiki instance: sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse — "Every prediction carries a SHAP attribution stored as a governed Unity Catalog Delta table — versioned in MLflow, lineaged through Unity Catalog, queryable."

Implementation shape¶

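# At inference time, inside the prediction service.
from datetime import datetime, timezone
from uuid import uuid4

# predict_with_explanation stands for whatever wrapper returns the prediction
# together with its per-feature SHAP contributions.
prediction, shap_values = model.predict_with_explanation(features)

# Write prediction + attribution together as one row in a single Delta append into UC.
# In practice, pass an explicit schema: by default nested Python dicts infer as maps, not structs.
spark.createDataFrame([{
    "recommendation_id":   str(uuid4()),         # UUID objects are not a Spark type
    "model_version":        model_version,        # pinned when the model was loaded; mlflow.active_run() is None outside a run
    "prediction":           prediction,
    "feature_values":       features,            # struct
    "shap_attributions":    shap_values,         # array<struct<feature, contribution>>
    "predicted_at":         datetime.now(timezone.utc),
}]).write.format("delta") \
   .mode("append") \
   .saveAsTable("clinops.audit.site_feasibility_attributions")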

The table is registered in UC with the same ABAC policies, governed tags, and data classifiers as any other production table. PHI handling rides on the catalog's HIPAA Safe Harbor / Expert Determination posture configured at the catalog or schema level.
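A minimal sketch of how the row payload might be assembled and validated before the write, using only the standard library so it runs without Spark. `build_attribution_row` and the `base_value` argument are hypothetical names, not part of the source's implementation. The check relies on SHAP's local-accuracy property: the explainer's base value plus the per-feature contributions should reproduce the model output, in whatever output space the explainer works in.

```python
import math
from datetime import datetime, timezone
from uuid import uuid4


def build_attribution_row(model_version, prediction, base_value, features, contributions):
    """Assemble one audit row: prediction, inputs, and SHAP contributions together.

    `features` and `contributions` are dicts keyed by feature name.
    """
    if set(features) != set(contributions):
        raise ValueError("every feature needs exactly one contribution")

    # SHAP local accuracy: base value + sum of contributions == model output.
    reconstructed = base_value + sum(contributions.values())
    if not math.isclose(reconstructed, prediction, rel_tol=1e-6, abs_tol=1e-9):
        raise ValueError("attributions do not sum to the prediction")

    return {
        "recommendation_id": str(uuid4()),
        "model_version": model_version,
        "prediction": prediction,
        "feature_values": dict(features),
        # Feature names travel with the values, matching array<struct<feature, contribution>>.
        "shap_attributions": [
            {"feature": name, "contribution": value}
            for name, value in sorted(contributions.items())
        ],
        "predicted_at": datetime.now(timezone.utc).isoformat(),
    }


# Illustrative values only.
row = build_attribution_row(
    model_version="7",
    prediction=0.72,
    base_value=0.50,
    features={"enrollment_rate": 3.1, "prior_trials": 12},
    contributions={"enrollment_rate": 0.15, "prior_trials": 0.07},
)
```

Rejecting a row whose contributions do not add up keeps malformed explanations out of the audit table, where they would be harder to spot later.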

Three property guarantees¶

The pattern rests on three guarantees:

Temporal correctness via MLflow versioning. "Versioned in MLflow" — every row carries the model-version identifier; the audit chain leads to the exact model version that produced the prediction, not the current production version.
Upstream completeness via UC lineage. "Lineaged through Unity Catalog" — the lineage graph traces the prediction backwards to the training-data tables, the feature-engineering pipelines, and the data sources. A regulator can walk the chain end-to-end inside one governance system.
Population queryability via SQL. "Queryable" — Delta is the substrate, so per-prediction inspection (SELECT ... WHERE recommendation_id = ...) and population aggregation (GROUP BY site_type) both work directly without ETL.
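The two query shapes named in the third guarantee can be sketched as follows. This is an illustration only: `sqlite3` stands in for Spark SQL so the example runs anywhere, and the `attributions` table is a hypothetical long-form (one row per feature) view of the audit table with made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE attributions (
           recommendation_id TEXT,
           model_version     TEXT,
           feature           TEXT,
           contribution      REAL)"""
)
conn.executemany(
    "INSERT INTO attributions VALUES (?, ?, ?, ?)",
    [
        ("rec-1", "6", "enrollment_rate", 0.20),
        ("rec-1", "6", "prior_trials", -0.05),
        ("rec-2", "7", "enrollment_rate", 0.10),
        ("rec-2", "7", "prior_trials", 0.30),
    ],
)

# Per-prediction inspection: what drove rec-1, and under which model version?
per_prediction = conn.execute(
    """SELECT model_version, feature, contribution
       FROM attributions
       WHERE recommendation_id = ?
       ORDER BY ABS(contribution) DESC""",
    ("rec-1",),
).fetchall()

# Population aggregation: mean absolute contribution per feature over all predictions.
per_feature = conn.execute(
    """SELECT feature, AVG(ABS(contribution)) AS mean_abs
       FROM attributions
       GROUP BY feature
       ORDER BY feature"""
).fetchall()
```

Because each row carries its own `model_version`, the per-prediction query returns the version that actually produced the score rather than whatever is in production at query time.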

When the pattern applies¶

Regulated ML decision-support systems where explainability is required, not optional. The 2026-05-13 source frames this through three regulatory drivers: 21 CFR Part 11 (electronic records and signatures), ICH E6(R3) (good clinical practice), and FDA GMLP (good machine learning practice).
Fairness controls require population-level audit. Per the source: "Sponsors can audit recommendations for systematic under-weighting of community sites, minority-serving institutions, or first-time investigators — turning explainability into a fairness control." Systematic bias can only be detected by aggregation over a queryable population.
The substrate already has UC + MLflow + Delta. The pattern composes onto an existing Databricks-style governed Lakehouse; on other substrates the equivalent shape (registry + lineage + ACID storage with population queries) is needed before this pattern can be applied directly.
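The fairness-control case can be sketched as a population-level comparison. This is a screening heuristic under stated assumptions, not the source's audit method: the rows, the `site_type` column, the `prior_trials` feature, and the tolerance are all hypothetical, and a real audit would add a proper statistical test.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical attribution rows joined with a site attribute.
rows = [
    {"site_type": "academic", "feature": "prior_trials", "contribution": 0.12},
    {"site_type": "academic", "feature": "prior_trials", "contribution": 0.08},
    {"site_type": "community", "feature": "prior_trials", "contribution": -0.10},
    {"site_type": "community", "feature": "prior_trials", "contribution": -0.06},
]


def mean_contribution_by_group(rows, feature, group_key):
    """Average one feature's SHAP contribution within each group."""
    grouped = defaultdict(list)
    for r in rows:
        if r["feature"] == feature:
            grouped[r[group_key]].append(r["contribution"])
    return {group: mean(values) for group, values in grouped.items()}


def flag_underweighted(group_means, tolerance):
    """Return groups whose mean sits more than `tolerance` below the mean of group means."""
    overall = mean(group_means.values())
    return sorted(g for g, m in group_means.items() if m < overall - tolerance)


group_means = mean_contribution_by_group(rows, "prior_trials", "site_type")
flagged = flag_underweighted(group_means, tolerance=0.05)
```

In this toy data the feature pushes community-site scores down on average while it pushes academic-site scores up, which is the kind of systematic gap that only shows up when attributions are aggregated across many predictions.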

When it doesn't fit¶

Pure-text generative models where a Shapley-value-style attribution doesn't map to a feature vector. (LLM-as-judge audit shapes belong elsewhere — see concepts/llm-as-judge and patterns/llm-judge-as-inline-pipeline-stage.)
High-throughput recommendation systems where attribution generation cost is prohibitive. SHAP attribution at every inference is expensive; the pattern fits regulated low-volume high-stakes decision-support, not consumer-scale recommendation ranking.
Models where the substrate doesn't support time-travel or schema evolution. The pattern assumes you can store attribution rows from old model versions alongside attribution rows from the current model — Delta time-travel and schema evolution make this cheap; non-ACID columnar substrates make it painful.

Trade-offs¶

Axis	Cost	Benefit
Storage	Each prediction stores N feature-contribution values. At ~50 features × ~1M predictions, that is ~50M contribution values per training cycle (50M rows if the array is exploded to one row per feature). Delta compresses well, but it's not free.	The attribution population is the substrate for fairness audit, regulatory inquiry, model-debugging, and post-hoc analysis.
Compute	SHAP at inference is 5-50× the cost of the prediction itself for tree-based models, more for deep networks.	The cost is paid once at inference; querying the attribution table later is just a SELECT.
Lineage authoring	Someone has to wire the model-version + UC-lineage references into the write path.	One-time cost; pays off on every audit query.
Schema evolution	When the feature set changes, the attribution schema changes, so schema evolution is required.	Delta supports this natively; the audit chain spans schema versions.
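The storage figure in the table can be checked with back-of-the-envelope arithmetic. The sketch below assumes 8 bytes per contribution (one double) and ignores feature-name strings, Parquet encoding, and compression, so it is a rough raw size for the contribution values alone rather than a measured on-disk size.

```python
def attribution_storage_estimate(n_predictions, n_features, bytes_per_value=8):
    """Return (number of contribution values, raw bytes) for the attribution payload."""
    values = n_predictions * n_features
    return values, values * bytes_per_value


values, raw_bytes = attribution_storage_estimate(n_predictions=1_000_000, n_features=50)
raw_megabytes = raw_bytes / 1_000_000
# values == 50_000_000 and raw_megabytes == 400.0 under these assumptions
```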

Adjacent patterns¶

patterns/governed-delta-tables-per-modality — same Delta-as-governance-substrate framing but for raw multimodal training data. The two patterns pair: training data lands in modality-tagged Delta tables under UC, predictions plus SHAP attributions land in audit-tagged Delta tables under the same UC.
patterns/in-workspace-app-as-decision-support — the audit table is read by an in-workspace app to render per-prediction explanations to end users via the same SQL Statement API that serves the rest of the app's data path.
patterns/llm-judge-as-inline-pipeline-stage — LLM-judge scores stored as governed Delta tables for pipeline-quality audit; same architectural shape applied at a different ML altitude.
concepts/explainability-log-shaped patterns — the generalised shape of "decision + explanation + version stored together" appears across regulated-ML literature; SHAP-as-governed-Delta is the Lakehouse-native instantiation.

Why the substrate matters¶

The naive alternative — "generate the explanation on demand when a regulator asks" — has two structural failure modes:

The model has changed. A prediction made on day N gets re-explained on day N+90 against the current production model, which is not the model that produced the prediction. Audit chain broken.
Population audit is impossible. The question "are community sites systematically under-weighted?" requires aggregation over thousands of past predictions. An on-demand explainer service can't produce that population — only the storage-as-population substrate can.

The pattern eliminates both failure modes by making the attribution the same kind of artifact as the prediction: stored, versioned, governed, queryable.
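The first failure mode can be illustrated with a toy linear model, where SHAP values have a closed form under the feature-independence assumption: each contribution is the weight times the feature's deviation from its background mean. The weights, background means, and feature names below are invented for the example; the source's models are LightGBM, not linear.

```python
def linear_shap(weights, background_means, x):
    """SHAP values for a linear model, assuming independent features."""
    return {
        name: weights[name] * (x[name] - background_means[name])
        for name in weights
    }


def top_driver(attribution):
    """Feature with the largest absolute contribution."""
    return max(attribution, key=lambda name: abs(attribution[name]))


background = {"enrollment_rate": 2.0, "prior_trials": 10.0}
x = {"enrollment_rate": 3.0, "prior_trials": 4.0}

model_v1 = {"enrollment_rate": 0.20, "prior_trials": 0.02}  # in production on day N
model_v2 = {"enrollment_rate": 0.04, "prior_trials": 0.05}  # in production on day N+90

stored = linear_shap(model_v1, background, x)        # written at inference time
re_explained = linear_shap(model_v2, background, x)  # regenerated on demand later

stored_driver = top_driver(stored)
re_explained_driver = top_driver(re_explained)
```

Here the stored attribution names `enrollment_rate` as the main driver, while the on-demand explanation against the newer model names `prior_trials`: the same input yields a different rationale, which is why the attribution has to be captured at inference time.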

Seen in¶

sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse — Canonical wiki instance. Site Feasibility Workbench writes per-recommendation SHAP attributions into a UC-governed Delta table; Databricks Apps reads from the same table for per-prediction explanation drill-downs in the SHAP-driven site deep-dive workflow step. Implementation backbone: TA-segmented LightGBM models trained on the sponsor's CTMS / EDC / IRT history; SHAP attributions versioned in systems/mlflow and lineaged through systems/unity-catalog. Fairness-audit application: under-weighting of community sites, minority-serving institutions, first-time investigators detected by population-level SQL queries against the attribution table.

concepts/governed-shap-attribution-table — the substrate this pattern produces.
concepts/explainable-ai-decision — the explainability primitive this pattern stores.
concepts/single-platform-application-architecture — the architectural shape that makes this audit-substrate coherent end-to-end.
patterns/in-workspace-app-as-decision-support — pairs with this pattern; the app reads the attribution table for end-user explanations.
patterns/governed-delta-tables-per-modality — sibling pattern for training-data substrate.
systems/delta-lake — the storage substrate.
systems/unity-catalog — the governance substrate.
systems/mlflow — the model-version registry.
systems/site-feasibility-workbench — reference implementation.