
CONCEPT Cited by 1 source

Data-centric AI governance

Definition

Data-centric AI governance is the principle (canonicalised in Databricks' 2026-05-20 four-pillars post) that an agent's behaviour is almost entirely determined by the data it has access to, so AI governance and data governance must be one system, not two. An architecture counts as data-centric when the same catalog / IAM / policy surface that governs business data also governs AI assets.

Canonical statement on the wiki

"Here's the principle most AI governance tools miss: an agent's behavior is almost entirely determined by the data it has access to. What it can read, how fresh that data is, whether sensitive fields are masked, these aren't AI governance questions. They're data governance questions. Treat them separately, and you end up with two incomplete systems. Treat them together, and governance becomes self-reinforcing." — Source: sources/2026-05-20-databricks-governing-ai-agents-at-scale-with-unity-catalog

The "two incomplete systems" failure mode

The structural failure the principle warns against is the typical state of orgs that bolt AI-specific guardrails onto an existing data platform:

| Layer | What it knows | What it doesn't |
|---|---|---|
| AI guardrails layer | "Don't leak PII" in the abstract | Doesn't know which columns are PII |
| Data governance layer | Knows which columns are PII (data classification) | Doesn't know which queries came from agents |

Two layers, neither complete. A guardrail that doesn't know the data is just heuristics; a data governance layer that doesn't see agent queries can't apply the right mask at the right time.

The "self-reinforcing" property

"Treat them together, and governance becomes self-reinforcing."

The architectural payoff: the data classification you already have becomes your AI governance automatically. Concretely on Databricks (per the source):

[UC Data Classification](<../systems/unity-catalog-data-classification.md>) continuously scans columns
       │ produces
   PII / HIPAA / GDPR-tagged columns
       │ feeds
   ABAC row filters / column masks ([tag-driven ABAC](<../patterns/tag-driven-attribute-based-access-control.md>))
       │ enforced via
   [OBO](<../patterns/on-behalf-of-agent-authorization.md>) — agent inherits user's permissions
       │ result
   *"Masked columns remain masked regardless of which agent or framework requests them."*

No AI-specific configuration required. The PII tag was always going to drive masking; the agent just happens to be the caller now.
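The chain above can be sketched in a few lines of Python. This is a minimal illustration rather than Databricks code: `COLUMN_TAGS`, `UNMASK_GROUPS`, `Caller` and `read_column` are hypothetical names, and the tags are hard-coded where a classification scan would normally produce them. The point it shows is that the masking decision reads only the column's tags and the user's groups, so it comes out the same whether or not an agent is the caller.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical column tags, standing in for the output of a classification scan.
COLUMN_TAGS = {
    "crm.customers.email": {"PII"},
    "crm.customers.region": set(),
}

# Hypothetical policy: the group a user needs in order to see a tagged column unmasked.
UNMASK_GROUPS = {"PII": "pii_readers"}


@dataclass(frozen=True)
class Caller:
    user: str
    groups: frozenset
    via_agent: Optional[str] = None  # set when an agent acts on behalf of the user


def read_column(caller: Caller, column: str, value: str) -> str:
    """Return the value, masked unless the user's groups clear every tag on the column."""
    for tag in COLUMN_TAGS.get(column, set()):
        required = UNMASK_GROUPS.get(tag)
        if required is not None and required not in caller.groups:
            return "***"
    return value  # via_agent is never consulted: the decision lives at the data layer


ana = Caller("ana", frozenset({"analysts"}))
ana_via_bot = Caller("ana", frozenset({"analysts"}), via_agent="support-bot")
print(read_column(ana, "crm.customers.email", "a@example.com"))
print(read_column(ana_via_bot, "crm.customers.email", "a@example.com"))
```

Both calls return the same masked value because `via_agent` plays no part in the check, which is the on-behalf-of behaviour the diagram describes.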

Three concrete properties (per the post)

  1. Audit substrate is data-substrate. "Both land in your lakehouse as tables, retainable on your terms." Inference Tables (model I/O) and UC audit logs (data access) write to the same lakehouse, queryable with the same SQL — so audit is a join, not a tool migration.
  2. Data quality is forensic substrate. "Join it against agent traces, and you move from 'the agent gave a wrong answer' to 'the agent queried a table that had been flagged as stale', connecting agent behavior to the quality of the data underneath it." The agent's answer-quality regression becomes a join between Inference Tables and data-quality monitoring tables.
  3. Classification feeds access control automatically. "Data classification adds a further layer: an agentic AI system continuously scans and tags sensitive columns, such as PII, HIPAA and GDPR-regulated data, and those tags feed directly into access control. Masked columns remain masked regardless of which agent or framework requests them." The classification → tag → ABAC pipeline doesn't change just because an agent is the caller — the masking logic lives at the data layer, not the agent layer.
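The "audit is a join" idea in the first two properties can be sketched with an in-memory SQLite database. The table names (`agent_traces`, `quality_flags`), their columns, and the sample rows are all assumptions made for illustration; they are not the schemas of Inference Tables or of any Databricks monitoring table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for agent trace records
CREATE TABLE agent_traces (trace_id TEXT, agent TEXT, table_name TEXT, answered_at TEXT);
-- Hypothetical stand-in for data-quality monitoring output
CREATE TABLE quality_flags (table_name TEXT, status TEXT, flagged_at TEXT);
INSERT INTO agent_traces VALUES
  ('t1', 'support-bot', 'sales.orders',  '2026-05-02'),
  ('t2', 'support-bot', 'crm.customers', '2026-05-02');
INSERT INTO quality_flags VALUES
  ('sales.orders', 'stale', '2026-05-01');
""")


def traces_on_stale_tables(conn):
    """Return (trace_id, table_name) for traces that read a table already flagged stale."""
    rows = conn.execute("""
        SELECT t.trace_id, t.table_name
        FROM agent_traces AS t
        JOIN quality_flags AS q ON q.table_name = t.table_name
        WHERE q.status = 'stale' AND q.flagged_at <= t.answered_at
        ORDER BY t.trace_id
    """)
    return rows.fetchall()


print(traces_on_stale_tables(conn))
```

Because both record types sit in the same store, the forensic question is a single query; with the sample rows only trace `t1` is returned, since `sales.orders` was flagged stale before that trace ran.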

Why the pillar takes the data-centric form

The post's argument is that agent behaviour has three inputs — model, prompt, data — and only the data is durably governable:

  • Model changes on weekly cadences (frontier labs ship new versions).
  • Prompt is non-deterministic by construction (the LLM rewords, expands, re-routes).
  • Data has a stable identity (catalog.schema.table.column) and existing governance ownership.

If governance is bound to the model or the prompt, it is unstable. If it is bound to data, it survives churn in the agent layer. This is the data half of the argument that governance-travels-with-resources makes for the resource half.
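A toy contrast makes the stability argument concrete. The two policy dictionaries and the model names below are hypothetical; the sketch only assumes that a policy has to be keyed by something, and compares keying it by model version with keying it by column identity.

```python
# Hypothetical policies, keyed two different ways.
POLICY_BY_MODEL = {"model-v1": {"crm.customers.email": "mask"}}
POLICY_BY_COLUMN = {"crm.customers.email": "mask"}


def decision_bound_to_model(model: str, column: str) -> str:
    """Look the policy up under the model version; unknown versions fall through to 'allow'."""
    return POLICY_BY_MODEL.get(model, {}).get(column, "allow")


def decision_bound_to_data(model: str, column: str) -> str:
    """Look the policy up under the column identity; the model version is ignored."""
    return POLICY_BY_COLUMN.get(column, "allow")


for model in ("model-v1", "model-v2"):
    print(model,
          decision_bound_to_model(model, "crm.customers.email"),
          decision_bound_to_data(model, "crm.customers.email"))
```

When the model is swapped for `model-v2`, the model-keyed lookup silently loses the mask while the column-keyed lookup is unchanged.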

Sibling primitives in adjacent stacks

What the principle does not claim

  • It does not claim "if you do data governance well, you don't need AI guardrails." The post still ships Guardrails as Pillar 1 layer 3 — the runtime content filter that scans for PII / jailbreak / hallucinations / sensitive content.
  • It does not claim "AI governance is exclusively a data problem." Service Policies and OBO operate above the data layer — they govern tool calls, not data values.
  • The claim is structural: data governance is necessary and self-reinforcing, not exhaustive.

Seen in

Source
