SYSTEM Cited by 1 source

Unity Catalog Data Classification

Unity Catalog Data Classification is Unity Catalog's agentic sensitive-data-detection engine: it scans tables continuously, identifies PII / PHI / regulatory-category data, and writes the matching governed tag onto the column. It reached General Availability on 2026-05-13, with custom classifiers in Beta (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).

The detection pipeline uses "proven pattern recognition, metadata, and large language models" and delivers "higher accuracy than manual or regex-only tools" (Source: same). Output is a governed tag, which means a downstream ABAC policy referencing that tag automatically starts protecting the newly classified column — no manual wiring step.

Architectural shape

new table arrives in catalog
        │
        ▼
┌──────────────────────────────────────┐
│  classifier scan                     │
│  ──────────────────                  │
│  • pattern-recognition signals       │
│  • column metadata signals (name,    │
│    type, comments)                   │
│  • LLM evaluation over sample rows   │
│  • surrounding-Unity-Catalog context │
│    (lineage, related tables)         │
└──────────────────────────────────────┘
        │
        ▼
classifier emits governed tag(s)
        │
        ▼
tag attached to column → ABAC policies
referencing that tag start matching
        │
        ▼
new column protected the next time it is queried
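
As a rough illustration of the scan stage, the sketch below combines two of the signal types listed in the diagram (pattern recognition over sample values and column metadata) into a single tag decision. Everything in it is an assumption made for illustration: the `Column` type, the email regex, the equal weighting, the 0.5 threshold, and the `pii.email` tag name are hypothetical, and the LLM and lineage signals are left out. The source does not disclose how the real classifier weighs its signals.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; none of these names come from Databricks.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Column:
    name: str
    comment: str
    samples: list[str]


def pattern_signal(col: Column) -> float:
    """Fraction of sample values that look like an email address."""
    if not col.samples:
        return 0.0
    hits = sum(1 for value in col.samples if EMAIL_RE.match(value))
    return hits / len(col.samples)


def metadata_signal(col: Column) -> float:
    """1.0 if the column name or comment mentions 'email', else 0.0."""
    text = f"{col.name} {col.comment}".lower()
    return 1.0 if "email" in text else 0.0


def classify(col: Column, threshold: float = 0.5) -> set[str]:
    """Average the two signals; emit a tag when the score reaches the threshold."""
    score = (pattern_signal(col) + metadata_signal(col)) / 2
    return {"pii.email"} if score >= threshold else set()


contact = Column("contact_email", "", ["a@example.com", "b@example.org"])
notes = Column("notes", "free text", ["call back later", "n/a"])
print(classify(contact))  # {'pii.email'}
print(classify(notes))    # set()
```

The point of the sketch is only the shape: several independent signals feed one decision, and the output is a set of tags rather than a policy.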

GA-disclosed properties

Built-in compliance classifiers

GA expanded compliance coverage. Built-in classifiers cover:

  • GDPR (EU general data protection)
  • HIPAA (US healthcare)
  • GLBA (US financial)
  • DPDPA (India data protection; the India regional pack had not yet shipped at GA)
  • PCI (payment card)

Regional packs cover the UK, Germany, Australia, and Brazil, with India and Canada announced as coming in the month of the GA post (May 2026).

(Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags; "New classifiers cover GDPR, HIPAA, GLBA, DPDPA, and PCI, alongside regional support across the UK, Germany, Australia, and Brazil. Additional classifiers for India and Canada will be coming this month.")

Custom classifiers (Beta at GA)

"Give Data Classification any Governed Tag and the system will automatically identify matching columns. Detection patterns are learned from existing tagged columns and surrounding Unity Catalog metadata, automatically fitting to your data." (Source: same.)

The architectural property: stewards do not specify detection patterns explicitly. They tag a few example columns, the classifier learns the pattern, and the system extrapolates to every matching column in the catalog. The training signal is the set of existing tags, which are themselves applied either by humans or by previous classifier passes. This produces a self-reinforcing feedback loop between classification and tagging.
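
To make "learn from examples instead of writing patterns" concrete, here is a deliberately tiny sketch. It assumes the only available signal is the column name and "learns" the tokens shared by every tagged example. The functions `tokens`, `learn`, and `matches`, and the column names, are hypothetical. The real classifier also draws on surrounding Unity Catalog metadata, and its learning method is not disclosed.

```python
import re


def tokens(column_name: str) -> set[str]:
    """Split a column name such as 'cust_loyalty_id' into lowercase tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t}


def learn(tagged_examples: list[str]) -> set[str]:
    """Keep only the tokens that appear in every tagged example."""
    token_sets = [tokens(name) for name in tagged_examples]
    return set.intersection(*token_sets) if token_sets else set()


def matches(column_name: str, learned: set[str]) -> bool:
    """A column matches when it contains all learned tokens."""
    return bool(learned) and learned <= tokens(column_name)


# A steward tags two example columns; no pattern is written by hand.
learned = learn(["loyalty_id", "cust_loyalty_id"])
candidates = ["member_loyalty_id", "order_id", "loyalty_points"]
print([c for c in candidates if matches(c, learned)])  # ['member_loyalty_id']
```

Columns tagged by this pass could be fed back into `learn` on the next pass, which is the feedback loop described above in miniature.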

See concepts/agentic-data-classification for the broader concept.

Human-in-the-loop validation

"Customer feedback and quality evaluations have further improved detection accuracy. Additionally, users can exclude any false positive detections from being tagged, which continuously improves precision of future scans." (Source: same.)

The exclusion list functions as negative training data for the classifier. False-positive feedback is the operational mechanism for moving precision in production without retraining.
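
A minimal sketch of the exclusion step, assuming a detection is modelled as a `(column, tag)` pair and the exclusion list is a set of such pairs. The names and the data model are hypothetical. The sketch covers only suppression of known false positives; how exclusions feed back into future scans is not described in the source.

```python
# Hypothetical model: a detection is a (fully qualified column, tag) pair.
Detection = tuple[str, str]


def apply_exclusions(
    detections: list[Detection], excluded: set[Detection]
) -> list[Detection]:
    """Drop detections that a steward has marked as false positives."""
    return [d for d in detections if d not in excluded]


scan = [
    ("main.crm.contacts.email", "pii.email"),
    ("main.ops.builds.commit_author", "pii.email"),
]
# The steward rejects the second detection once; later scans skip it.
excluded = {("main.ops.builds.commit_author", "pii.email")}

to_tag = apply_exclusions(scan, excluded)
print(to_tag)  # [('main.crm.contacts.email', 'pii.email')]
```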

Consolidated visibility dashboard

"View all classifications detected across a workspace and drill down into where they were found, who has access, and where ABAC policies need to be created for protection." (Source: same.)

The dashboard composes three pieces of information per classification: (a) where the data is (catalogs / schemas / tables / columns matching), (b) who has access (which principals can currently see it), (c) whether protection exists (whether an ABAC policy referencing the tag is in place). The third piece surfaces the coverage gap in the "organize → detect → protect" pipeline: a classification with no matching ABAC policy is an unprotected detection.
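
The coverage-gap view reduces to a set difference between detected tags and tags referenced by at least one policy. The sketch below assumes two hypothetical inputs, a mapping from tag to detected columns and a set of tags that some ABAC policy references. It does not use any Databricks API.

```python
def unprotected_detections(
    detections: dict[str, list[str]], policy_tags: set[str]
) -> dict[str, list[str]]:
    """Return detected tags (and their columns) that no policy references."""
    return {
        tag: columns
        for tag, columns in detections.items()
        if tag not in policy_tags
    }


# Hypothetical inventory: what was detected, and which tags have a policy.
detections = {
    "pii.email": ["main.crm.contacts.email"],
    "pii.ssn": ["main.hr.staff.ssn"],
}
policy_tags = {"pii.email"}

print(unprotected_detections(detections, policy_tags))
# {'pii.ssn': ['main.hr.staff.ssn']}
```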

Composition with the rest of the GA pipeline

Classification is the detect step in the organize → detect → protect pipeline framed across all three GA capabilities:

  1. Organize: governance teams establish the governed tag taxonomy.
  2. Detect: Data Classification runs continuously, writing tags on newly arriving data.
  3. Protect: ABAC policies that reference the tags evaluate at query time, applying row filters and column masks.

The load-bearing engineering claim is "there is no handoff between systems, and no manual step between discovery and protection" — the three capabilities operate within Unity Catalog so the substrate is unified.
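
The "no manual step" property follows from policies being written against tags, not columns. The sketch below models that with a hypothetical `MaskPolicy` and `read_row` function. It is a toy, not Unity Catalog's policy syntax or evaluation engine, and the group and tag names are made up. Once a tag lands on a column, the unchanged policy starts masking it on the next read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskPolicy:
    """Hypothetical policy: mask every column carrying `tag` for non-exempt callers."""
    tag: str
    exempt_groups: frozenset[str]


def read_row(
    row: dict[str, str],
    column_tags: dict[str, set[str]],
    policy: MaskPolicy,
    caller_groups: set[str],
) -> dict[str, str]:
    """Evaluate the policy at read time against the current column tags."""
    if caller_groups & policy.exempt_groups:
        return dict(row)
    return {
        col: "****" if policy.tag in column_tags.get(col, set()) else value
        for col, value in row.items()
    }


policy = MaskPolicy("pii.email", frozenset({"privacy_officers"}))
row = {"id": "7", "email": "a@example.com"}
column_tags: dict[str, set[str]] = {}

before = read_row(row, column_tags, policy, {"analysts"})
column_tags["email"] = {"pii.email"}  # the classifier writes the tag
after = read_row(row, column_tags, policy, {"analysts"})

print(before["email"])  # a@example.com
print(after["email"])   # ****
```

Nothing about the policy changed between the two reads; only the tag did.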

What's not disclosed

  • Classifier accuracy numbers: claims of "higher accuracy than manual or regex-only tools" are not quantified with precision / recall / F1.
  • LLM cost / latency: scanning every column with an LLM has cost implications that are not disclosed (model used, sample size per column, scan frequency).
  • Custom classifier training-data minimum: how many example tagged columns are needed to learn a useful pattern.
  • Confidence-score surfacing: whether classification confidence is exposed to stewards or only the binary tag/no-tag.
  • Multi-tag emission: whether one column can be tagged by multiple classifiers in one scan.
  • Streaming-data classification: whether tables under continuous ingest are classified once on creation or rescanned on schema changes.
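
For reference, the metrics named in the first bullet can be computed from a steward-labelled sample as sketched below. The column names and counts are made up for illustration and say nothing about the actual classifier's accuracy.

```python
def precision_recall_f1(
    predicted: set[str], actual: set[str]
) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for predicted vs. truly sensitive columns."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sample: 4 columns flagged, 5 truly sensitive, 3 in common.
predicted = {"t.a", "t.b", "t.c", "t.d"}
actual = {"t.a", "t.b", "t.c", "t.e", "t.f"}

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```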

Seen in

  • sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags: GA announcement; built-in classifier coverage (GDPR / HIPAA / GLBA / DPDPA / PCI + UK / Germany / Australia / Brazil regional packs), custom classifiers in Beta, human-in-the-loop FP exclusion, consolidated visibility dashboard. Customer testimonial from Nan Wu (Software Engineer, Superhuman): "agentic Data Classification replaces manual overhead with automated, high-quality results that scale cost more with value. Data Classification can help provide continuous visibility into where key data lives across our environments. Custom classifiers can adapt to our specific data patterns, helping streamline access and compliance management."