CONCEPT Cited by 1 source

Agentic data classification¶

What it is¶

Agentic data classification is the use of multi-signal classifiers (combining pattern recognition, schema/metadata features, and LLM evaluation over sample rows) to continuously auto-tag sensitive columns in a data warehouse or lakehouse with machine-readable governance metadata, instead of relying on humans or hand-coded regex rules to keep up with schema growth.

The "agentic" framing captures three properties together: (a) the classifier applies LLM evaluation to sampled data and metadata, reading them as natural-language signals rather than only as patterns, (b) it incorporates feedback (false-positive exclusions, newly tagged columns) into future scans, and (c) it runs on a schedule and on new data rather than only on demand.

Why it matters¶

The structural problem agentic classification solves is the pace-of-data property: "New tables and data records arrive continuously, and the business expects to use them right away. If detection relies on humans, or on detection logic hand-coded into individual pipelines for every type of data that comes in, it will lag behind both the data and the demand" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).

Manual classification has three failure modes the agentic approach addresses:

Backlog accumulates faster than humans can review — the continuous-scan property closes this gap.
Regex-only rules miss semantic variants — pattern rules for email miss customer_contact, recipient_addr, notification_to, etc. LLM evaluation reads column metadata + sampled values together and recognises the semantic role of the column.
Custom business categories require code — handcrafted classifiers per company-specific category (e.g., customer_loyalty_tier) require ML engineers. Agentic classification with custom classifiers can learn the pattern from a few example tagged columns instead.
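The second failure mode is easy to reproduce. The sketch below applies a hypothetical name-based regex rule (`EMAIL_NAME_RULE`, invented for illustration and not taken from the source) to the column names mentioned above; it only shows why a pattern rule keyed on the word "email" cannot see semantic variants.

```python
import re

# Hypothetical hand-coded rule: a column is "email" if its name mentions email.
EMAIL_NAME_RULE = re.compile(r"e[-_]?mail", re.IGNORECASE)

columns = [
    "email",
    "user_email_addr",
    "customer_contact",
    "recipient_addr",
    "notification_to",
]

# Columns the rule catches versus columns it silently skips.
matched = [c for c in columns if EMAIL_NAME_RULE.search(c)]
missed = [c for c in columns if not EMAIL_NAME_RULE.search(c)]

print("matched:", matched)
print("missed: ", missed)
```

Every missed name may well hold email addresses; the rule has no way to know without reading the values or understanding the column's role.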

Three signals composed¶

Signal	What it contributes
Pattern recognition	Standard PII patterns: SSN format, credit-card Luhn check, phone-number regional patterns. Fast, high-precision on canonical formats; brittle on variants.
Metadata	Column name, data type, comments, table-name context. Catches `email_addr` even when sample values are nulled out in the scan.
LLM evaluation	Reads sampled rows + metadata together; recognises semantic categories without explicit pattern rules. Higher recall on variants and custom categories; cost-sensitive (per-column LLM cost).

The agentic classifier composes all three in one pass and emits a governed tag as output. The output substrate matters because it lets downstream attribute-based access control consume the classifications without separate integration: human stewards and automated classifiers share the same tag vocabulary.
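A minimal sketch of one way the three signals could be composed. The source does not describe the actual composition logic, so everything here is an assumption: the `Column` type, the hint vocabulary, the tag names, and the choice to try the cheap signals first and call the LLM only when they are silent (one way to limit per-column LLM cost). The LLM call is a stub, not a real API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical name/comment hints per tag.
METADATA_HINTS = {
    "email": ("email", "e-mail"),
    "ssn": ("ssn", "social_security"),
    "credit_card": ("card_number", "credit_card"),
}

@dataclass
class Column:
    name: str
    comment: str
    samples: list[str]

def luhn_ok(number: str) -> bool:
    """Luhn checksum over the digits of `number` (separators ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def pattern_signal(col: Column) -> Optional[str]:
    """Tag only when every non-empty sample matches a canonical format."""
    values = [v for v in col.samples if v]
    if not values:
        return None
    if all(SSN_RE.match(v) for v in values):
        return "ssn"
    if all(EMAIL_RE.match(v) for v in values):
        return "email"
    if all(luhn_ok(v) for v in values):
        return "credit_card"
    return None

def metadata_signal(col: Column) -> Optional[str]:
    """Tag from column name and comment, even when samples are empty."""
    text = f"{col.name} {col.comment}".lower()
    for tag, hints in METADATA_HINTS.items():
        if any(h in text for h in hints):
            return tag
    return None

def classify(col: Column, llm_signal: Callable[[Column], Optional[str]]) -> Optional[str]:
    """Cheap signals first; fall back to the (costly) LLM signal last."""
    return pattern_signal(col) or metadata_signal(col) or llm_signal(col)

def stub_llm_signal(col: Column) -> Optional[str]:
    # Stand-in for a real LLM call; a trivial heuristic so the sketch runs.
    return "email" if "contact" in col.name else None

by_pattern = classify(Column("tax_id", "", ["123-45-6789"]), stub_llm_signal)
by_metadata = classify(Column("email_addr", "", ["", ""]), stub_llm_signal)
by_llm = classify(Column("customer_contact", "", ["see CRM"]), stub_llm_signal)
by_luhn = classify(Column("pan", "", ["4111 1111 1111 1111"]), stub_llm_signal)
untagged = classify(Column("order_total", "", ["19.99"]), stub_llm_signal)
```

In this sketch `email_addr` is tagged from metadata alone because its samples are empty, and `customer_contact` reaches the LLM stub because neither cheaper signal fires.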

Custom classifiers — learning from existing tags¶

The custom-classifier shape (in Beta at the time of the Unity Catalog Data Classification GA) inverts the rule-author / rule-applier relationship:

Old shape: a domain expert writes rules → rules apply to data.
New shape: a domain expert tags a few columns by hand → the classifier learns the pattern → classifier applies to all matching columns.

The training signal is "existing tagged columns and surrounding […] metadata" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags). This shifts the human work from rule authoring (which requires articulating a pattern formally) to example tagging (which only requires recognising the category in instances). For business-specific categories that resist formal pattern rules — like customer_loyalty_tier_5_plus_gold_plus_platinum — example tagging is often the only feasible authoring mode.
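The source does not say how the classifier learns from tagged examples. As a deliberately simple stand-in, the sketch below suggests a tag for an untagged column when the tokens of its name overlap enough with a hand-tagged example (Jaccard similarity). The tag name `loyalty_tier`, the example columns, and the 0.4 threshold are all hypothetical.

```python
import re
from typing import Optional

def tokens(name: str, comment: str = "") -> set[str]:
    """Lowercased alphanumeric tokens from a column name and its comment."""
    return set(re.findall(r"[a-z0-9]+", f"{name} {comment}".lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_tag(
    candidate: set[str],
    examples: list[tuple[set[str], str]],
    threshold: float = 0.4,
) -> Optional[str]:
    """Return the tag of the most similar example, if it is similar enough."""
    best_tag, best_score = None, 0.0
    for example_tokens, tag in examples:
        score = jaccard(candidate, example_tokens)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= threshold else None

# A steward tags two columns by hand; no rule is written.
examples = [
    (tokens("customer_loyalty_tier"), "loyalty_tier"),
    (tokens("member_tier_level"), "loyalty_tier"),
]

similar = suggest_tag(tokens("loyalty_tier_code"), examples)
unrelated = suggest_tag(tokens("order_total"), examples)
```

The point of the sketch is the authoring mode: the steward supplies instances, and the matching logic (here trivial, in practice presumably far richer) generalises from them.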

Human-in-the-loop closes the precision gap¶

Agentic classifiers, especially LLM-based ones, produce false positives. The architectural shape that absorbs this is a false-positive exclusion list that feeds back into future scans:

Steward sees a tag they disagree with on a column.
Steward marks the (column × tag) pair as a false positive.
Future scans treat this column as known-not-this-tag.
Over time, the classifier's effective precision on the operational catalog rises without retraining the underlying model.

(Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags: "users can exclude any false positive detections from being tagged, which continuously improves precision of future scans.")
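The feedback loop above can be sketched as a set of excluded (column, tag) pairs that every scan consults before applying tags. The class and function names are hypothetical, and the raw detections are hard-coded to stand in for an unchanged underlying model.

```python
from dataclasses import dataclass, field

Detection = tuple[str, str]  # (column, tag)

@dataclass
class ExclusionList:
    pairs: set[Detection] = field(default_factory=set)

    def mark_false_positive(self, column: str, tag: str) -> None:
        """Steward action: this column is known not to carry this tag."""
        self.pairs.add((column, tag))

    def filter(self, detections: list[Detection]) -> list[Detection]:
        """Drop detections a steward has already rejected."""
        return [d for d in detections if d not in self.pairs]

def scan(raw_detections: list[Detection], exclusions: ExclusionList) -> list[Detection]:
    """Tags that would actually be applied after consulting the exclusion list."""
    return exclusions.filter(raw_detections)

# The model's raw output is identical on both scans (no retraining).
raw = [
    ("orders.ship_code", "ssn"),          # false positive
    ("users.ssn", "ssn"),                 # true positive
    ("orders.ship_code", "postal_code"),  # a different tag on the same column
]

exclusions = ExclusionList()
first_scan = scan(raw, exclusions)

exclusions.mark_false_positive("orders.ship_code", "ssn")
second_scan = scan(raw, exclusions)
```

Because the exclusion is keyed on the (column, tag) pair, rejecting `ssn` on `orders.ship_code` does not suppress other tags on that column or `ssn` on other columns.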

This is similar in shape to LLM-as-judge's human-aligned criteria refinement loop, but operating at the per-column tag-application granularity instead of at the rubric granularity.

Distinguishing from sibling primitives¶

Static / regex-only classifier — fast, deterministic, low recall on semantic variants. Often the prior shape.
Meta's ML classifier — sibling at a different consumer altitude (drives data-annotation labels for IFC enforcement rather than warehouse-governance tags for ABAC).
Manual SOC-2 column inventory — the human-only baseline; what agentic classification automates away.
Hand-coded ingestion-pipeline tagging — bespoke rules embedded in each data pipeline; doesn't survive new data sources or pipeline rewrites.

Operational properties¶

Continuous scan: runs on a schedule and when new tables appear. It operates at the catalog level rather than inside each pipeline, so it doesn't depend on each pipeline integrating classification.
Per-column LLM cost: sampling and LLM inference are paid per column, so total cost scales with column count and scan frequency.
Tag substrate: output is the same governed-tag vocabulary human stewards use, so the consumers (ABAC, audit, dashboards) treat human-applied and classifier-applied tags identically.
Coverage dashboard: the classifier output is observable as a population-level coverage view: what fraction of columns are classified, which tags appear most frequently, and which classifications lack ABAC policies enforcing protection.
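The three coverage questions reduce to simple aggregations over the tag assignments. The sketch below assumes a hypothetical in-memory mapping from column to tags and a set of tags that already have an ABAC policy; a real dashboard would read both from the catalog.

```python
from collections import Counter

def coverage_report(column_tags: dict[str, set[str]], tags_with_policy: set[str]) -> dict:
    """Population-level view of classifier output."""
    total = len(column_tags)
    classified = sum(1 for tags in column_tags.values() if tags)
    frequency = Counter(tag for tags in column_tags.values() for tag in tags)
    return {
        # Fraction of columns carrying at least one tag.
        "classified_fraction": classified / total if total else 0.0,
        # Tags ordered from most to least frequent.
        "tag_frequency": frequency.most_common(),
        # Tags in use that no ABAC policy currently enforces.
        "tags_without_policy": sorted(set(frequency) - tags_with_policy),
    }

column_tags = {
    "users.email": {"email"},
    "users.ssn": {"ssn"},
    "orders.contact": {"email"},
    "orders.total": set(),
}

report = coverage_report(column_tags, tags_with_policy={"ssn"})
```

Here `tags_without_policy` surfaces the gap between classification and enforcement: `email` is detected on two columns but nothing yet masks or filters it.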

Seen in¶

sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags — Unity Catalog Data Classification GA. "Using proven pattern recognition, metadata, and large language models, it delivers higher accuracy than manual or regex-only tools." Built-in classifiers cover GDPR / HIPAA / GLBA / DPDPA / PCI plus regional packs. Custom classifiers in Beta learn from existing tagged columns and Unity Catalog metadata. Human-in-the-loop FP exclusion. Output substrate: governed tags consumed by ABAC.

concepts/data-classification-tagging — the substrate concept; agentic classification is the auto-tagging variant.
concepts/governed-tag — output language.
concepts/llm-as-judge — adjacent LLM-in-the-loop pattern at data-pipeline-quality altitude.
systems/unity-catalog-data-classification — the canonical Databricks instance.
systems/unity-catalog-governed-tags — output destination.
systems/meta-data-classifier — sibling at IFC-consumer altitude.
patterns/tag-driven-attribute-based-access-control — the consumer pattern that closes the loop from classification to enforcement.