CONCEPT Cited by 1 source
Agentic data classification
What it is
Agentic data classification is the use of multi-signal classifiers — combining pattern recognition, schema/metadata features, and LLM evaluation over sample rows — to continuously auto-tag sensitive columns in a data warehouse / lake-house with machine-readable governance metadata, instead of relying on humans or hand-coded regex rules to keep up with schema growth.
The "agentic" framing captures three properties together: (a) the classifier evaluates LLM-style natural-language signals over sampled data + metadata, (b) it incorporates feedback (false-positive exclusions, newly tagged columns) into future scans, and (c) it runs on a schedule + on new data rather than only on demand.
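Property (c) can be pictured as a selection step that runs on every tick of a scheduler and picks tables that are either new or overdue for a rescan. The sketch below is only an illustration under assumed names: `TableState`, `tables_due`, and the 7-day interval are hypothetical and do not correspond to any product API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TableState:
    name: str
    last_scanned: Optional[datetime]  # None means the table has never been scanned


def tables_due(tables: list[TableState], now: datetime, interval: timedelta) -> list[str]:
    """Return names of tables that are new or whose last scan is older than `interval`."""
    return [
        t.name
        for t in tables
        if t.last_scanned is None or now - t.last_scanned >= interval
    ]


now = datetime(2026, 5, 13)
catalog = [
    TableState("sales.orders", datetime(2026, 5, 12)),   # scanned yesterday
    TableState("crm.new_leads", None),                   # new table
    TableState("hr.employees", datetime(2026, 5, 1)),    # overdue
]
due = tables_due(catalog, now, interval=timedelta(days=7))
```

Here `due` contains the new table and the overdue one, while the recently scanned table is skipped until its interval elapses.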
Why it matters
The structural problem agentic classification solves is the pace-of-data property: "New tables and data records arrive continuously, and the business expects to use them right away. If detection relies on humans, or on detection logic hand-coded into individual pipelines for every type of data that comes in, it will lag behind both the data and the demand" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).
Manual classification has three failure modes the agentic approach addresses:
- Backlog accumulates faster than humans can review — the continuous-scan property closes this gap.
- Regex-only rules miss semantic variants: pattern rules for email miss customer_contact, recipient_addr, notification_to, etc. LLM evaluation reads column metadata + sampled values together and recognises the semantic role of the column.
- Custom business categories require code: handcrafted classifiers per company-specific category (e.g., customer_loyalty_tier) require ML engineers. Agentic classification with custom classifiers can learn the pattern from a few example tagged columns instead.
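The regex-only failure mode is easy to reproduce. The following sketch assumes a hand-coded name rule of the kind a pipeline author might write (`EMAIL_NAME_RULE` and the helper names are hypothetical) and runs it against the column names mentioned above.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical hand-coded rule: a column is "email" if its name mentions email/e-mail/e_mail.
EMAIL_NAME_RULE = re.compile(r"e[-_]?mail", re.IGNORECASE)

# Deliberately simple value-level pattern for email-shaped strings.
EMAIL_VALUE_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def name_rule_hits(column_name: str) -> bool:
    return EMAIL_NAME_RULE.search(column_name) is not None


def value_rule_hit_rate(samples: list[Optional[str]]) -> float:
    """Fraction of non-null samples that look like an email address."""
    non_null = [s for s in samples if s]
    if not non_null:
        return 0.0
    return sum(bool(EMAIL_VALUE_RULE.match(s)) for s in non_null) / len(non_null)


columns = ["email", "customer_contact", "recipient_addr", "notification_to"]
missed = [c for c in columns if not name_rule_hits(c)]
```

Only the canonical name matches; the three semantic variants end up in `missed`. A value-level rule such as `value_rule_hit_rate` can recover some of them, but only when sample values are present and follow the canonical format.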
Three signals composed
| Signal | What it contributes |
|---|---|
| Pattern recognition | Standard PII patterns: SSN format, credit-card Luhn check, phone-number regional patterns. Fast, high-precision on canonical formats; brittle on variants. |
| Metadata | Column name, data type, comments, table-name context. Catches email_addr even when sample values are nulled out in the scan. |
| LLM evaluation | Reads sampled rows + metadata together; recognises semantic categories without explicit pattern rules. Higher recall on variants and custom categories; cost-sensitive (per-column LLM cost). |
The agentic classifier composes all three in one pass and emits a governed tag as output. The output substrate matters because it lets downstream attribute-based access control consume the classifications without separate integration — same vocabulary between human stewards and automated classifiers.
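One way to picture the composition is a weighted vote over per-tag scores from each signal. The sketch below is an assumption-laden illustration, not the Unity Catalog implementation: the tag names, hint table, weights, threshold, and the `fake_llm` stub (standing in for a real model call) are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Scores = Dict[str, float]
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# Hypothetical name/comment hints per tag.
METADATA_HINTS = {
    "pii.email": ("email", "e_mail"),
    "pii.ssn": ("ssn", "social_security"),
    "pii.credit_card": ("card_number",),
}


@dataclass
class Column:
    name: str
    dtype: str
    comment: str
    samples: list[Optional[str]]


def luhn_ok(value: str) -> bool:
    """Luhn checksum over a 13-19 digit number, ignoring spaces and dashes."""
    compact = value.replace(" ", "").replace("-", "")
    if not compact.isdigit() or not 13 <= len(compact) <= 19:
        return False
    total = 0
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def pattern_signal(col: Column) -> Scores:
    values = [v for v in col.samples if v]
    if not values:
        return {}
    scores: Scores = {}
    ssn = sum(bool(SSN_RE.match(v)) for v in values) / len(values)
    card = sum(luhn_ok(v) for v in values) / len(values)
    if ssn:
        scores["pii.ssn"] = ssn
    if card:
        scores["pii.credit_card"] = card
    return scores


def metadata_signal(col: Column) -> Scores:
    text = f"{col.name} {col.comment}".lower()
    return {tag: 1.0 for tag, hints in METADATA_HINTS.items() if any(h in text for h in hints)}


def classify(col: Column, llm_signal: Callable[[Column], Scores],
             weights=(0.5, 0.2, 0.3), threshold: float = 0.4) -> list[str]:
    """Combine pattern, metadata, and LLM scores; return tags at or above the threshold."""
    combined: Scores = {}
    signals = (pattern_signal(col), metadata_signal(col), llm_signal(col))
    for weight, signal in zip(weights, signals):
        for tag, score in signal.items():
            combined[tag] = combined.get(tag, 0.0) + weight * score
    return sorted(tag for tag, score in combined.items() if score >= threshold)


def fake_llm(col: Column) -> Scores:
    # Stub standing in for an LLM call over sampled rows + metadata.
    return {"pii.email": 0.9} if "addr" in col.name else {}


email_col = Column("email_addr", "string", "", [None, None])  # samples nulled out
card_col = Column("pay_ref", "string", "", ["4111 1111 1111 1111", "5500-0000-0000-0004"])
email_tags = classify(email_col, fake_llm)
card_tags = classify(card_col, fake_llm)
```

With these assumed weights, the email column is tagged from metadata plus the LLM stub even though no sample values are available, and the card column is tagged from the pattern signal alone despite its uninformative name.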
Custom classifiers: learning from existing tags
The custom-classifier shape (in Beta at the time of the Unity Catalog Data Classification GA) inverts the rule-author / rule-applier relationship:
- Old shape: a domain expert writes rules → rules apply to data.
- New shape: a domain expert tags a few columns by hand → the classifier learns the pattern → classifier applies to all matching columns.
The training signal is "existing tagged columns and surrounding […] metadata" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags). This shifts the human work from rule authoring (which requires articulating a pattern formally) to example tagging (which only requires recognising the category in instances). For business-specific categories that resist formal pattern rules, like customer_loyalty_tier_5_plus_gold_plus_platinum, example tagging is often the only feasible authoring mode.
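The source does not describe how the custom classifier learns from tagged columns. As a rough stand-in, the sketch below uses nearest-example matching on column-name tokens (Jaccard similarity) to show the authoring mode: the steward supplies examples, not rules. `ExampleTagClassifier`, the tag name, and the 0.5 threshold are hypothetical.

```python
from __future__ import annotations

import re


def tokens(name: str) -> set[str]:
    """Split a column name into lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class ExampleTagClassifier:
    """Applies a tag to columns whose names resemble hand-tagged examples."""

    def __init__(self, tag: str, threshold: float = 0.5) -> None:
        self.tag = tag
        self.threshold = threshold
        self._examples: list[set[str]] = []

    def add_example(self, column_name: str) -> None:
        self._examples.append(tokens(column_name))

    def score(self, column_name: str) -> float:
        candidate = tokens(column_name)
        return max((jaccard(candidate, ex) for ex in self._examples), default=0.0)

    def matches(self, column_name: str) -> bool:
        return self.score(column_name) >= self.threshold


clf = ExampleTagClassifier("biz.loyalty_tier")
clf.add_example("customer_loyalty_tier")   # tagged by hand
clf.add_example("loyalty_tier_code")       # tagged by hand
```

With these two examples, `clf.matches("member_loyalty_tier")` holds while `clf.matches("order_total")` does not. A production classifier would presumably also use sampled values and richer metadata; the point here is only that no formal pattern was written.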
Human-in-the-loop closes the precision gap
Agentic classifiers, especially LLM-based, produce false positives. The architectural shape that absorbs this is a false-positive exclusion list that feeds back into future scans:
- Steward sees a tag they disagree with on a column.
- Steward marks the (column × tag) pair as a false positive.
- Future scans treat this column as known-not-this-tag.
- Over time, the classifier's effective precision on the operational catalog rises without retraining the underlying model.
(Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags: "users can exclude any false positive detections from being tagged, which continuously improves precision of future scans.")
This is similar in shape to LLM-as-judge's human-aligned criteria refinement loop, but operating at the per-column tag-application granularity instead of at the rubric granularity.
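The exclusion loop can be sketched as a set of (column, tag) pairs that filters classifier proposals before tags are written. This is a minimal illustration under assumed names (`ExclusionList`, `apply_scan`, and the column paths are hypothetical); the classifier proposals are held constant across scans to show that precision improves without retraining.

```python
from __future__ import annotations


class ExclusionList:
    """Steward-maintained set of (column, tag) pairs known to be false positives."""

    def __init__(self) -> None:
        self._pairs: set[tuple[str, str]] = set()

    def mark_false_positive(self, column: str, tag: str) -> None:
        self._pairs.add((column, tag))

    def is_excluded(self, column: str, tag: str) -> bool:
        return (column, tag) in self._pairs


def apply_scan(proposals: dict[str, set[str]], exclusions: ExclusionList) -> dict[str, set[str]]:
    """Drop excluded (column, tag) pairs from the classifier's proposals."""
    kept: dict[str, set[str]] = {}
    for column, tags in proposals.items():
        surviving = {t for t in tags if not exclusions.is_excluded(column, t)}
        if surviving:
            kept[column] = surviving
    return kept


# The underlying model proposes the same tags on every scan.
proposals = {
    "main.crm.users.email": {"pii.email"},
    "main.ops.jobs.notify_template": {"pii.email"},  # actually a template name, not PII
}
exclusions = ExclusionList()
scan_1 = apply_scan(proposals, exclusions)

exclusions.mark_false_positive("main.ops.jobs.notify_template", "pii.email")
scan_2 = apply_scan(proposals, exclusions)
```

The first scan tags both columns; after the steward's correction, the second scan tags only the genuine email column. Keying on the pair rather than the column alone means other tags can still be applied to an excluded column.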
Distinguishing from sibling primitives
- Static / regex-only classifier — fast, deterministic, low recall on semantic variants. Often the prior shape.
- Meta's ML classifier — sibling at a different consumer altitude (drives data-annotation labels for IFC enforcement rather than warehouse-governance tags for ABAC).
- Manual SOC-2 column inventory — the human-only baseline; what agentic classification automates away.
- Hand-coded ingestion-pipeline tagging — bespoke rules embedded in each data pipeline; doesn't survive new data sources or pipeline rewrites.
Operational properties
- Continuous scan: runs on a schedule and when new tables appear, independently of ingestion pipelines, so coverage doesn't depend on each pipeline integrating classification.
- Per-column LLM cost — sampling + LLM inference cost is per-column, scales with column count + scan frequency.
- Tag substrate — output is the same governed-tag vocabulary human stewards use, so the consumers (ABAC, audit, dashboards) treat human-applied and classifier-applied tags identically.
- Coverage dashboard — the classifier output is observable as a population-level coverage view: what fraction of columns are classified, what tags appear most frequently, which classifications lack ABAC policies enforcing protection.
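The coverage view reduces to a few aggregates over the column-to-tags mapping. The sketch below assumes that mapping and the set of tags with an enforcing policy are available as plain Python collections; `coverage_report` and its field names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def coverage_report(column_tags: dict[str, set[str]], tags_with_policy: set[str]) -> dict:
    """Summarise classification coverage and tags that no policy protects."""
    total = len(column_tags)
    classified = sum(1 for tags in column_tags.values() if tags)
    counts = Counter(tag for tags in column_tags.values() for tag in tags)
    return {
        "classified_fraction": classified / total if total else 0.0,
        "top_tags": counts.most_common(3),
        "unprotected_tags": sorted(set(counts) - tags_with_policy),
    }


catalog = {
    "users.email": {"pii.email"},
    "users.ssn": {"pii.ssn"},
    "orders.contact": {"pii.email"},
    "orders.total": set(),  # scanned, nothing sensitive found
}
report = coverage_report(catalog, tags_with_policy={"pii.ssn"})
```

In this example three of four columns carry a tag, `pii.email` is the most frequent tag, and it is also the one classification with no policy behind it, which is the gap such a dashboard is meant to surface.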
Seen in
- sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags — Unity Catalog Data Classification GA. "Using proven pattern recognition, metadata, and large language models, it delivers higher accuracy than manual or regex-only tools." Built-in classifiers cover GDPR / HIPAA / GLBA / DPDPA / PCI plus regional packs. Custom classifiers in Beta learn from existing tagged columns and Unity Catalog metadata. Human-in-the-loop FP exclusion. Output substrate: governed tags consumed by ABAC.
Related
- concepts/data-classification-tagging — the substrate concept; agentic classification is the auto-tagging variant.
- concepts/governed-tag — output language.
- concepts/llm-as-judge — adjacent LLM-in-the-loop pattern at data-pipeline-quality altitude.
- systems/unity-catalog-data-classification — the canonical Databricks instance.
- systems/unity-catalog-governed-tags — output destination.
- systems/meta-data-classifier — sibling at IFC-consumer altitude.
- patterns/tag-driven-attribute-based-access-control — the consumer pattern that closes the loop from classification to enforcement.