
CONCEPT Cited by 1 source

Agentic data classification

What it is

Agentic data classification is the use of multi-signal classifiers (combining pattern recognition, schema/metadata features, and LLM evaluation over sample rows) to continuously auto-tag sensitive columns in a data warehouse or lakehouse with machine-readable governance metadata, instead of relying on humans or hand-coded regex rules to keep up with schema growth.

The "agentic" framing captures three properties together: (a) the classifier applies LLM evaluation to sampled data and metadata; (b) it incorporates feedback (false-positive exclusions, newly tagged columns) into future scans; and (c) it runs on a schedule and on new data rather than only on demand.

Why it matters

The structural problem agentic classification solves is the pace-of-data property: "New tables and data records arrive continuously, and the business expects to use them right away. If detection relies on humans, or on detection logic hand-coded into individual pipelines for every type of data that comes in, it will lag behind both the data and the demand" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).

Manual classification has three failure modes the agentic approach addresses:

  1. Backlog accumulates faster than humans can review: the continuous-scan property closes this gap.
  2. Regex-only rules miss semantic variants: a pattern rule keyed on "email" misses columns named customer_contact, recipient_addr, notification_to, etc. LLM evaluation reads column metadata and sampled values together and recognises the semantic role of the column.
  3. Custom business categories require code: handcrafted classifiers for each company-specific category (e.g., customer_loyalty_tier) require ML engineers. Agentic classification with custom classifiers can instead learn the pattern from a few example tagged columns.
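
The second failure mode can be illustrated with a small sketch. It assumes the hand-coded rule keys on column names; the rule itself (EMAIL_NAME_RULE) is hypothetical and only the column names come from the text above.

```python
import re

# Hypothetical name-based rule of the kind a hand-coded pipeline might carry.
EMAIL_NAME_RULE = re.compile(r"e[-_]?mail", re.IGNORECASE)

columns = [
    "email",
    "user_email",
    "customer_contact",
    "recipient_addr",
    "notification_to",
]

# Columns whose name matches the rule vs. semantic variants it cannot see.
matched = [c for c in columns if EMAIL_NAME_RULE.search(c)]
missed = [c for c in columns if not EMAIL_NAME_RULE.search(c)]

print("matched:", matched)
print("missed:", missed)
```

Every column in the list plausibly holds email addresses, but the rule only sees the ones whose name happens to contain the expected token.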

Three signals composed

Signal | What it contributes
Pattern recognition | Standard PII patterns: SSN format, credit-card Luhn check, phone-number regional patterns. Fast, high-precision on canonical formats; brittle on variants.
Metadata | Column name, data type, comments, table-name context. Catches email_addr even when sample values are nulled out in the scan.
LLM evaluation | Reads sampled rows + metadata together; recognises semantic categories without explicit pattern rules. Higher recall on variants and custom categories; cost-sensitive (per-column LLM cost).
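
As a concrete instance of the pattern-recognition signal, here is a minimal sketch of the Luhn checksum used to validate candidate credit-card numbers. The function name luhn_valid is illustrative, and the sketch says nothing about how any particular product wires the check into its scans.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A checksum like this is cheap and precise on canonical formats, which is why it suits the first signal; it contributes nothing for values that are not digit strings.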

The agentic classifier composes all three in one pass and emits a governed tag as output. The output substrate matters because it lets downstream attribute-based access control consume the classifications without separate integration — same vocabulary between human stewards and automated classifiers.
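
One way to picture the composition is the sketch below. The source does not describe how the three signals are ordered or weighted, so the cheap-first fallback here is an assumption, as are the Column type, the tag names (pii.ssn, pii.email), and the stubbed LLM function.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Column:
    name: str
    comment: str
    samples: List[str]


# Hypothetical value pattern; real deployments use their own governed tags.
SSN_VALUE = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def pattern_signal(col: Column) -> Optional[str]:
    """Value-format check: fires only if every non-empty sample looks like an SSN."""
    values = [v for v in col.samples if v]
    if values and all(SSN_VALUE.match(v) for v in values):
        return "pii.ssn"
    return None


def metadata_signal(col: Column) -> Optional[str]:
    """Name/comment check: works even when the sampled values are empty."""
    text = f"{col.name} {col.comment}".lower()
    return "pii.email" if "email" in text else None


def stub_llm_signal(col: Column) -> Optional[str]:
    """Stand-in for an LLM call; hard-coded so the sketch runs offline."""
    return "pii.email" if col.name == "customer_contact" else None


def classify(col: Column, llm_signal: Callable[[Column], Optional[str]]) -> Optional[str]:
    # Assumed ordering: cheap signals first, the LLM only when they abstain.
    for signal in (pattern_signal, metadata_signal, llm_signal):
        tag = signal(col)
        if tag is not None:
            return tag
    return None


print(classify(Column("email_addr", "", ["", ""]), stub_llm_signal))
print(classify(Column("customer_contact", "", ["a@example.com"]), stub_llm_signal))
```

Whatever the real composition logic is, the important part is the return value: a single tag from the shared vocabulary, which is what lets ABAC consume the result without a separate integration.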

Custom classifiers — learning from existing tags

The custom-classifier shape (in Beta in Unity Catalog at the time Data Classification reached GA) inverts the rule-author / rule-applier relationship:

  • Old shape: a domain expert writes rules → rules apply to data.
  • New shape: a domain expert tags a few columns by hand → the classifier learns the pattern → classifier applies to all matching columns.

The training signal is "existing tagged columns and surrounding […] metadata" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags). This shifts the human work from rule authoring (which requires articulating a pattern formally) to example tagging (which only requires recognising the category in instances). For business-specific categories that resist formal pattern rules — like customer_loyalty_tier_5_plus_gold_plus_platinum — example tagging is often the only feasible authoring mode.
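
The shift from rule authoring to example tagging can be sketched with a deliberately simple stand-in. The source does not say how custom classifiers learn, so the token-overlap (Jaccard) matching below is not the product's method; it only shows the shape of the workflow, and the function names, the biz.loyalty_tier tag, and the threshold are all hypothetical.

```python
import re
from typing import Dict, Optional, Set


def name_tokens(column_name: str) -> Set[str]:
    """Split a column name into lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_tag(
    candidate: str,
    tagged_examples: Dict[str, str],
    threshold: float = 0.4,
) -> Optional[str]:
    """Suggest the tag of the most similar hand-tagged column, if similar enough."""
    cand = name_tokens(candidate)
    best_tag, best_score = None, 0.0
    for example_name, tag in tagged_examples.items():
        score = jaccard(cand, name_tokens(example_name))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= threshold else None


# The steward's only input: a few columns tagged by hand.
tagged_examples = {
    "customer_loyalty_tier": "biz.loyalty_tier",
    "loyalty_level": "biz.loyalty_tier",
}

print(suggest_tag("member_loyalty_tier", tagged_examples))
print(suggest_tag("shipping_zip", tagged_examples))
```

Note that the steward never wrote a rule; they supplied two examples. A real classifier would also draw on the surrounding metadata the source mentions, not just column names.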

Human-in-the-loop closes the precision gap

Agentic classifiers, especially LLM-based, produce false positives. The architectural shape that absorbs this is a false-positive exclusion list that feeds back into future scans:

  • Steward sees a tag they disagree with on a column.
  • Steward marks the (column × tag) pair as a false positive.
  • Future scans treat this column as known-not-this-tag.
  • Over time, the classifier's effective precision on the operational catalog rises without retraining the underlying model.

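As the source puts it, "users can exclude any false positive detections from being tagged, which continuously improves precision of future scans" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).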
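
The feedback loop above can be sketched as a set of excluded (column, tag) pairs that is consulted on every scan. This is a minimal illustration under assumed names (apply_exclusions, the crm.* columns, the pii.* tags), not a description of how Unity Catalog stores exclusions.

```python
from typing import List, Set, Tuple

Detection = Tuple[str, str]  # (fully qualified column, tag)


def apply_exclusions(
    detections: List[Detection],
    exclusions: Set[Detection],
) -> List[Detection]:
    """Drop any (column, tag) pair a steward has marked as a false positive."""
    return [d for d in detections if d not in exclusions]


exclusions: Set[Detection] = set()

# Raw classifier output; assume the model keeps producing it on every scan.
raw_detections = [
    ("crm.notes.body", "pii.ssn"),
    ("crm.users.email_addr", "pii.email"),
]

# Scan 1: nothing excluded yet, so both tags are applied.
scan_1 = apply_exclusions(raw_detections, exclusions)

# A steward rejects one pairing; the model itself is not retrained.
exclusions.add(("crm.notes.body", "pii.ssn"))

# Scan 2: the rejected pairing is suppressed, the other survives.
scan_2 = apply_exclusions(raw_detections, exclusions)

print(scan_1)
print(scan_2)
```

Keying the exclusion on the pair rather than on the column alone means the same column can still receive a different tag later.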

This is similar in shape to LLM-as-judge's human-aligned criteria refinement loop, but operating at the per-column tag-application granularity instead of at the rubric granularity.

Distinguishing from sibling primitives

  • Static / regex-only classifier — fast, deterministic, low recall on semantic variants. Often the prior shape.
  • Meta's ML classifier — sibling at a different consumer altitude (drives data-annotation labels for IFC enforcement rather than warehouse-governance tags for ABAC).
  • Manual SOC-2 column inventory — the human-only baseline; what agentic classification automates away.
  • Hand-coded ingestion-pipeline tagging — bespoke rules embedded in each data pipeline; doesn't survive new data sources or pipeline rewrites.

Operational properties

  • Continuous scan: runs on a schedule and on new tables. Because it is not triggered from inside individual pipelines, it doesn't depend on each pipeline integrating classification.
  • Per-column LLM cost: sampling and LLM inference are paid per column, so total cost grows with both column count and scan frequency.
  • Tag substrate: output is the same governed-tag vocabulary human stewards use, so the consumers (ABAC, audit, dashboards) treat human-applied and classifier-applied tags identically.
  • Coverage dashboard: the classifier output is observable as a population-level coverage view: what fraction of columns are classified, what tags appear most frequently, which classifications lack ABAC policies enforcing protection.
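
The three coverage questions in the last bullet reduce to simple aggregations over the tag assignments. The sketch below assumes the assignments are available as a mapping from column to tags, plus a set of tags that already have an ABAC policy; both inputs, the function name, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def coverage_report(
    column_tags: Dict[str, List[str]],
    tags_with_policy: Set[str],
) -> dict:
    """Summarise classification coverage across a catalog."""
    total = len(column_tags)
    classified = sum(1 for tags in column_tags.values() if tags)
    counts = Counter(t for tags in column_tags.values() for t in tags)
    return {
        # Fraction of columns carrying at least one tag.
        "classified_fraction": classified / total if total else 0.0,
        # Tags ordered from most to least frequent.
        "tag_counts": counts.most_common(),
        # Tags in use that no ABAC policy references yet.
        "tags_without_policy": sorted(set(counts) - tags_with_policy),
    }


column_tags = {
    "crm.users.email_addr": ["pii.email"],
    "crm.users.backup_email": ["pii.email"],
    "hr.staff.tax_id": ["pii.ssn"],
    "sales.orders.total": [],
}

report = coverage_report(column_tags, tags_with_policy={"pii.email"})
print(report)
```

The last field is the one that connects classification to enforcement: a tag that appears in the catalog but in no policy marks data that has been identified but is not yet protected.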

Seen in

  • sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags — Unity Catalog Data Classification GA. "Using proven pattern recognition, metadata, and large language models, it delivers higher accuracy than manual or regex-only tools." Built-in classifiers cover GDPR / HIPAA / GLBA / DPDPA / PCI plus regional packs. Custom classifiers in Beta learn from existing tagged columns and Unity Catalog metadata. Human-in-the-loop FP exclusion. Output substrate: governed tags consumed by ABAC.