PATTERN Cited by 1 source
Two-pass PII classifier with agentic second pass
A PII / data-classification pattern that layers a fast per-column classifier with an agentic full-table-context classifier. Pass 1 is cheap and runs on every column continuously; Pass 2 is expensive but only runs for columns Pass 1 flags as suspicious. The architectural argument is that per-column classification has limited recall on opaque identifiers that are only PII when joined — the agentic pass takes full table context to close that gap.
Cloudflare Skimmer is the canonical wiki instance, from the Town Lake / Skipper launch post.
The architectural argument
"It does this in two passes: first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify."
Two structural reasons a single-pass classifier doesn't suffice:
- Opaque IDs are PII when joined: an internal `acct_xyz` ID isn't visibly PII, but if it joins to a user table, it's effectively a user identifier. Per-column inspection can't see the join.
- Format-ambiguous values: a column of strings might be product names, internal codes, or hashed emails. Without table context, the classifier can't disambiguate.
The two-pass shape gives both efficiency (most columns are clearly non-PII and never need Pass 2) and recall (the long tail gets the full-context treatment).
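The control flow described above can be sketched as a small orchestration function. This is a minimal illustration, not Skimmer's implementation: `Verdict`, `Column`, `classify_table`, and the two callables are hypothetical names introduced here, and the real passes would be model calls rather than plain functions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Verdict(Enum):
    PII = "pii"
    NOT_PII = "not_pii"
    FLAGGED = "flagged"  # Pass 1 could not decide


@dataclass
class Column:
    table: str
    name: str
    samples: List[str]


def classify_table(
    columns: List[Column],
    fast_pass: Callable[[Column], Verdict],
    agentic_pass: Callable[[Column, List[Column]], Verdict],
) -> Dict[str, Verdict]:
    """Run the cheap pass on every column; escalate only flagged ones."""
    results: Dict[str, Verdict] = {}
    for col in columns:
        verdict = fast_pass(col)
        if verdict is Verdict.FLAGGED:
            # Pass 2 receives the whole table as context, not just the column.
            verdict = agentic_pass(col, columns)
        results[col.name] = verdict
    return results
```

The expensive callable is invoked only on the flagged subset, which is where the efficiency claim comes from.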
Pass 1: per-column fast classifier
| Property | Value |
|---|---|
| Substrate | Workers AI (LLM inference, fast model) |
| Input | Sample of values from a single column |
| Output | One of: clearly PII (specific type) / clearly not PII / flagged for review |
| Cost | Cheap per column; runs on every column continuously |
| Throughput | Designed for full-fleet coverage |
"Skimmer's classifier is reasonably good: it catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users)."
The "obvious PII" set (emails, IPs, names, phone numbers) is where format-based classification works well — patterns are regular and well-known.
The "long tail of non-obvious sensitive data" — API tokens with recognisable prefixes, structured opaque IDs — is where Pass 1 flags but doesn't decide.
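To make the three-way output contract concrete, here is a rough sketch of a per-column classifier. Note the substitution: the source describes Pass 1 as LLM inference on Workers AI, whereas this sketch uses regular expressions purely as a stand-in. The token prefixes, the opaque-ID shape, the `threshold` parameter, and the function name are all assumptions for illustration.

```python
import re
from typing import Callable, List, Tuple

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IPV4 = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
# Hypothetical token prefixes and opaque-ID shape, for illustration only.
TOKEN_PREFIXES = ("sk_", "tok_")
OPAQUE_ID = re.compile(r"^[a-z]+_[A-Za-z0-9]{6,}$")


def fast_classify(samples: List[str], threshold: float = 0.9) -> Tuple[str, str]:
    """Return (verdict, detail) where verdict is 'pii', 'not_pii' or 'flagged'."""
    values = [s for s in samples if s]
    if not values:
        # No evidence either way: escalate rather than guess.
        return ("flagged", "no_samples")

    def share(pred: Callable[[str], object]) -> float:
        return sum(1 for v in values if pred(v)) / len(values)

    if share(lambda v: EMAIL.match(v)) >= threshold:
        return ("pii", "email")
    if share(lambda v: IPV4.match(v)) >= threshold:
        return ("pii", "ip_address")
    # Token-like or opaque-ID-like values are suspicious but undecidable
    # from a single column, so they are flagged instead of classified.
    if share(lambda v: v.startswith(TOKEN_PREFIXES) or OPAQUE_ID.match(v)) >= threshold:
        return ("flagged", "opaque_or_token")
    return ("not_pii", "none")
```

The point of the sketch is the shape of the return value: regular formats get a decision, while ambiguous ones get a flag that defers the decision to Pass 2.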
Pass 2: agentic classifier with full table context
| Property | Value |
|---|---|
| Substrate | LLM agent with tool access |
| Input | The full table schema, the flagged column, sampled values |
| Tools | Direct Trino query access: can run `SELECT DISTINCT`, `JOIN` against suspected dimension tables, `COUNT(*)` with `GROUP BY`, etc. |
| Output | Confirmed classification (specific PII type or not-PII) with rationale |
| Cost | Expensive (LLM calls + Trino queries); runs only for flagged columns |
| Throughput | Lower than Pass 1; bounded by flagging rate |
The agentic shape gives the classifier active investigation authority: it can query the data directly to verify suspicions, rather than relying solely on the sampled rows it was passed.
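One way to picture "active investigation authority" is a bounded loop in which a planner repeatedly chooses either to run another query or to commit to a verdict. The sketch below assumes a hypothetical `plan_next` callable standing in for the LLM and a hypothetical `run_sql` callable standing in for a Trino client; the query budget and the `needs_review` fallback are also assumptions, not details from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    sql: Optional[str] = None      # next verifying query, if any
    verdict: Optional[str] = None  # final classification, once decided
    rationale: str = ""


@dataclass
class Investigation:
    column: str
    schema: List[str]
    samples: List[str]
    evidence: List[Tuple[str, list]] = field(default_factory=list)


def investigate(
    inv: Investigation,
    plan_next: Callable[[Investigation], Step],
    run_sql: Callable[[str], list],
    max_queries: int = 5,
) -> Step:
    """Let the planner gather evidence until it decides or the budget runs out."""
    queries_run = 0
    while True:
        step = plan_next(inv)
        if step.verdict is not None:
            return step
        if step.sql is None or queries_run >= max_queries:
            # Fall back to a human decision instead of guessing.
            return Step(verdict="needs_review",
                        rationale="no verdict within query budget")
        inv.evidence.append((step.sql, run_sql(step.sql)))
        queries_run += 1
```

The planner sees the accumulated `evidence` on every turn, so each query can depend on what the previous one returned. That is the difference from handing the classifier a larger fixed sample.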
Why "agentic" specifically (not just "more context")
The choice is structural. Three options for closing the recall gap on Pass 1:
- Just give the classifier more rows — doesn't help with opaque IDs that need cross-table joins to verify.
- Pre-compute join statistics — works for known FK relationships, fails on undeclared / informal joins.
- Agentic — the classifier picks the verifying queries based on what it sees in Pass 1's flag, including queries Pass 1 couldn't have anticipated.
The agentic design generalises: as the data platform's structure evolves, the agent can query new tables and follow new join paths without code changes.
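As an illustration of the kind of verifying query an agent might choose for an opaque ID, the sketch below measures how many distinct values of a flagged column also appear as keys in a suspected user table. It uses Python's built-in `sqlite3` as a local stand-in for Trino so that it runs anywhere. The function name and all table and column names are hypothetical, and identifiers are interpolated directly, so they must come from trusted metadata rather than user input.

```python
import sqlite3


def join_overlap(
    conn: sqlite3.Connection,
    flagged_table: str,
    flagged_col: str,
    dim_table: str,
    dim_col: str,
) -> float:
    """Fraction of distinct flagged values that match a key in dim_table."""
    sql = f"""
        SELECT
            COUNT(DISTINCT f.{flagged_col}) AS total,
            COUNT(DISTINCT d.{dim_col})     AS matched
        FROM {flagged_table} AS f
        LEFT JOIN {dim_table} AS d
            ON f.{flagged_col} = d.{dim_col}
    """
    total, matched = conn.execute(sql).fetchone()
    # An empty flagged column yields no evidence of a join.
    return matched / total if total else 0.0
```

A high overlap suggests the opaque ID is effectively a user identifier. Which dimension table to test is exactly the choice that a precomputed statistic cannot make for undeclared joins.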
Composes with default-closed governance
The two-pass classifier is the load-bearing automation behind default-closed table allowlisting. "Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds" is only true if the classifier has both:
- High enough recall to catch the long-tail PII (Pass 2 closes that gap).
- High enough precision that reviewers aren't drowning in false positives (Pass 1 + Pass 2 together).
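The two requirements above correspond to the standard recall and precision metrics. As a small worked sketch, assuming reviewer decisions are available as ground truth (the function name and input shape are hypothetical), both can be computed from pairs of predicted and actual labels:

```python
from typing import Iterable, Tuple


def precision_recall(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """pairs holds (classifier_said_pii, reviewer_said_pii) per column."""
    tp = fp = fn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1   # correctly detected PII
        elif predicted:
            fp += 1   # false alarm the reviewer had to dismiss
        elif actual:
            fn += 1   # PII the classifier missed
    # With no positives to judge, report 1.0 rather than divide by zero.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Low precision shows up as reviewer workload, and low recall shows up as PII reaching approved tables, which is why the governance claim depends on both.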
Sibling pattern at Databricks
concepts/agentic-data-classification (Databricks Unity Catalog) is the same structural shape — classifier + agentic verification — described at the architectural level. The Cloudflare Skimmer post is the wiki's first explicit two-pass implementation with the cost-rationale named.
Anti-patterns
- Single-pass per-column classification — misses opaque-ID PII; classifier output is treated as ground truth without table-level verification.
- Periodic full-table scans — too expensive to run as often as needed; misses fresh data between scans.
- Always-agentic — running the agentic classifier on every column would be cost-prohibitive; the per-column fast pass is the cost-control mechanism.
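The cost argument against the always-agentic anti-pattern can be written as simple arithmetic. In the sketch below, the unit costs and the flag rate are hypothetical parameters and not figures from the source. The two-pass cost is the fast pass on every column plus the agentic pass on the flagged fraction only.

```python
def two_pass_cost(n_columns: int, fast_cost: float,
                  agent_cost: float, flag_rate: float) -> float:
    """Fast pass on every column, agentic pass on the flagged fraction."""
    return n_columns * fast_cost + n_columns * flag_rate * agent_cost


def always_agentic_cost(n_columns: int, agent_cost: float) -> float:
    """Agentic pass on every column, with no fast pass in front."""
    return n_columns * agent_cost


def break_even_flag_rate(fast_cost: float, agent_cost: float) -> float:
    """Flag rate above which the two-pass design stops being cheaper."""
    return 1.0 - fast_cost / agent_cost
```

With made-up units where the agentic pass costs 50 times the fast pass, the two-pass design stays cheaper as long as fewer than 98% of columns are flagged. The fast pass is therefore worth keeping even if it is quite conservative.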
Seen in
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Skimmer's two-pass shape; the "agentic second pass that gets full table context and can query Trino directly to verify" phrasing.
Related
- systems/cloudflare-skimmer — canonical wiki instance.
- systems/cloudflare-town-lake — the platform.
- systems/workers-ai — Pass 1 substrate.
- systems/trino — Pass 2 verification substrate.
- systems/datahub — output destination.
- concepts/default-closed-table-allowlist — the governance posture this pattern is the load-bearing automation for.
- concepts/agentic-data-classification — sibling Databricks framing.
- patterns/default-closed-allowlist-with-automated-pii-scan — the broader pattern this fits inside.