Skimmer (Cloudflare PII detection scanner)¶
Skimmer is the PII detection scanner inside Town Lake, Cloudflare's data platform. It was introduced publicly in the 2026-05-28 launch post as the engine that drives Town Lake's default-closed governance.
What it does¶
"Skimmer is a PII detection scanner. It runs continuously, samples rows from every column in every table, and uses Workers AI to classify whether each column contains PII."
Scanning is continuous rather than one-shot: every column of every table is sampled periodically. The post stresses that default-closed governance depends on this. Tables stay inaccessible until reviewed, which is only operationally tractable if classification is automated, fast, and reasonably accurate.
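As a rough illustration of the "samples rows from every column" step, the sketch below draws a bounded random sample of rows and regroups it by column, so that each column can be classified on its own. The function name `sample_columns`, the sample size `k`, and the row-as-dict representation are assumptions made for this example; the post does not describe how Skimmer samples.

```python
import random
from typing import Any, Dict, List, Optional


def sample_columns(
    rows: List[Dict[str, Any]], k: int, seed: Optional[int] = None
) -> Dict[str, List[Any]]:
    """Pick up to k rows at random and regroup their values by column name.

    Hypothetical helper: Skimmer's real sampling strategy is not documented.
    """
    rng = random.Random(seed)
    # Use every row when the table is smaller than the requested sample.
    picked = rows if len(rows) <= k else rng.sample(rows, k)
    columns: Dict[str, List[Any]] = {}
    for row in picked:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [{"id": i, "email": f"u{i}@example.com"} for i in range(10)]
per_column = sample_columns(rows, k=3, seed=1)
```

Sampling keeps the cost of each scan bounded regardless of table size, which matters if every table is revisited repeatedly.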
Two-pass architecture¶
The structurally distinctive shape:
"It does this in two passes: first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify."
table created / connected
        │
        ▼
Pass 1: per-column fast classifier ([Workers AI](<./workers-ai.md>))
        │
        ├── nothing flagged ──► register as pending review (column-level)
        │
        └── flagged ──► Pass 2: agentic classifier
                          ├── reads full table context
                          ├── can query Trino directly to verify
                          └── findings → DataHub + Lifeguard allowlist (pending)
                                    │
                                    ▼
                          human reviewer: approve / override / deny
                                    │
                                    ▼
                          column unlocked OR explicitly denied
The structural argument: per-column classification has limited recall on opaque IDs that are only PII when joined (e.g., an internal ID that maps to a user via a separate lookup). The agentic pass uses table-level context to verify: it can run Trino queries to check distributions, sample values, or join against known dimension tables. The pattern is canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass.
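The control flow of the two passes can be sketched as follows. This is a minimal illustration, not Skimmer's code: the `ColumnFinding` type, the callable signatures, and the toy classifiers are all hypothetical, and the stand-ins make no calls to Workers AI or Trino. The only property taken from the post is that the second pass runs only when the first pass flags something, and that it sees the whole table rather than one column.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ColumnFinding:
    column: str
    flagged: bool
    label: str = "none"


# Hypothetical interfaces; the post does not describe Skimmer's internals.
FastClassifier = Callable[[str, List[str]], ColumnFinding]
AgenticVerifier = Callable[
    [str, Dict[str, List[str]], List[ColumnFinding]], List[ColumnFinding]
]


def scan_table(
    table: str,
    samples: Dict[str, List[str]],
    fast: FastClassifier,
    verify: AgenticVerifier,
) -> List[ColumnFinding]:
    # Pass 1: classify each column independently from its sampled values.
    first_pass = [fast(col, values) for col, values in samples.items()]
    flagged = [f for f in first_pass if f.flagged]
    if not flagged:
        return first_pass
    # Pass 2: runs only when something was flagged, and gets every column's
    # samples (table context) instead of a single column.
    verified = {f.column: f for f in verify(table, samples, flagged)}
    return [verified.get(f.column, f) for f in first_pass]


def toy_fast(column: str, values: List[str]) -> ColumnFinding:
    # Stand-in for the fast classifier: flags anything that looks like an email.
    hit = any("@" in v for v in values)
    return ColumnFinding(column, hit, "email" if hit else "none")


def toy_verify(table, samples, flagged):
    # Stand-in for the agentic pass, which per the post can query Trino.
    return [ColumnFinding(f.column, True, f.label + ":verified") for f in flagged]


findings = scan_table(
    "demo.users",
    {"contact": ["a@example.com"], "visits": ["3"]},
    toy_fast,
    toy_verify,
)
```

Keeping the expensive pass conditional means its cost is paid only for tables where the cheap pass already found something worth a closer look.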
The classifier coverage envelope, per the post:
"It catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users)."
Output destinations¶
Findings flow to two places:
- DataHub — the metadata catalog stores per-column classification labels alongside schema, lineage, and ownership.
- Lifeguard's allowlist — registered as pending until human review approves / overrides / denies.
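The fan-out to the two destinations can be pictured with a small sketch. Two plain dictionaries stand in for DataHub and for Lifeguard's allowlist; the function name `publish_finding`, the `table.column` key format, and the status string are assumptions for illustration and do not reflect either system's actual API.

```python
from typing import Dict


def publish_finding(
    table: str,
    column: str,
    label: str,
    catalog: Dict[str, str],
    allowlist: Dict[str, str],
) -> None:
    """Record one classification in both places (in-memory stand-ins)."""
    key = f"{table}.{column}"
    # Catalog side: keep the latest classification label for the column.
    catalog[key] = label
    # Allowlist side: new columns start as pending. Leaving an existing entry
    # untouched is a choice made for this sketch, not documented behaviour.
    allowlist.setdefault(key, "pending")


catalog: Dict[str, str] = {}
allowlist: Dict[str, str] = {}
publish_finding("demo.users", "email", "email", catalog, allowlist)
```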
Human-in-the-loop review workflow¶
"Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds."
The combination of a fast classifier and a fast review is what keeps default-closed governance from becoming operationally hostile. The post is explicit: "This sounds painful, and it would be, except for two things. First, it's automated... Second, the workflow is self-serve". Skimmer is half of that first point.
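A default-closed review step might look like the sketch below. The post names the three reviewer actions (approve, override, deny) and the two end states (unlocked or explicitly denied), but not how they map onto each other, so the mapping here is an assumption: approve unlocks with the detected label, override unlocks with a label supplied by the reviewer, and deny keeps the column closed. All type and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    OVERRIDE = "override"
    DENY = "deny"


@dataclass
class PendingColumn:
    column: str
    detected_label: str               # what the scanner reported
    status: str = "pending"           # "pending", "unlocked", or "denied"
    final_label: Optional[str] = None


def apply_review(
    entry: PendingColumn,
    decision: Decision,
    corrected_label: Optional[str] = None,
) -> PendingColumn:
    if entry.status != "pending":
        raise ValueError(f"{entry.column} was already reviewed")
    if decision is Decision.DENY:
        entry.status = "denied"
        entry.final_label = entry.detected_label
    elif decision is Decision.OVERRIDE:
        # Assumed meaning: the reviewer replaces the detected label.
        if corrected_label is None:
            raise ValueError("an override needs a corrected label")
        entry.status = "unlocked"
        entry.final_label = corrected_label
    else:
        entry.status = "unlocked"
        entry.final_label = entry.detected_label
    return entry


def can_read(entry: PendingColumn) -> bool:
    # Default-closed: only an explicit unlock grants access.
    return entry.status == "unlocked"
```

The property worth noticing is in `can_read`: a column that has never been reviewed is treated exactly like a denied one.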
Why classification has to be continuous¶
The post implies (without stating directly) why a one-shot scan isn't enough: schemas evolve, columns are added, and ETL writes new tables. "When a new database is connected to Trino or a new table is created, Skimmer scans it": the initial scan is triggered by a structural event rather than a timer, and Skimmer also keeps running afterwards, sampling existing tables for drift and new values.
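One way to combine the two triggers is to treat a table that has never been scanned as immediately due and to re-queue already-scanned tables after a fixed interval. The sketch below assumes exactly that; the function name `tables_to_scan` and the `rescan_every` parameter are hypothetical, and the post gives no rescan cadence.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def tables_to_scan(
    known_tables: Iterable[str],
    last_scanned: Dict[str, datetime],
    now: datetime,
    rescan_every: timedelta,
) -> List[str]:
    """Return tables that are new (never scanned) or due for a re-sample."""
    due = []
    for table in known_tables:
        seen = last_scanned.get(table)
        # New table: scan right away. Known table: scan once the interval passed.
        if seen is None or now - seen >= rescan_every:
            due.append(table)
    return due


now = datetime(2026, 6, 1, 12, 0)
history = {
    "a": now - timedelta(hours=2),
    "b": now - timedelta(minutes=5),
}
due = tables_to_scan(["a", "b", "c"], history, now, timedelta(hours=1))
```

Here `a` is due because its last scan is older than the interval, `c` because it has no scan history at all, and `b` is skipped.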
Seen in¶
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki source. Two-pass architecture, output destinations, review workflow.
Related¶
- systems/cloudflare-town-lake — the platform Skimmer scans.
- systems/cloudflare-lifeguard — Skimmer feeds the allowlist Lifeguard enforces.
- systems/workers-ai — the LLM-inference primitive Skimmer's classifier runs on.
- systems/datahub — the metadata catalog that stores Skimmer's findings alongside schema + lineage.
- systems/trino — the query engine the agentic second pass uses to verify findings against table data.
- concepts/default-closed-table-allowlist — the governance shape Skimmer is the load-bearing automation for.
- concepts/agentic-data-classification — the related Databricks framing (Unity Catalog Data Classification).
- patterns/two-pass-pii-classifier-with-agentic-second-pass — the canonical wiki pattern.
- patterns/default-closed-allowlist-with-automated-pii-scan — the broader governance pattern.
- companies/cloudflare