Skimmer (Cloudflare PII detection scanner)

Skimmer is the PII detection scanner inside Town Lake, Cloudflare's data platform. It was introduced publicly in the 2026-05-28 launch post as the engine that drives Town Lake's default-closed governance.

What it does

"Skimmer is a PII detection scanner. It runs continuously, samples rows from every column in every table, and uses Workers AI to classify whether each column contains PII."

Scanning is continuous rather than one-shot: every column of every table is sampled on an ongoing basis. The post stresses that default-closed governance depends on this. Tables stay inaccessible until reviewed, which is only operationally tractable if classification is automated, fast, and reasonably accurate.

Two-pass architecture

The structurally distinctive shape:

"It does this in two passes: first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify."

table created / connected
        │
        ▼
Pass 1: per-column fast classifier (Workers AI)
        │
        ├── nothing flagged ──► register as pending review (column-level)
        │
        └── flagged ──► Pass 2: agentic classifier
                              ├── reads full table context
                              ├── can query Trino directly to verify
                              └── findings → DataHub + Lifeguard allowlist (pending)
                                            │
                                            ▼
                              human reviewer: approve / override / deny
                                            │
                                            ▼
                              column unlocked OR explicitly denied

The structural argument: a per-column classifier sees each column in isolation, so it has limited recall on opaque IDs that are only PII when joined (e.g., an internal ID that maps to a user via a separate lookup). The agentic pass uses table-level context to verify what was flagged; it can run Trino queries to check distributions, sample values, or join against known dimension tables. Canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass.
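A minimal Python sketch of the two-pass control flow, under stated assumptions: `ColumnFinding`, `fast_classify`, `scan_table`, and `demo_verifier` are hypothetical names, the regex and `_id` suffix check merely stand in for the Workers AI classifier, and the verifier is a stub where the real second pass would query Trino. It illustrates only the shape (a cheap per-column pass first, a table-context pass only when something is flagged), not Skimmer's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Stand-in heuristic only; the post says pass 1 uses Workers AI, not a regex.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class ColumnFinding:
    column: str
    label: str
    verified: bool = False


def fast_classify(column: str, samples: List[str]) -> Optional[str]:
    """Pass 1 (hypothetical): judge one column from its sampled values alone."""
    if samples and all(EMAIL_RE.match(v) for v in samples):
        return "email"
    if column.endswith("_id"):
        return "opaque_id"  # suspicious, but not provable without table context
    return None


Verifier = Callable[[str, Dict[str, List[str]], ColumnFinding], ColumnFinding]


def scan_table(table: str,
               samples_by_column: Dict[str, List[str]],
               verify: Verifier) -> List[ColumnFinding]:
    """Run pass 1 on every column; run pass 2 only if anything was flagged."""
    flagged = []
    for column, samples in samples_by_column.items():
        label = fast_classify(column, samples)
        if label is not None:
            flagged.append(ColumnFinding(column, label))
    if not flagged:
        return []  # nothing for the second pass to look at
    # Pass 2 sees the whole table, not just the flagged column.
    return [verify(table, samples_by_column, f) for f in flagged]


def demo_verifier(table: str,
                  samples_by_column: Dict[str, List[str]],
                  finding: ColumnFinding) -> ColumnFinding:
    """Stub for the agentic pass; a real one could issue Trino queries here."""
    if finding.label == "opaque_id":
        # Toy table-context rule: an ID next to an email column is traceable.
        finding.verified = "email" in samples_by_column
    else:
        finding.verified = True
    return finding


findings = scan_table(
    "db.accounts",
    {"email": ["a@example.com"], "account_id": ["x1"], "bytes": ["10"]},
    demo_verifier,
)
```

Passing the verifier in as a callable keeps the sketch testable without a query engine; in a real deployment that seam is where the Trino-backed agent would sit.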

The classifier coverage envelope, per the post:

"It catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users)."

Output destinations

Findings flow to two places:

DataHub — the metadata catalog stores per-column classification labels alongside schema, lineage, and ownership.
Lifeguard's allowlist — registered as pending until human review approves / overrides / denies.
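The fan-out to two destinations can be sketched as follows. This is an illustration under assumptions, not the DataHub or Lifeguard API: `Finding`, `InMemoryCatalog`, `InMemoryAllowlist`, and `publish` are hypothetical in-memory stand-ins, and the choice to never overwrite an existing allowlist entry on re-scan is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

ColumnKey = Tuple[str, str]  # (table, column)


@dataclass(frozen=True)
class Finding:
    table: str
    column: str
    label: str


class InMemoryCatalog:
    """Hypothetical stand-in for the metadata catalog (DataHub in the post)."""

    def __init__(self) -> None:
        self.labels: Dict[ColumnKey, str] = {}

    def set_label(self, finding: Finding) -> None:
        self.labels[(finding.table, finding.column)] = finding.label


class InMemoryAllowlist:
    """Hypothetical stand-in for Lifeguard's allowlist."""

    def __init__(self) -> None:
        self.status: Dict[ColumnKey, str] = {}

    def register_pending(self, finding: Finding) -> None:
        # Assumption: a re-scan must not clobber a decision a human already made.
        self.status.setdefault((finding.table, finding.column), "pending")


def publish(findings: Iterable[Finding],
            catalog: InMemoryCatalog,
            allowlist: InMemoryAllowlist) -> None:
    """Send every finding to both destinations."""
    for finding in findings:
        catalog.set_label(finding)
        allowlist.register_pending(finding)


catalog, allowlist = InMemoryCatalog(), InMemoryAllowlist()
publish([Finding("db.users", "email", "email")], catalog, allowlist)
```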

Human-in-the-loop review workflow

"Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds."

The combination of a fast classifier and a fast review is what keeps default-closed governance from becoming operationally hostile. The post is explicit: "This sounds painful, and it would be, except for two things. First, it's automated... Second, the workflow is self-serve". Skimmer is half of that "first" point, the automation.
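One way to model the three reviewer decisions is a small state machine. The post names the decisions but not their exact semantics, so the mapping here is an assumption: approve accepts the detected label, override substitutes the reviewer's label, and both unlock the column, while deny leaves it explicitly denied. `Status`, `ReviewItem`, and `review` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    UNLOCKED = "unlocked"
    DENIED = "denied"


@dataclass
class ReviewItem:
    column: str
    detected_label: str
    status: Status = Status.PENDING
    final_label: Optional[str] = None


def review(item: ReviewItem, decision: str,
           label: Optional[str] = None) -> ReviewItem:
    """Apply one reviewer decision to a pending item (assumed semantics)."""
    if item.status is not Status.PENDING:
        raise ValueError(f"{item.column} has already been reviewed")
    if decision == "approve":
        item.final_label = item.detected_label
        item.status = Status.UNLOCKED
    elif decision == "override":
        if label is None:
            raise ValueError("override requires the reviewer's label")
        item.final_label = label
        item.status = Status.UNLOCKED
    elif decision == "deny":
        item.status = Status.DENIED
    else:
        raise ValueError(f"unknown decision: {decision}")
    return item


item = review(ReviewItem("account_id", "opaque_id"), "override", label="not_pii")
```

Each decision is a single call with at most one extra argument, which matches the claim that most reviews take seconds.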

Why classification has to be continuous

The post implies (without stating directly) why a one-shot scan isn't enough: schemas evolve, columns are added, and ETL jobs write new tables. "When a new database is connected to Trino or a new table is created, Skimmer scans it": the initial trigger is structural rather than periodic, while the scanner also keeps running continuously, sampling existing tables for drift and new values.
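The two triggers, scan on creation plus ongoing re-sampling, can be sketched with a small scheduler. The post does not describe how Skimmer schedules work, so everything here is an assumption: `ScanScheduler`, `on_table_created`, `tick`, and the fixed re-scan interval are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List


class ScanScheduler:
    """Hypothetical scheduler: scan new tables at once, re-scan old ones later."""

    def __init__(self, scan: Callable[[str], None],
                 rescan_interval_s: float) -> None:
        self._scan = scan
        self._interval = rescan_interval_s
        self._last_scanned: Dict[str, float] = {}

    def on_table_created(self, table: str, now: float) -> None:
        # Structural trigger: a new or newly connected table is scanned right away.
        self._scan(table)
        self._last_scanned[table] = now

    def tick(self, now: float) -> List[str]:
        # Continuous part: re-scan any table whose last scan is too old.
        rescanned = []
        for table, last in list(self._last_scanned.items()):
            if now - last >= self._interval:
                self._scan(table)
                self._last_scanned[table] = now
                rescanned.append(table)
        return rescanned


scanned: List[str] = []
scheduler = ScanScheduler(scanned.append, rescan_interval_s=3600)
scheduler.on_table_created("db.users", now=0)
scheduler.tick(now=3600)
```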

Seen in

sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki source. Two-pass architecture, output destinations, review workflow.

systems/cloudflare-town-lake — the platform Skimmer scans.
systems/cloudflare-lifeguard — Skimmer feeds the allowlist Lifeguard enforces.
systems/workers-ai — the LLM-inference primitive Skimmer's classifier runs on.
systems/datahub — the metadata catalog that stores Skimmer's findings alongside schema + lineage.
systems/trino — the query engine the agentic second pass uses to verify findings against table data.
concepts/default-closed-table-allowlist — the governance shape Skimmer is the load-bearing automation for.
concepts/agentic-data-classification — the related Databricks framing (Unity Catalog Data Classification).
patterns/two-pass-pii-classifier-with-agentic-second-pass — the canonical wiki pattern.
patterns/default-closed-allowlist-with-automated-pii-scan — the broader governance pattern.
companies/cloudflare