SYSTEM Cited by 1 source

Skimmer (Cloudflare PII detection scanner)

Skimmer is the PII detection scanner inside Town Lake, Cloudflare's data platform. It was introduced publicly in the 2026-05-28 launch post as the engine that drives Town Lake's default-closed governance.

What it does

"Skimmer is a PII detection scanner. It runs continuously, samples rows from every column in every table, and uses Workers AI to classify whether each column contains PII."

Continuous (not one-shot) — every table, every column, sampled periodically. The post stresses that default-closed governance depends on this: tables stay inaccessible until reviewed, which is only operationally tractable if classification is automated, fast, and reasonably accurate.

Two-pass architecture

The structurally distinctive shape:

"It does this in two passes: first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify."

table created / connected
        │
        ▼
Pass 1: per-column fast classifier ([Workers AI](<./workers-ai.md>))
        ├── nothing flagged ──► register as pending review (column-level)
        └── flagged ──► Pass 2: agentic classifier
                              ├── reads full table context
                              ├── can query Trino directly to verify
                              └── findings → DataHub + Lifeguard allowlist (pending)
                                        │
                                        ▼
                              human reviewer: approve / override / deny
                                        │
                                        ▼
                              column unlocked OR explicitly denied

The structural argument: per-column classification has limited recall on opaque IDs that are only PII when joined (e.g., an internal ID that maps to a user via a separate lookup). The agentic pass takes table-level context to verify — it can run Trino queries to check distribution, sample values, or join against known dimension tables. Canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass.
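The control flow of the two passes can be sketched as follows. This is a minimal illustration, not Skimmer's implementation: `Finding`, `scan_table`, `toy_fast`, and `toy_verify` are hypothetical names, the fast classifier is a trivial stand-in for the Workers AI call, and the verifier is a placeholder for the agentic pass (which, per the post, would use full table context and Trino queries).

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Samples = Dict[str, List[str]]  # column name -> sampled values

@dataclass(frozen=True)
class Finding:
    column: str
    flagged: bool
    label: str = "none"
    verified: bool = False  # set only by the second pass

FastClassifier = Callable[[str, List[str]], Finding]
AgenticVerifier = Callable[[Samples, List[Finding]], List[Finding]]

def scan_table(samples: Samples, fast: FastClassifier,
               verify: AgenticVerifier) -> List[Finding]:
    # Pass 1: classify each column independently from its samples.
    first_pass = [fast(col, vals) for col, vals in samples.items()]
    # Pass 2 runs only if anything was flagged; it sees the whole table.
    if not any(f.flagged for f in first_pass):
        return first_pass
    return verify(samples, first_pass)

# Toy stand-ins (assumptions for illustration only).
def toy_fast(column: str, values: List[str]) -> Finding:
    if any("@" in v for v in values):
        return Finding(column, True, "email")
    return Finding(column, False)

def toy_verify(samples: Samples, findings: List[Finding]) -> List[Finding]:
    # Placeholder: a real second pass would inspect table context and
    # run verification queries before confirming each flag.
    return [replace(f, verified=True) if f.flagged else f for f in findings]

results = scan_table(
    {"user_email": ["a@example.com"], "n": ["1"]}, toy_fast, toy_verify
)
```

Passing the classifier and verifier as callables keeps the expensive second pass swappable and makes it easy to confirm that it is skipped when the first pass flags nothing.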

The classifier coverage envelope, per the post:

"It catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users)."

Output destinations

Findings flow to two places:

  • DataHub — the metadata catalog stores per-column classification labels alongside schema, lineage, and ownership.
  • Lifeguard's allowlist — registered as pending until human review approves / overrides / denies.

Human-in-the-loop review workflow

"Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds."

The combination of a fast classifier and a fast review is what keeps default-closed governance from becoming operationally hostile. The post is explicit: "This sounds painful, and it would be, except for two things. First, it's automated... Second, the workflow is self-serve." Skimmer accounts for half of that "first" point (the automation).
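A default-closed access check over the review states might look like the sketch below. It assumes, following the flow diagram, that approve and override both unlock a column and that deny keeps it locked; `ReviewState` and `is_readable` are hypothetical names, and the dictionary stands in for Lifeguard's allowlist, whose real interface the post does not describe.

```python
from enum import Enum
from typing import Dict

class ReviewState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    OVERRIDDEN = "overridden"
    DENIED = "denied"

def is_readable(allowlist: Dict[str, ReviewState], column: str) -> bool:
    # Default-closed: a column with no entry is treated as pending.
    state = allowlist.get(column, ReviewState.PENDING)
    # Assumption: only a completed, non-denying review unlocks access.
    return state in (ReviewState.APPROVED, ReviewState.OVERRIDDEN)

allowlist = {
    "users.email": ReviewState.PENDING,
    "users.country": ReviewState.APPROVED,
    "users.token": ReviewState.DENIED,
}
```

The important property is the default in `allowlist.get`: a column that has never been reviewed is indistinguishable from one awaiting review, so nothing becomes readable by omission.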

Why classification has to be continuous

The post implies, without stating it directly, why a one-shot scan is not enough: schemas evolve, columns are added, and ETL jobs write new tables. "When a new database is connected to Trino or a new table is created, Skimmer scans it." The trigger for a first scan is therefore structural rather than periodic, while the runtime is also continuous, re-sampling existing tables for drift and new values.
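One way to picture the structural trigger is as a diff between what has already been scanned and what the catalog currently contains. The sketch below is an assumption about how such a diff could work, not a description of Skimmer's mechanism; `columns_needing_scan` and both input mappings are hypothetical.

```python
from typing import Dict, Set

def columns_needing_scan(
    scanned: Dict[str, Set[str]],   # table -> columns already classified
    catalog: Dict[str, Set[str]],   # table -> columns currently present
) -> Dict[str, Set[str]]:
    todo: Dict[str, Set[str]] = {}
    for table, columns in catalog.items():
        # A new table has no scanned columns, so all of its columns qualify.
        new_columns = columns - scanned.get(table, set())
        if new_columns:
            todo[table] = new_columns
    return todo

scanned = {"db.users": {"id", "email"}}
catalog = {
    "db.users": {"id", "email", "phone"},  # column added since last scan
    "db.orders": {"order_id"},             # table created since last scan
}
pending = columns_needing_scan(scanned, catalog)
```

This covers only the structural half of the argument. Re-sampling already-classified columns for drift would need a separate schedule, since their names do not change when their contents do.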
