PATTERN Cited by 1 source

Default-closed allowlist with automated PII scan

This is the implementation pattern that makes default-closed table allowlisting operationally tractable. New tables enter the data platform in a pending state; an automated PII / data-classification scanner runs against every column; its findings populate an allowlist; a human reviewer approves, overrides, or denies each finding; only then is the table queryable. Three substrates collaborate to make this work.

Cloudflare Town Lake is the canonical wiki instance, from the 2026-05-28 launch post.

Three-substrate decomposition

Substrate | Role | Town Lake instance
Automated classifier | Continuously scans every column and classifies it for PII / sensitive data | Skimmer
Allowlist registry | Stores per-table / per-column review state (pending / approved / denied) and feeds the engine's policy | Lifeguard (D1-backed; JSON policy served to Trino over HTTP)
Human review workflow | UI for reviewers to approve / override / deny detected classifications | Not named separately in the post; likely a thin Worker-backed UI on top of D1
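
To make the registry's role concrete, here is a minimal sketch of how per-column review state could be rendered into a default-closed JSON policy. The ColumnReview record, the render_policy and is_queryable helpers, and the JSON shape are all hypothetical; the post does not describe Lifeguard's actual schema or the format Trino consumes.

```python
import json
from dataclasses import dataclass

# Review states named in the substrate table.
PENDING, APPROVED, DENIED = "pending", "approved", "denied"


@dataclass(frozen=True)
class ColumnReview:
    """Hypothetical registry row: one review record per column."""
    table: str
    column: str
    state: str           # PENDING, APPROVED, or DENIED
    classification: str  # illustrative label, e.g. "PII:email" or "none"


def render_policy(reviews):
    """Build a default-closed policy: only approved columns are listed."""
    tables = {}
    for r in reviews:
        if r.state == APPROVED:
            tables.setdefault(r.table, {})[r.column] = r.classification
    # Anything absent from "tables" is denied by default.
    return json.dumps({"default": "deny", "tables": tables}, sort_keys=True)


def is_queryable(policy_json, table, column):
    """A column is queryable only if the policy lists it explicitly."""
    policy = json.loads(policy_json)
    return column in policy["tables"].get(table, {})
```

The point of the shape is that pending and denied columns never appear in the served policy, so a missing review and a failed review look identical to the query engine: both are closed.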

End-to-end flow

new database connected to Trino    OR    new table created
                  │                                │
                  └────────┬───────────────────────┘
              Skimmer scan (continuous)
                  ├── per-column fast classifier (Workers AI)
                  │   ├── nothing flagged → register column (pending)
                  │   └── flagged → agentic 2nd pass with table context
                  │                          ↓
                  │                   findings → DataHub
                  │                              ↓
                  │                   register findings → Lifeguard allowlist (pending)
        Reviewer UI shows pending classifications
                  ├── approve → column unlocked
                  ├── override → reclassify, then approve / deny
                  └── deny → column permanently locked
        Lifeguard policy refreshed
        Trino reads policy over HTTP on next query
        Column-level access enforced; Skipper agent
        sees the same allowlist for self-serve UX
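
The reviewer branch of the flow above can be read as a small state machine. The sketch below is one way to model it, assuming that decisions are only taken from the pending state and that both approve and deny are final; the post only states that a denied column is permanently locked, so treating approval as final is an assumption, as are the class and method names.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


class Column:
    """Hypothetical allowlist entry for a single column."""

    def __init__(self, name, classification=None):
        self.name = name
        self.classification = classification  # scanner's finding, if any
        self.state = State.PENDING            # every column starts closed

    def _require_pending(self):
        if self.state is not State.PENDING:
            raise ValueError(f"{self.name} already reviewed: {self.state.value}")

    def approve(self):
        """Reviewer accepts the classification; the column is unlocked."""
        self._require_pending()
        self.state = State.APPROVED

    def deny(self):
        """Reviewer rejects access; the column stays locked."""
        self._require_pending()
        self.state = State.DENIED

    def override(self, new_classification):
        """Reviewer corrects the label; a decision is still required."""
        self._require_pending()
        self.classification = new_classification
```

Note that override does not change the state: as in the diagram, a reclassified column still needs an explicit approve or deny.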

What makes this tractable (vs operationally hostile)

The post is explicit:

"This sounds painful, and it would be, except for two things. First, it's automated. Skimmer's classifier is reasonably good: it catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users). Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds.

Second, the workflow is self-serve. If you query a table you don't have access to, the error message is not 'permission denied.' It's 'this table needs review, click here to request one.' Skipper, the AI agent, will even suggest the right RBAC group to request and link you straight to it."

Three load-bearing affordances:

  1. Classifier recall is reasonably good — without that, every review becomes a manual classification effort, not a sign-off.
  2. Reviews take seconds — the bottleneck isn't review throughput.
  3. Self-serve permission requests — the error-as-request pattern converts denials into a workflow, not a wall.
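
The error-as-request idea in the third affordance can be sketched as a function that maps a denial to a next step. Everything here is illustrative: the REVIEW_URL, the denial_message helper, the suggested_group parameter, and the fallback wording are assumptions, not details from the post.

```python
from urllib.parse import quote

# Hypothetical base URL for the review-request page; not from the post.
REVIEW_URL = "https://lifeguard.example/review"


def denial_message(table, review_state, suggested_group=None):
    """Turn an access denial into a next step rather than a dead end."""
    if review_state == "pending":
        link = f"{REVIEW_URL}?table={quote(table, safe='')}"
        return f"Table {table} needs review. Request one here: {link}"
    if review_state == "approved" and suggested_group:
        # The table is reviewed; the user just lacks the right group.
        return (f"Table {table} is reviewed. "
                f"Request access via RBAC group {suggested_group}.")
    # Assumed fallback for cases the post does not describe.
    return f"Table {table} is not available. Contact the data platform team."
```

For example, `denial_message("crm.users", "pending")` produces a message containing a request link instead of a bare "permission denied".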

Schema discovery is decoupled from access

A subtle but load-bearing affordance: users can see which tables exist even before approval, but unreviewed columns are hidden from DESCRIBE / SHOW COLUMNS / SELECT *. New columns can therefore land without breaking dashboards. Canonicalised at concepts/schema-discovery-vs-data-access-separation. Without this property, every schema migration would break consumers until the review queue had been flushed.
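
A minimal sketch of that separation, assuming a hypothetical in-memory catalog that maps each table to its columns' review states. The function names are illustrative and this is not how Trino or Lifeguard implement the filtering.

```python
def list_tables(catalog):
    """Table names are discoverable regardless of review state."""
    return sorted(catalog)


def visible_columns(catalog, table):
    """Only approved columns appear in DESCRIBE-style output."""
    return [col for col, state in catalog[table].items() if state == "approved"]


def expand_select_star(catalog, table):
    """Rewrite SELECT * to the approved column list only."""
    cols = visible_columns(catalog, table)
    if not cols:
        raise PermissionError(f"table {table} needs review")
    return f"SELECT {', '.join(cols)} FROM {table}"
```

Because a newly added column starts as pending, it is simply absent from the expansion, so an existing `SELECT *` dashboard keeps returning the same columns until a reviewer approves the new one.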

Two-pass classifier as the recall-vs-cost dial

The Skimmer-specific implementation choice — "first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify" — is canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass. It's the recall-vs-cost dial for the automated layer:

  • Fast per-column classifier alone has limited recall on opaque IDs that are only PII when joined.
  • Agentic full-context second pass closes the recall gap but is expensive, so it runs only when the fast pass flags something.
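
The control flow of the two passes can be sketched as follows. This is a toy under stated assumptions: a regex stands in for the fast per-column classifier (the post uses a Workers AI model), toy_second_pass stands in for the agentic pass, and the 0.5 threshold and the "PII:linked-id" label are invented for illustration.

```python
import re

# Stand-in for the fast classifier: a crude email shape check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def fast_pass(samples):
    """Cheap per-column check: flag if at least half the samples look like emails."""
    if not samples:
        return False
    hits = sum(1 for v in samples if EMAIL_RE.match(v))
    return hits / len(samples) >= 0.5


def classify_table(columns, second_pass):
    """columns maps column name -> sampled values.

    second_pass(flagged, columns) returns {column: label}; it is called
    only when the fast pass flags something, and it sees the whole table.
    """
    flagged = [name for name, samples in columns.items() if fast_pass(samples)]
    labels = second_pass(flagged, columns) if flagged else {}
    return {name: labels.get(name) for name in columns}


def toy_second_pass(flagged, columns):
    """Hypothetical full-context pass: with the whole table in view, it can
    also mark ID columns that become identifying next to an email column."""
    labels = {name: "PII:email" for name in flagged}
    for name in columns:
        if name.endswith("_id") and name not in labels:
            labels[name] = "PII:linked-id"
    return labels
```

In this sketch a table with an email column gets its `user_id` labelled too, because the second pass has table context. A table where the fast pass flags nothing never triggers the expensive pass, which is exactly the cost side of the dial.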

Composes with column-level governance

The pattern naturally composes with column-level masking and ABAC tag-driven policies (see patterns/tag-driven-attribute-based-access-control for the Databricks instance). Per-column classification makes per-column masking expressible:

  • Skimmer marks users.email as PII:email.
  • Lifeguard's policy: CASE WHEN user.has(grant:raw-pii) THEN email ELSE mask(email) END.
  • Trino enforces at query time.

Combined with opt-in PII redaction per session, this is the column-level counterpart to the table-level default-closed allowlist.
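
The CASE expression in the list above can be mirrored in a few lines to show how a per-column tag drives masking. The mask function (a truncated SHA-256 digest), the "raw-pii" grant name, and the project helper are assumptions for illustration; the post does not specify the masking function or how Trino evaluates the policy.

```python
import hashlib


def mask(value):
    """Hypothetical mask: a short, stable digest in place of the raw value."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked:" + digest[:8]


def project(row, tags, grants):
    """Return the row as a given user would see it.

    tags maps column -> classification label (e.g. "PII:email");
    grants is the set of grants the querying user holds.
    """
    out = {}
    for col, val in row.items():
        tag = tags.get(col)
        if tag and tag.startswith("PII:") and "raw-pii" not in grants:
            out[col] = mask(val)   # ELSE branch of the CASE expression
        else:
            out[col] = val         # untagged column, or user holds the grant
    return out
```

A stable digest keeps masked values joinable and countable without exposing the raw data; whether that trade-off is acceptable depends on the classification, which is why the decision is keyed on the tag rather than hard-coded per table.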

Anti-patterns this replaces

  • "Open by default, scan asynchronously, lock down later" — sensitive data leaks before the scan completes; scanning becomes evidence-of-compromise rather than prevention.
  • "Manual classification" — reviewers spend more time classifying than approving; throughput collapses; teams skip reviews; coverage drops.
  • "Permission denied with no path forward" — users hit walls, file tickets, lose trust in the platform, route around it.

Seen in
