PATTERN Cited by 1 source
Default-closed allowlist with automated PII scan
The implementation pattern that makes default-closed table allowlisting operationally tractable. New tables enter the data platform in a pending state; an automated PII / data-classification scanner runs against every column; findings populate an allowlist; a human reviewer approves / overrides / denies; only then is the table queryable. Three substrates collaborate.
Cloudflare Town Lake is the canonical wiki instance, from the 2026-05-28 launch post.
Three-substrate decomposition
| Substrate | Role | Town Lake instance |
|---|---|---|
| Automated classifier | Continuously scans every column, classifies for PII / sensitive data | Skimmer |
| Allowlist registry | Stores per-table / per-column review state (pending / approved / denied), feeds the engine's policy | Lifeguard (D1-backed, JSON policy served to Trino over HTTP) |
| Human review workflow | UI for reviewers to approve / override / deny detected classifications | (not named separately in the post — likely a thin Worker-backed UI on top of D1) |
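To make the registry row concrete, here is a minimal Python sketch of how per-column review state could be rendered into a default-closed JSON policy. The `ColumnReview` type, the field names, and the JSON layout are all assumptions made for illustration; the post does not describe Lifeguard's actual schema or the policy format it serves to Trino.

```python
import json
from dataclasses import dataclass

# Review states named in the table above.
PENDING, APPROVED, DENIED = "pending", "approved", "denied"


@dataclass(frozen=True)
class ColumnReview:
    """Hypothetical registry row: one review decision per column."""
    table: str
    column: str
    state: str  # one of PENDING / APPROVED / DENIED


def build_policy(reviews):
    """Render review rows into a default-closed policy document.

    Only APPROVED columns are listed. Anything absent from the
    document (pending, denied, or never scanned) is not queryable.
    """
    tables = {}
    for r in reviews:
        if r.state == APPROVED:
            tables.setdefault(r.table, []).append(r.column)
    return {
        "default": "deny",
        "tables": [
            {"table": t, "columns": sorted(cols)}
            for t, cols in sorted(tables.items())
        ],
    }


reviews = [
    ColumnReview("users", "id", APPROVED),
    ColumnReview("users", "email", PENDING),
    ColumnReview("billing", "card_hash", DENIED),
]
policy = build_policy(reviews)
print(json.dumps(policy, indent=2))
```

The design point is that the policy is an allowlist: the `billing` table does not appear at all because none of its columns are approved, so a consumer that only knows the document cannot accidentally grant access.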
End-to-end flow
    new database connected to Trino  OR  new table created
        │                                    │
        └────────┬───────────────────────────┘
                 ▼
        Skimmer scan (continuous)
                 │
                 ├── per-column fast classifier (Workers AI)
                 │     ├── nothing flagged → register column (pending)
                 │     └── flagged → agentic 2nd pass with table context
                 │               ↓
                 │           findings → DataHub
                 │               ↓
                 │           register findings → Lifeguard allowlist (pending)
                 ▼
        Reviewer UI shows pending classifications
                 │
                 ├── approve → column unlocked
                 ├── override → reclassify, then approve / deny
                 └── deny → column permanently locked
                 ▼
        Lifeguard policy refreshed
                 ▼
        Trino reads policy over HTTP on next query
                 ▼
        Column-level access enforced; Skipper agent
        sees the same allowlist for self-serve UX
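The reviewer branch of the flow above can be read as a small state machine. The sketch below is one possible encoding in Python, not the platform's implementation: the `State` enum, the `TRANSITIONS` table, and the function names are hypothetical, and treating a denied column as terminal is an interpretation of "permanently locked".

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


# Reviewer actions from the flow. "override" changes the detected
# classification but leaves the column pending until a follow-up
# approve or deny. DENIED has no outgoing transitions.
TRANSITIONS = {
    (State.PENDING, "approve"): State.APPROVED,
    (State.PENDING, "deny"): State.DENIED,
    (State.PENDING, "override"): State.PENDING,
}


def apply_action(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} not allowed from state {state.value!r}"
        ) from None


def is_queryable(state):
    """Default-closed: only an explicit approval unlocks a column."""
    return state is State.APPROVED


state = State.PENDING
state = apply_action(state, "override")  # reclassified, still pending
state = apply_action(state, "approve")
print(state.value, is_queryable(state))
```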
What makes this tractable (vs operationally hostile)
The post is explicit:
"This sounds painful, and it would be, except for two things. First, it's automated. Skimmer's classifier is reasonably good: it catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users). Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds.
Second, the workflow is self-serve. If you query a table you don't have access to, the error message is not 'permission denied.' It's 'this table needs review, click here to request one.' Skipper, the AI agent, will even suggest the right RBAC group to request and link you straight to it."
Three load-bearing affordances:
- Classifier recall is reasonably good — without that, every review becomes a manual classification effort, not a sign-off.
- Reviews take seconds — the bottleneck isn't review throughput.
- Self-serve permission requests — the error-as-request pattern converts denials into a workflow, not a wall.
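As a rough illustration of the error-as-request affordance, the Python sketch below builds an access error that carries its own next step. The base URL, the query parameter names, and the `access_error` function are placeholders invented for this example; the post only describes the wording of the message and the fact that Skipper suggests an RBAC group.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real review-request URL is not given in the post.
REVIEW_BASE_URL = "https://example.invalid/review/request"


def access_error(table, user, suggested_group=None):
    """Build an error message that doubles as a review request link."""
    query = urlencode({"table": table, "user": user})
    msg = (
        f"Table '{table}' needs review. "
        f"Request one here: {REVIEW_BASE_URL}?{query}"
    )
    if suggested_group:
        msg += f" (suggested RBAC group: {suggested_group})"
    return msg


print(access_error("analytics.events", "alice", suggested_group="analytics-readers"))
```

The useful property is that the denial and the remediation travel together, so the user never has to discover the review process out of band.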
Schema discovery is decoupled from access
A subtle but load-bearing affordance: users can see what tables
exist even before approval, but unreviewed columns are hidden
from DESCRIBE / SHOW COLUMNS / SELECT *. New columns can
land without breaking dashboards. Canonicalised at
concepts/schema-discovery-vs-data-access-separation — without
this property, every schema migration would force a review-queue
flush before consumers stopped breaking.
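A small Python sketch of why hiding unreviewed columns keeps dashboards stable: if `SELECT *` expands only to approved columns, a newly landed pending column does not change what existing consumers receive. The function names and the dict-based review state are assumptions for illustration, not how Trino or Lifeguard implement this.

```python
def visible_columns(all_columns, review_state):
    """Columns a DESCRIBE-style listing would show: approved ones only."""
    return [c for c in all_columns if review_state.get(c) == "approved"]


def expand_select_star(table, all_columns, review_state):
    """Expand SELECT * against the visible columns, not the physical schema."""
    cols = visible_columns(all_columns, review_state)
    if not cols:
        raise PermissionError(f"no reviewed columns in {table}")
    return f"SELECT {', '.join(cols)} FROM {table}"


state = {"id": "approved", "country": "approved"}
before = expand_select_star("users", ["id", "country"], state)

# A migration adds a column; it enters the registry as pending.
state["phone"] = "pending"
after = expand_select_star("users", ["id", "country", "phone"], state)

print(before == after)  # True: the new column is invisible until reviewed
```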
Two-pass classifier as the recall-vs-cost dial
The Skimmer-specific implementation choice — "first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify" — is canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass. It's the recall-vs-cost dial for the automated layer:
- Fast per-column classifier alone has limited recall on opaque IDs that are only PII when joined.
- Agentic full-context second pass closes the recall gap but is expensive — only run for flagged columns.
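The control flow of the two passes can be sketched as follows. This is a simplified Python stand-in, not Skimmer's classifier: the regexes, the name hints, and every label other than `PII:email` are invented for the example, and the agentic second pass is represented by a caller-supplied `second_pass` function that receives the whole table as context.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IPV4_RE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")
NAME_HINTS = ("token", "user_id")  # hypothetical hints for non-obvious data


def fast_pass(column_name, samples):
    """Cheap per-column check on sample values and the column name."""
    values = [s for s in samples if s]
    if values and all(EMAIL_RE.match(v) for v in values):
        return "PII:email"
    if values and all(IPV4_RE.match(v) for v in values):
        return "PII:ip"
    if any(hint in column_name.lower() for hint in NAME_HINTS):
        return "sensitive:identifier"
    return None


def classify_table(columns, second_pass):
    """Run the fast pass everywhere; escalate only flagged columns.

    `columns` maps column name -> list of sample values.
    `second_pass(name, label, columns)` is the expensive step and
    returns the confirmed (or corrected) label.
    """
    findings = {}
    for name, samples in columns.items():
        label = fast_pass(name, samples)
        if label is not None:
            label = second_pass(name, label, columns)
        findings[name] = label
    return findings


escalated = []


def stub_second_pass(name, label, columns):
    # Stand-in for the agentic pass: records the call, keeps the label.
    escalated.append(name)
    return label


table = {"email": ["a@b.co"], "count": ["3", "7"], "api_token": ["cf_abc"]}
findings = classify_table(table, stub_second_pass)
print(findings, escalated)
```

Cost scales with the number of flagged columns rather than the number of columns scanned, which is the dial the section describes.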
Composes with column-level governance
The pattern naturally composes with column-level masking and ABAC tag-driven policies (see patterns/tag-driven-attribute-based-access-control for the Databricks instance). Per-column classification makes per-column masking expressible:
- Skimmer marks `users.email` as `PII:email`.
- Lifeguard's policy: `CASE WHEN user.has(grant:raw-pii) THEN email ELSE mask(email) END`.
- Trino enforces at query time.
Combined with opt-in PII redaction per session, this is the column-level counterpart to the table-level default-closed allowlist.
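The masking rule in the list above can be mimicked outside the query engine to show its behavior. The Python sketch below is an illustration only: the `mask` function (a truncated SHA-256 digest), the `project_row` helper, and the set-of-strings grant model are assumptions, since the post does not say how masking is implemented.

```python
import hashlib


def mask(value):
    """Hypothetical mask: a short, stable digest in place of the raw value."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"masked:{digest[:12]}"


def project_row(row, column_tags, user_grants):
    """Apply per-column masking based on classification tags and grants.

    Mirrors the CASE expression: holders of the raw-pii grant see the
    raw value, everyone else sees the masked value.
    """
    out = {}
    for col, val in row.items():
        tag = column_tags.get(col)
        if tag and tag.startswith("PII:") and "raw-pii" not in user_grants:
            out[col] = mask(val)
        else:
            out[col] = val
    return out


row = {"id": "42", "email": "a@example.com"}
tags = {"email": "PII:email"}

print(project_row(row, tags, {"raw-pii"}))  # raw values
print(project_row(row, tags, set()))        # email masked, id untouched
```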
Anti-patterns this replaces
- "Open by default, scan asynchronously, lock down later" — sensitive data leaks before the scan completes; scanning becomes evidence-of-compromise rather than prevention.
- "Manual classification" — reviewers spend more time classifying than approving; throughput collapses; teams skip reviews; coverage drops.
- "Permission denied with no path forward" — users hit walls, file tickets, lose trust in the platform, route around it.
Seen in
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Skimmer + Lifeguard + Trino + Skipper (the front-door check) as the four-component implementation.
Related
- concepts/default-closed-table-allowlist — the architectural posture this pattern implements.
- concepts/schema-discovery-vs-data-access-separation — the affordance that keeps the pattern operationally sustainable.
- systems/cloudflare-town-lake — the platform.
- systems/cloudflare-skimmer — the classifier.
- systems/cloudflare-lifeguard — the allowlist registry.
- patterns/two-pass-pii-classifier-with-agentic-second-pass — the recall-vs-cost classifier shape.
- patterns/error-message-as-self-serve-permission-request — the self-serve UX pattern.
- patterns/tag-driven-attribute-based-access-control — sibling Databricks pattern for column-level masking.