Skip to content

CLOUDFLARE 2026-05-28 Tier 1

Read original ↗

Cloudflare — How we built Cloudflare's data platform and an AI agent on top of it

Summary

Cloudflare engineering post (2026-05-28) describing the two in-house systems they built to consolidate analytics access across the company: Town Lake — a data lakehouse that exposes "a single SQL interface to everything Cloudflare knows" — and Skipper, an AI data agent that turns natural-language questions into validated SQL against Town Lake. The system is built end-to-end on Cloudflare's own customer Developer Platform primitives: R2 for storage, Workers for compute, Cloudflare Access for authentication, Workflows for orchestration, Durable Objects for state, D1 for relational metadata, KV for ephemeral caches, and Workers AI for PII classification — "if we were going to make a major investment in our data infrastructure, it was going to be built on the same products we sell to customers." This is the most comprehensive dogfood-as-reference-architecture post in Cloudflare's corpus to date, threading the entire platform across one workload at 1B+ events/sec scale. Five named subsystems sit between the engine and storage tiers: R2 Data Catalog (managed Apache Iceberg service holding cold/warm data with schema evolution + time travel + age-tiered compaction), DataHub (the metadata catalog: every table, column, owner, lineage edge, glossary term), Lifeguard (access-control service — rules in D1, group memberships pulled from the internal IAM, combined JSON policy served to Trino over HTTP), Skimmer (continuous PII detection — fast per-column classifier then agentic second pass with full table context), and Transformer (YAML-frontmatter-defined SQL DAG ELT engine running on Workflows — DO-managed state, R2- stored definitions, D1 run history). The query engine is Apache Trino: "a single SQL query can join a Postgres table, a ClickHouse table, and an Iceberg table on R2 without a need to materialize the intermediate results into a different system" (predicate pushdown into source, dimension joins in Postgres, ranking in R2 — one plan, one engine). Skipper layers five tiers of grounded context (DataHub schema/usage metadata → human annotations → code-derived knowledge from the Transformer pipeline emitting .meta.json per node → curated data-model pages → runtime-introspection escape hatch — DESCRIBE / DISTINCT / COUNT issued live to Trino as a safety net). The MCP integration uses Code Mode: two tools (search, execute) instead of 30 individual ones; the model writes a JavaScript snippet calling Skipper's full toolset programmatically, executed in a sandboxed Dynamic Worker isolate via WorkerLoader"faster, cheaper, and the workflows it produces are auditable as code." Default- closed governance: tables are inaccessible until reviewed — Skimmer scans, classifies columns, registers as pending in the allowlist; users see what tables exist (schema discovery) but unreviewed columns are hidden from DESCRIBE / SHOW COLUMNS / SELECT * ( schema discovery vs data access separation); query errors are "this table needs review, click here to request one" not "permission denied" — Skipper even suggests the right RBAC group. PII is opt-in per session (concepts/opt-in-pii-redaction-per-session) — Trino redacts sensitive columns by default; legitimate need flips a session bit, permissions re-checked, redaction lifted, every flip + query logged. The security model is the data model (concepts/security-model-as-data-model): everything Skipper does runs as the calling user; if the user can't query a table, Skipper can't query it for them; saved queries shared with a teammate are checked at view time, not save time because group membership changes. Embedded dashboards ride on a single placeholder div + script tag, iframe auto-resizes, CSP frame-ancestors blocks embedding outside corporate domain, Cloudflare Access still gates iframe contents, non-owner viewers are checked against the underlying tables — pattern is the zero-trust iframe. Operational disclosures: "more than a billion events every second" on Cloudflare's network; 53% of all Town Lake queries are billing-related, 91,760 queries from 324 distinct employees in a recent measurement period; legacy "200–300 line" revenue rollup SQL is now five lines; "top 100 customers by revenue" and "how many domains that signed up today are in the top 100" take "about three seconds in Skipper"; an LLM seeing customer_type = paygo "defaults to paygo when Salesforce data is missing"the kind of nuance that separates a correct answer from a confidently wrong one comes from the SQL pipeline, not the column description (concepts/code-as-context-for-data-agents). Four design lessons named explicitly: (1) less prompting is more"the model is good at reasoning about analytical workflows; it doesn't need to be micromanaged" (concepts/less-prompting-is-more-for-analytical-agents); (2) tool overlap is poison"the model got confused and called the wrong one. We consolidated. Now fetch_results has a mode parameter (inject / display / both) instead of three separate tools" (concepts/tool-overlap-poisons-agent-accuracy); (3) code, not metadata, captures meaning — biggest accuracy wins came from ingesting the SQL that produces a table, not its schema (concepts/code-as-context-for-data-agents); (4) memory matters more than expected"the agent rediscovers and re-learns these every conversation. With one, it gets monotonically better at the recurring questions a team actually asks" (concepts/agent-infrastructure-memory). And the closing thesis: the boring infrastructure is the hard part"Trino + Iceberg is not new technology. The hard work is in the boring stuff: per-row access control, default-closed table allowlisting, query auditing, time-bound credentials, PII detection, idempotent ingestion, schema evolution" — those are "the things that make a data platform safe to actually use." The future direction names two paths: deeper integration with internal chat + ticketing systems for "ask the data" as the debugging-first-move, and migrating Town Lake's workflow over to R2 SQL (Cloudflare's serverless distributed analytics query engine) as its feature set expands.

Key takeaways

  1. Single SQL interface over heterogeneous sources, no intermediate materialization. "A single SQL query can join a Postgres table, a ClickHouse table, and an Iceberg table on R2 without a need to materialize the intermediate results into a different system. A query that asks 'what are the top 100 paying customers by Workers requests this week' compiles into a plan that pushes filters into ClickHouse, joins against an account dimension in Postgres, and ranks against billing rollups in R2, all in one go." Trino is the engine; predicate pushdown into the source tier is the structural property. Canonicalised at patterns/single-sql-interface-over-heterogeneous-sources.

  2. Iceberg + R2 as the cold/warm tier, with age-based recompaction as the cost lever. "Per-minute usage from last week becomes hourly, hourly from last quarter becomes daily, etc. The storage cost decreases as recency does, while the data stays queryable. Parquet files in R2 are much cheaper compared to keeping the same data in an OLAP database." Iceberg supplies schema evolution + time travel + partition evolution. The recency-tiered compaction is the cost-vs-fidelity dial.

  3. Default-closed governance — the inversion of the traditional "open-by-default, lock-down-by-exception" stance. "Town Lake takes the opposite approach. Tables are inaccessible for querying until they have been reviewed." The cost is bounded by automation: Skimmer's classifier "catches obvious PII (emails, IPs, names, phone numbers) and the long tail of non-obvious sensitive data (API tokens that match certain prefixes, opaque IDs that can be traced back to users). Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds." Canonicalised at concepts/default-closed-table-allowlist + patterns/default-closed-allowlist-with-automated-pii-scan.

  4. Two-pass PII classifier — fast per-column, then agentic with table context. "It does this in two passes: first, a fast per-column classifier; then, if anything is flagged, an agentic second pass that gets full table context and can query Trino directly to verify." The structural argument is that per-column classification has limited recall on opaque IDs that are only PII when joined; the agentic pass takes table-level context to verify. Canonicalised at patterns/two-pass-pii-classifier-with-agentic-second-pass.

  5. Schema discovery is separated from data access. "Users can see what tables exist, but unreviewed columns are hidden from DESCRIBE and SHOW COLUMNS and from SELECT *. That subtle distinction matters: it means a new unreviewed column doesn't break existing dashboards built on the rest of an approved table." This is the load-bearing affordance that lets default-closed governance not become operationally hostile. Canonicalised at concepts/schema-discovery-vs-data-access-separation.

  6. PII redaction is opt-in per session, every flip logged. "By default, Trino redacts sensitive columns before they ever hit your screen. If you have a legitimate need for raw PII (e.g., fraud investigation), you flip the bit on the session, your permissions are checked, and the redaction is lifted. The flip and every query is logged." The session as the unit of PII opt-in (rather than the user, the table, or the query) is the deliberate choice. Canonicalised at concepts/opt-in-pii-redaction-per-session.

  7. Error messages are self-serve permission requests, not denials. "If you query a table you don't have access to, the error message is not 'permission denied.' It's 'this table needs review, click here to request one.' Skipper, the AI agent, will even suggest the right RBAC group to request and link you straight to it." Default-closed without self-serve-on-error would be operationally hostile; the design pairs them. Canonicalised at patterns/error-message-as-self-serve-permission-request.

  8. Five layers of grounded context for the AI agent. Skipper "learned the hard way" that "an LLM, given a SQL prompt and a list of table names, can hallucinate joins, misuse columns, and confidently produce a number that is completely wrong." The stack:

  9. Layer 1: schema + usage metadata from DataHub (every column, type, primary key, foreign key + tables historically joined together).
  10. Layer 2: human annotations — table descriptions written by owning teams, glossary terms, curated tag for validated tables.
  11. Layer 3: code-derived knowledge"the most valuable context is not in any catalog: it's in the SQL that produces the table. The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly."
  12. Layer 4: curated data-model pages — "short, human-written documents that describe how to think about billing, customers, accounts, and zones." Surfaced as MCP resources.
  13. Layer 5: runtime introspection — Skipper can issue live queries: DESCRIBE table, SELECT DISTINCT col LIMIT 20, SELECT COUNT(*). "It uses these sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust." Canonicalised at concepts/layered-grounded-context-for-data-agent.

  14. Code Mode for the MCP server. "For our MCP server, we use Code Mode. Instead of defining 30 individual tools, we expose two: search and execute. The model writes a JavaScript snippet that calls our entire toolset programmatically: const datasets = await skipper.search_datasets(...); const queryId = await skipper.start_query(...); return skipper.create_chart(...). That JavaScript runs in a sandboxed Dynamic Worker isolate via WorkerLoader. The model gets to express complex multi-step workflows in a single round-trip, in a language it already knows extremely well. It's faster, it's cheaper, and the workflows it produces are auditable as code." Two tools collapse a five-tool-multi-round-trip workflow into one. Canonicalised at patterns/code-mode-mcp-for-data-agent.

  15. The security model is the data model. "Everything Skipper does runs as the calling user. If you don't have access to a table, Skipper can't query it for you. If you ask for PII, your permissions are checked. If a query you save is shared with a teammate, their access is checked at view time, not at save time, because group membership changes." The agent has no privilege escalation — it inherits exactly the calling user's authority. Canonicalised at concepts/security-model-as-data-model.

  16. Embedded dashboards = single-tag iframe + CSP + Cloudflare Access. Embedding contract:

    <div data-skipper-dashboard="dash-123"></div>
    <script src="https://skipper.cloudflare.com/embed.js" async></script>
    
    The iframe auto-resizes; CSP frame-ancestors blocks embedding from anywhere outside the corporate domain; Cloudflare Access still gates the iframe contents, "so an unauthenticated viewer hits the Access login page in the iframe rather than seeing the data"; non-owner viewers are checked against the underlying tables — if they don't have access, "they get pointed at the right group to request." Canonicalised at patterns/embedded-dashboard-with-zero-trust-iframe.

  17. Less prompting is more. "Early versions of Skipper had elaborate, prescriptive system prompts: 'First, use search_datasets. Then, use get_entity_details. Then, use list_schema_fields if needed...'. Quality went down. The model is good at reasoning about analytical workflows; it doesn't need to be micromanaged. We replaced the prescriptive prompts with high-level guidance and let the model pick its own path. Results got better." Canonicalised at concepts/less-prompting-is-more-for-analytical-agents.

  18. Tool overlap is poison. "We initially exposed every variant of every tool: three different 'fetch results' tools, two 'search' tools, several 'list' tools. The model got confused and called the wrong one. We consolidated. Now fetch_results has a mode parameter (inject / display / both) instead of three separate tools. Every tool has a single reason to exist." Canonicalised at concepts/tool-overlap-poisons-agent-accuracy.

  19. Code captures meaning that descriptions never do. "A customer_type column with values contract, paygo, free looks identical in either context, but the SQL tells you that customer_type defaults to paygo when Salesforce data is missing. That kind of context never lives in column descriptions." The implication: if you want a data agent to be reliable, ingest the transformations, not just the catalogs. Canonicalised at concepts/code-as-context-for-data-agents.

  20. Memory matters more than expected. "There is a long tail of corrections that look like 'you have to filter for X like this' or 'ignore tables tagged Y.' Without a memory layer, the agent rediscovers and re-learns these every conversation. With one, it gets monotonically better at the recurring questions a team actually asks." Sibling to existing concepts/agent-infrastructure-memory (Grafana Assistant) — same lesson observed independently in two production deployments.

  21. The boring infrastructure is the hard part. "Trino + Iceberg is not new technology. The hard work is in the boring stuff: per-row access control, default-closed table allowlisting, query auditing, time-bound credentials, PII detection, idempotent ingestion, schema evolution. Those are the things that make a data platform safe to actually use." The closing thesis is the architectural claim of the post — the differentiator isn't the engine choice, it's the governance fabric.

Architecture extracted

Top-level topology

operational substrates                      query / agent layer
─────────────────────                       ───────────────────
Postgres (account metadata)  ──┐
ClickHouse (analytics events) ─┤            Skipper (AI agent)
Kafka (real-time signals)     ─┼─► Trino ──◄    ▲
R2/Iceberg (rollups, logs)    ─┤   (single        │ Code Mode
BigQuery (legacy rollups)     ─┘   SQL plane)     │ MCP server
                                          │       │
                                  ┌───────┴───────┴─────┐
                                  │ Lifeguard (D1 + IAM │
                                  │  → JSON policy)     │
                                  └─────────────────────┘
        ┌────────────┬────────────┬───────┴────┬──────────────┐
        │            │            │            │              │
   DataHub      Skimmer      Transformer    R2 Data      Ingestion
   (metadata    (PII scan,   (ELT engine    Catalog      (k8s
    catalog,     2-pass      on Workflows;  (managed     orchestrator
    lineage,     classifier) DAG of SQL     Iceberg)     + worker
    glossary)                with YAML                    pipelines)
                             frontmatter)

Storage tiering (Iceberg recompaction shape)

  • Per-minute usage — last week.
  • Hourly — last quarter.
  • Daily — beyond.

Each demotion preserves queryability; only fidelity drops. The storage-cost-vs-recency curve is the design lever.

Lifeguard policy resolution

user query ──► Trino ──► HTTP fetch ──► Lifeguard ──► JSON policy
                                            ├─► D1 (access rules)
                                            ├─► internal IAM (group memberships)
                                            └─► Skimmer allowlist (per-table review state)

Lifeguard "feeds basic access information to Skipper and the Gateway, so users get blocked at the front door rather than at query time" — the agent layer pre-checks before generating SQL.

Skimmer two-pass classifier

table created / connected
Pass 1: per-column fast classifier (Workers AI)
        ├── nothing flagged ──► register as pending review (column-level)
        └── flagged ──► Pass 2: agentic classifier
                              ├── reads full table context
                              ├── can query Trino directly to verify
                              └── findings → DataHub + Lifeguard allowlist (pending)
                              human reviewer: approve / override / deny
                              column unlocked OR explicitly denied

Transformer ELT engine

  • Definition: directed-acyclic graph of SQL transformations.
  • YAML frontmatter per node: target table, materialization mode, dependencies, schedule.
  • Compilation: Transformer compiles the DAG and runs it on Trino.
  • State: managed by Durable Objects (per-DAG state machine, retries, idempotency).
  • Definitions: stored in R2 (versioned).
  • Run history: in D1.
  • Documentation emission: per-node .meta.json written to DataHub on every successful run — this is the substrate of Skipper's Layer 3 grounded context.
  • Canonicalised at patterns/elt-on-workflows-with-do-state.

Skipper grounded context (5 layers, summarised)

Layer Source Lifecycle
1 DataHub schema + usage metadata Continuously updated
2 Human annotations (table descriptions, glossary, curated tags) Owner-team-maintained
3 Code-derived .meta.json from Transformer pipeline Per-run emission, automatic
4 Curated data-model pages (markdown, MCP resources) Hand-written, occasional
5 Runtime introspection (DESCRIBE / DISTINCT / COUNT to Trino) Per-query, expensive — used as safety net

Skipper Code Mode shape

Tool surface exposed as two MCP tools: search, execute. The model writes a single JavaScript snippet:

const datasets = await skipper.search_datasets({ query: "billing product revenue" })
const queryId = await skipper.start_query({ sql: "SELECT ..." })
const results = await skipper.fetch_results({ queryId, mode: "inject" })
return skipper.create_chart({ chartType: "bar", data: results.rows, ... })

The snippet runs in a sandboxed Dynamic Worker isolate via WorkerLoader. Five-tool workflow collapses into one round-trip; "the workflows it produces are auditable as code."

Embedded dashboard contract

<div data-skipper-dashboard="dash-123"></div>
<script src="https://skipper.cloudflare.com/embed.js" async></script>

Stack: iframe auto-resize → CSP frame-ancestors (corporate domain only) → Cloudflare Access gates iframe contents → underlying-table permission check at view time → fallback link to RBAC group request.

Operational numbers (verbatim from post)

  • >1 billion events per second on Cloudflare's network (the motivating scale).
  • 330+ cities, 120+ countries — network footprint.
  • 53% of all Town Lake queries are billing-related — by far the largest internal use case.
  • 91,760 queries from 324 distinct Cloudflare employees in a recent measurement period.
  • Legacy revenue-rollup SQL: 200–300 lines → now 5 lines in Skipper.
  • "top 100 customers by revenue" / "how many domains that signed up today are in the top 100"~3 seconds in Skipper.
  • "Find the top 100 billing support tickets from customers who have spent >$100""used to be a multi-day project. Now it's a Skipper query."

Named use cases

Use case Surface Notes
Billing Billable Usage Dashboard Iceberg-backed metering pipeline; "the number on the dashboard matches the number on the bill." 53% of queries
Business intelligence Skipper conversational Top-customer / top-domain queries, ~3s latency
Security analytics Bot Management ML scoring "score > 0.9 in the last 48 hours filtered by ASN and geography"; threat researcher toolkit
Trust & Safety Direct SQL + dashboards Abuse signal pulls
Customer support Skipper conversational Top-N support tickets by spend

Caveats

  • Tier-1 product-narrative writeup — architecturally rich but no benchmark numbers (Trino throughput, Iceberg compaction cost-per-tenant, Lifeguard policy resolution latency, Skimmer classifier precision/recall on PII) and no failure-mode war stories (the bottlenecks they hit and how they widened them). The post is the architectural shape, not the operations log.
  • No QPS / concurrency / cardinality numbers for Town Lake itself — the only quantitative envelope is "91,760 queries from 324 employees" over a recent period (window unspecified).
  • R2 SQL migration is named only as a future direction — no detail on which Town Lake workloads move first, what the migration shape is, or how Trino + R2 SQL coexist during the transition.
  • Skimmer false-positive / false-negative rates are not disclosed — "Reviewers see what was detected and either approve, override, or deny. Most reviews take seconds" is the only quantitative claim.
  • Internal-IAM / group-membership latency is not characterised — the policy resolution path Lifeguard runs (D1 + IAM → JSON policy → Trino over HTTP) is described structurally, not performance-wise.
  • No cost numbers for the lakehouse vs the previous "external vendors for some of our internal data" state.
  • Skipper's memory layer is named but not detailed"a long tail of corrections... With one, it gets monotonically better" leaves open: per-user, per-team, or per-workspace? Persistence backend? Conflict resolution across users? Sibling to concepts/agent-infrastructure-memory (Grafana Assistant); the wiki has parallel framings but neither post discloses the full mechanism.
  • The author of the post is not named — corporate-blog voice; no individual engineer attribution.

Source

Last updated · 542 distilled / 1,571 read