SYSTEM Cited by 1 source
Transformer (Cloudflare ELT engine)¶
Transformer is Cloudflare's ELT (extract, load, transform) engine inside Town Lake. It is a SQL-DAG orchestrator built on Cloudflare's customer-facing Developer Platform: DAG state lives in Durable Objects, definitions in R2, and run history in D1, while execution happens on Workflows running SQL against Trino. It was introduced publicly in the 2026-05-28 launch post.
Naming disambiguation¶
Two distinct Transformer entries exist on this wiki:
- Transformer (Cloudflare) — this page; the ELT engine.
- Transformer (ML architecture) — the attention-based neural-network architecture.
How users define a pipeline¶
"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule). Transformer compiles the graph and runs it on Trino, with state managed by Durable Objects, definitions stored in R2, and run history in D1."
A Transformer node is a SQL file with YAML frontmatter:
```sql
---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - dim.customers
  - seed.product_classification
schedule: "0 */6 * * *"
---
SELECT
  account_id,
  customer_id,
  CASE
    WHEN billing_period = 'annual' THEN billed_amount / 12
    ELSE billed_amount
  END AS alloc_amount,
  ...
FROM dim.accounts
JOIN dim.customers USING (customer_id)
JOIN seed.product_classification ON ...
```
(Schema is illustrative — the post doesn't print the YAML form
verbatim, only describes the four fields. The example fact table
fct.billings_allocated and its alloc_amount computation are
quoted.)
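To make the "SQL file with YAML frontmatter" shape concrete, here is a minimal sketch of splitting such a file into its metadata and SQL body. It is not Transformer's parser: the function name `split_node`, the sample SQL, and the deliberately tiny YAML subset (flat `key: value` pairs plus one level of `- item` lists) are all assumptions made for illustration.

```python
def split_node(text):
    """Split a node file into (frontmatter dict, SQL body).

    Hypothetical helper: handles only flat `key: value` pairs and simple
    `- item` lists, which is enough for the four fields named on this page.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("node must start with a '---' frontmatter fence")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed with '---'") from None

    meta, current_list = {}, None
    for raw in lines[1:end]:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current_list is None:
                raise ValueError(f"list item outside a list: {line!r}")
            current_list.append(line[2:].strip())
            continue
        key, _, value = line.partition(":")   # split on the first colon only
        value = value.strip().strip('"')
        if value:
            meta[key.strip()] = value
            current_list = None
        else:
            # A key with no inline value opens a list, e.g. `dependencies:`
            current_list = meta.setdefault(key.strip(), [])
    return meta, "\n".join(lines[end + 1:]).strip()


# Illustrative node; the SQL body is a placeholder, not the real query.
NODE = """\
---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - dim.customers
schedule: "0 */6 * * *"
---
SELECT account_id FROM dim.accounts
"""

meta, sql = split_node(NODE)
```

A real implementation would more likely use a full YAML library; the hand-rolled parser only keeps the sketch dependency-free.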
Architecture decomposition¶
| Concern | Substrate | Why |
|---|---|---|
| Execution | Workflows | Durable, retryable orchestration |
| Per-DAG state | Durable Objects | Strongly-consistent state machine per DAG; gates retries / idempotency |
| Definitions | R2 | Versioned, durable, cheap to store |
| Run history | D1 | Relational metadata for query/report on past runs |
| SQL execution | Trino | The Town Lake query engine |
| Documentation emission | DataHub | Per-node .meta.json written on every successful run |
The DO-state + Workflows-execution + R2-definitions + D1-history decomposition is canonicalised at patterns/elt-on-workflows-with-do-state — it is the standard shape for "customer-built pipeline as durable workflow on Cloudflare's Developer Platform" (sibling to patterns/ci-pipeline-as-customer-authored-durable-workflow from the Workflows reference architecture).
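The "gates retries / idempotency" role of per-DAG state can be sketched as a small state machine. This is an in-memory stand-in written under assumptions: the class name `DagState`, the status values, and the `(run_id, node)` key are hypothetical, and the launch post does not describe the actual Durable Object schema.

```python
from dataclasses import dataclass, field


@dataclass
class DagState:
    """Hypothetical per-DAG state; not Transformer's real data model."""
    # (run_id, node) -> "running" | "succeeded" | "failed"
    status: dict = field(default_factory=dict)

    def try_start(self, run_id, node):
        """Return True only if this (run, node) step may execute now."""
        key = (run_id, node)
        if self.status.get(key) in ("running", "succeeded"):
            return False  # duplicate delivery, or the step already finished
        self.status[key] = "running"  # new step, or a retry after failure
        return True

    def finish(self, run_id, node, ok):
        """Record the outcome so a later retry can be allowed or refused."""
        self.status[(run_id, node)] = "succeeded" if ok else "failed"


state = DagState()
first = state.try_start("run-1", "fct.billings_allocated")      # allowed
duplicate = state.try_start("run-1", "fct.billings_allocated")  # refused
state.finish("run-1", "fct.billings_allocated", ok=False)
retry = state.try_start("run-1", "fct.billings_allocated")      # allowed again
```

The point of the sketch is the gate itself: a step runs at most once per run unless its previous attempt failed, which is the kind of decision a single strongly consistent owner per DAG can make without coordination.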
The .meta.json emission — Skipper's Layer 3 grounded context¶
The architecturally distinctive feature, repeatedly stressed in the launch post:
"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly. That's the kind of nuance that separates a correct answer from a confidently wrong one."
This is the substrate of concepts/code-as-context-for-data-agents
— the design lesson that the SQL that produces a table carries
semantic information no column description can capture. The
post quotes the canonical example: "A customer_type column
with values contract, paygo, free looks identical in
either context, but the SQL tells you that customer_type
defaults to paygo when Salesforce data is missing. That kind of
context never lives in column descriptions."
The architectural property is that .meta.json is regenerated
per successful run — it stays in sync with the actual
transformation logic for free, with no separate documentation
maintenance. This is the structural argument against
hand-written column comments as the primary semantic substrate.
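A rough sketch of why regenerating documentation from the node keeps it in sync: if the record is derived from the SQL on each run, any change to the SQL changes the record. The field names (`table`, `built_from`, `sql`, `sql_sha256`) and the function `build_meta` are assumptions; the post does not publish the .meta.json schema.

```python
import hashlib
import json


def build_meta(target_table, dependencies, sql):
    """Assemble a documentation record for one node (hypothetical fields)."""
    return {
        "table": target_table,
        "built_from": sorted(dependencies),
        "sql": sql,  # the transformation logic itself is the context
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }


SQL = (
    "CASE WHEN billing_period = 'annual' THEN billed_amount / 12 "
    "ELSE billed_amount END AS alloc_amount"
)

doc = build_meta(
    "fct.billings_allocated",
    ["dim.customers", "dim.accounts", "seed.product_classification"],
    SQL,
)

# Serialized form that a run could emit alongside the table.
payload = json.dumps(doc, indent=2, sort_keys=True)
```

Because the record carries the SQL text, a reader of the documentation sees how alloc_amount is computed rather than only its name and type.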
Self-serve data engineering as the future direction¶
The post is explicit about the broader product vision:
"We're investing heavily in the Transformer pipeline. The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."
Two structural commitments:
- YAML + SQL is the entire developer surface — no special ELT DSL, no proprietary orchestration UI.
- Surfacing in DataHub + Skipper is automatic — registering the table for governance + AI agent context is a side effect of a successful run, not a separate step.
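Since each node declares its upstream tables in frontmatter, a run order can in principle be derived from the files alone with a plain topological sort. The sketch below uses Python's standard `graphlib`; the `nodes` mapping and the `execution_order` helper are hypothetical, and nothing here is claimed to match how Transformer compiles its graph.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node -> upstream dependencies map, reusing this page's table names.
nodes = {
    "fct.billings_allocated": [
        "dim.accounts",
        "dim.customers",
        "seed.product_classification",
    ],
    "dim.accounts": [],
    "dim.customers": [],
    "seed.product_classification": [],
}


def execution_order(deps):
    """Return node names so that every dependency precedes its dependents."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = execution_order(nodes)
```

With this input the fact table always comes last, because all three of its upstream tables must be built first; a cyclic dependency is rejected instead of being scheduled.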
Position in Town Lake¶
Transformer's output fans out into the rest of the platform:
- Output: writes Iceberg tables on R2 Data Catalog (cold/warm tier) via Trino INSERT/MERGE.
- Metadata: DataHub gets schema + lineage + per-node .meta.json documentation.
- Governance: new tables enter Skimmer's scan queue and become Lifeguard-allowlist-pending until reviewed (default-closed).
- Agent context: Skipper's Layer 3 grounded context layer is populated automatically as a Transformer run side-effect.
Caveats¶
The launch post discloses Transformer's architecture and goals but does not disclose:
- Throughput / concurrency numbers (DAGs/minute, queries/minute).
- Cost numbers per pipeline.
- Failure-mode handling for partial DAG runs.
- How idempotency is enforced for non-deterministic SQL (e.g., NOW()-based filters).
- Schema-evolution propagation rules (when an upstream dim.X changes, what happens to downstream fct.Y).
- The YAML frontmatter schema verbatim: the four named fields (target table, materialization mode, dependencies, schedule) are described but not formally specified.
Seen in¶
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki source.
Related¶
- systems/cloudflare-town-lake — the platform Transformer populates.
- systems/cloudflare-skipper — Skipper's Layer 3 grounded context comes from Transformer's .meta.json emission.
- systems/cloudflare-workflows — the orchestration substrate.
- systems/cloudflare-durable-objects — per-DAG state.
- systems/cloudflare-d1 — run history.
- systems/cloudflare-r2 — definitions storage.
- systems/datahub — metadata destination.
- systems/trino — execution engine.
- concepts/code-as-context-for-data-agents — the architectural insight Transformer's .meta.json is the substrate of.
- patterns/elt-on-workflows-with-do-state — the canonical wiki pattern.
- companies/cloudflare