
Transformer (Cloudflare ELT engine)¶

Transformer is Cloudflare's ELT (extract, load, transform) engine inside Town Lake. It is a SQL-DAG orchestrator built on Cloudflare's customer-facing Developer Platform: DAG state lives in Durable Objects, definitions in R2, and run history in D1, while Workflows handles execution by running SQL against Trino. It was introduced publicly in the 2026-05-28 launch post.

Naming disambiguation¶

Two distinct Transformer entries exist on this wiki:

Transformer (Cloudflare) — this page; the ELT engine.
Transformer (ML architecture) — the attention-based neural-network architecture.

How users define a pipeline¶

"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule). Transformer compiles the graph and runs it on Trino, with state managed by Durable Objects, definitions stored in R2, and run history in D1."

A Transformer node is a SQL file with YAML frontmatter:

---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - dim.customers
  - seed.product_classification
schedule: "0 */6 * * *"
---

SELECT
  account_id,
  customer_id,
  CASE
    WHEN billing_period = 'annual' THEN billed_amount / 12
    ELSE billed_amount
  END AS alloc_amount,
  ...
FROM dim.accounts
JOIN dim.customers USING (customer_id)
JOIN seed.product_classification ON ...

(The frontmatter schema shown here is illustrative: the post describes the four fields but does not print the YAML form verbatim. The example fact table fct.billings_allocated and its alloc_amount computation are taken from the post.)
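To make "compiles the graph" concrete, here is a minimal Python sketch of how node files in the illustrative form above could be parsed and ordered by dependency. It is not Transformer's actual compiler, which the post does not describe. The function names `parse_node` and `execution_order` and the table `raw.accounts` are hypothetical, and the parser only handles the flat frontmatter used in the example rather than full YAML.

```python
from graphlib import TopologicalSorter


def parse_node(text: str) -> dict:
    """Split a node file into its frontmatter fields and SQL body.

    Only understands the flat `key: value` and `- item` forms used in the
    example node; real YAML would need a real YAML parser.
    """
    _, header, sql = text.split("---", 2)
    meta: dict = {}
    current_list = None
    for raw in header.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current_list is None:
                raise ValueError(f"list item without a key: {line!r}")
            current_list.append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        if value:
            meta[key.strip()] = value
            current_list = None
        else:
            # A key with no inline value starts a list (e.g. dependencies).
            current_list = meta.setdefault(key.strip(), [])
    meta["sql"] = sql.strip()
    return meta


def execution_order(nodes: list[dict]) -> list[str]:
    """Return target tables in dependency order.

    Raises graphlib.CycleError if the declared dependencies form a cycle.
    """
    targets = {n["target_table"] for n in nodes}
    sorter = TopologicalSorter()
    for n in nodes:
        # Assumption: only dependencies that are themselves nodes constrain
        # the order; seeds and external tables are taken to exist already.
        deps = [d for d in n.get("dependencies", []) if d in targets]
        sorter.add(n["target_table"], *deps)
    return list(sorter.static_order())


FACT_NODE = '''---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - seed.product_classification
schedule: "0 */6 * * *"
---
SELECT account_id FROM dim.accounts
'''

DIM_NODE = '''---
target_table: dim.accounts
materialization: incremental
---
SELECT account_id FROM raw.accounts
'''

order = execution_order([parse_node(FACT_NODE), parse_node(DIM_NODE)])
print(order)  # ['dim.accounts', 'fct.billings_allocated']
```

The standard-library `graphlib.TopologicalSorter` does the ordering and rejects cycles, which is the "acyclic" check a DAG compiler needs before it schedules anything.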

Architecture decomposition¶

Concern	Substrate	Why
Execution	Workflows	Durable, retryable orchestration
Per-DAG state	Durable Objects	Strongly-consistent state machine per DAG; gates retries / idempotency
Definitions	R2	Versioned, durable, cheap to store
Run history	D1	Relational metadata for query/report on past runs
SQL execution	Trino	The Town Lake query engine
Documentation emission	DataHub	Per-node `.meta.json` written on every successful run

The DO-state + Workflows-execution + R2-definitions + D1-history decomposition is canonicalised at patterns/elt-on-workflows-with-do-state — it is the standard shape for "customer-built pipeline as durable workflow on Cloudflare's Developer Platform" (sibling to patterns/ci-pipeline-as-customer-authored-durable-workflow from the Workflows reference architecture).
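The "gates retries / idempotency" role of the per-DAG state holder can be sketched as a small state machine. The post does not disclose the real state model, so everything below is an assumption: the class `DagRunState`, its `claim` and `finish` methods, the four statuses, and the retry budget are hypothetical, and a plain in-memory Python object stands in for a Durable Object.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class DagRunState:
    """In-memory stand-in for a single-writer, per-DAG state holder."""

    max_attempts: int = 3
    status: dict[str, Status] = field(default_factory=dict)
    attempts: dict[str, int] = field(default_factory=dict)

    def claim(self, run_id: str, node: str) -> bool:
        """Return True only if the caller may execute this node for this run."""
        key = f"{run_id}:{node}"
        current = self.status.get(key, Status.PENDING)
        if current in (Status.RUNNING, Status.SUCCEEDED):
            return False  # duplicate delivery, or the work is already done
        if self.attempts.get(key, 0) >= self.max_attempts:
            return False  # retry budget exhausted
        self.attempts[key] = self.attempts.get(key, 0) + 1
        self.status[key] = Status.RUNNING
        return True

    def finish(self, run_id: str, node: str, ok: bool) -> None:
        """Record the outcome of a node that was previously claimed."""
        key = f"{run_id}:{node}"
        if self.status.get(key) is not Status.RUNNING:
            raise ValueError(f"{key} is not running")
        self.status[key] = Status.SUCCEEDED if ok else Status.FAILED


state = DagRunState(max_attempts=2)
first = state.claim("run-1", "fct.billings_allocated")      # True
duplicate = state.claim("run-1", "fct.billings_allocated")  # False
state.finish("run-1", "fct.billings_allocated", ok=False)
retry = state.claim("run-1", "fct.billings_allocated")      # True
```

Because every step asks one strongly consistent owner before it runs, a retried or duplicated step cannot execute a node twice within the same run. That is the property the table attributes to Durable Objects.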

The `.meta.json` emission — Skipper's Layer 3 grounded context¶

The architecturally distinctive feature, repeatedly stressed in the launch post:

"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly. That's the kind of nuance that separates a correct answer from a confidently wrong one."

This is the substrate of concepts/code-as-context-for-data-agents — the design lesson that the SQL that produces a table carries semantic information no column description can capture. The post quotes the canonical example: "A customer_type column with values contract, paygo, free looks identical in either context, but the SQL tells you that customer_type defaults to paygo when Salesforce data is missing. That kind of context never lives in column descriptions."

The architectural property is that .meta.json is regenerated per successful run — it stays in sync with the actual transformation logic for free, with no separate documentation maintenance. This is the structural argument against hand-written column comments as the primary semantic substrate.
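The "regenerated per successful run" property can be illustrated with a short Python sketch that derives a documentation record directly from a node's definition. The post does not publish the .meta.json schema, so the function `build_meta` and every field name in the output (`target_table`, `upstream`, `sql`, `sql_sha256`) are hypothetical.

```python
import hashlib
import json


def build_meta(node: dict) -> str:
    """Render a documentation record from a node's definition.

    Hypothetical schema: the record carries the SQL itself plus a content
    hash, so it cannot drift from the logic that produced the table.
    """
    sql = node["sql"]
    doc = {
        "target_table": node["target_table"],
        "materialization": node["materialization"],
        "upstream": sorted(node["dependencies"]),
        "sql": sql,
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }
    return json.dumps(doc, indent=2, sort_keys=True)


node = {
    "target_table": "fct.billings_allocated",
    "materialization": "incremental",
    "dependencies": ["dim.customers", "dim.accounts", "seed.product_classification"],
    "sql": "SELECT CASE WHEN billing_period = 'annual' "
           "THEN billed_amount / 12 ELSE billed_amount END AS alloc_amount "
           "FROM dim.accounts",
}

meta_json = build_meta(node)
```

Since the record is a pure function of the node definition, rewriting it after each successful run keeps the documentation aligned with the transformation without a separate maintenance step.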

Self-serve data engineering as the future direction¶

The post is explicit about the broader product vision:

"We're investing heavily in the Transformer pipeline. The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."

Two structural commitments:

YAML + SQL is the entire developer surface — no special ELT DSL, no proprietary orchestration UI.
Surfacing in DataHub + Skipper is automatic — registering the table for governance + AI agent context is a side effect of a successful run, not a separate step.

Position in Town Lake¶

Transformer fans into the rest of the platform:

Output: writes Iceberg tables on R2 Data Catalog (cold/warm tier) via Trino INSERT/MERGE.
Metadata: DataHub gets schema + lineage + per-node .meta.json documentation.
Governance: new tables enter Skimmer's scan queue and stay pending on the Lifeguard allowlist until reviewed (default-closed).
Agent context: Skipper's Layer 3 grounded context is populated automatically as a side effect of each successful Transformer run.

Caveats¶

The launch post discloses Transformer's architecture and goals but does not disclose:

Throughput / concurrency numbers (DAGs/minute, queries/minute).
Cost numbers per pipeline.
Failure-mode handling for partial DAG runs.
How idempotency is enforced for non-deterministic SQL (e.g., NOW()-based filters).
Schema-evolution propagation rules (when an upstream dim.X changes, what happens to downstream fct.Y).
The YAML frontmatter schema verbatim — the four named fields (target table, materialization mode, dependencies, schedule) are described but not formally specified.

Seen in¶

sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki source.

systems/cloudflare-town-lake — the platform Transformer populates.
systems/cloudflare-skipper — Skipper's Layer 3 grounded context comes from Transformer's .meta.json emission.
systems/cloudflare-workflows — the orchestration substrate.
systems/cloudflare-durable-objects — per-DAG state.
systems/cloudflare-d1 — run history.
systems/cloudflare-r2 — definitions storage.
systems/datahub — metadata destination.
systems/trino — execution engine.
concepts/code-as-context-for-data-agents — the architectural insight Transformer's .meta.json is the substrate of.
patterns/elt-on-workflows-with-do-state — the canonical wiki pattern.
companies/cloudflare