Transformer (Cloudflare ELT engine)

Transformer is Cloudflare's ELT (extract, load, transform) engine inside Town Lake. It is a SQL-DAG orchestrator built on Cloudflare's customer-facing Developer Platform: DAG state lives in Durable Objects, definitions in R2, and run history in D1, while execution happens in Workflows that run SQL against Trino. It was introduced publicly in the 2026-05-28 launch post.

Naming disambiguation

Two distinct Transformer entries exist on this wiki:

How users define a pipeline

"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule). Transformer compiles the graph and runs it on Trino, with state managed by Durable Objects, definitions stored in R2, and run history in D1."

A Transformer node is a SQL file with YAML frontmatter:

---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - dim.customers
  - seed.product_classification
schedule: "0 */6 * * *"
---

SELECT
  account_id,
  customer_id,
  CASE
    WHEN billing_period = 'annual' THEN billed_amount / 12
    ELSE billed_amount
  END AS alloc_amount,
  ...
FROM dim.accounts
JOIN dim.customers USING (customer_id)
JOIN seed.product_classification ON ...

(The frontmatter key names are illustrative: the post describes the four fields but does not print the YAML form verbatim. The example fact table fct.billings_allocated and its alloc_amount computation are taken from the post.)
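To make the "compiles the graph" step concrete, here is a minimal Python sketch, under stated assumptions, of splitting a node file into frontmatter and SQL and then ordering nodes so that each one runs after its dependencies. The key names follow the illustrative frontmatter above. `parse_node`, `run_order`, `EXAMPLE_NODE`, and the hand-rolled frontmatter reader are hypothetical and are not Transformer's actual compiler, which the post does not describe.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical node file, shaped like the illustrative example above.
EXAMPLE_NODE = """---
target_table: fct.billings_allocated
materialization: incremental
dependencies:
  - dim.accounts
  - dim.customers
  - seed.product_classification
schedule: "0 */6 * * *"
---

SELECT account_id, customer_id
FROM dim.accounts
JOIN dim.customers USING (customer_id)
"""


def parse_node(text):
    """Split a node file into (frontmatter dict, SQL body).

    Handles only `key: value` scalars and `- item` lists, which is all the
    illustrative frontmatter uses. A real tool would use a YAML parser.
    """
    _, header, sql = text.split("---", 2)
    meta, current_key = {}, None
    for raw in header.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # List item: belongs to the most recent key with an empty value.
            meta[current_key].append(line[2:].strip())
        else:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip().strip('"')
            meta[current_key] = value if value else []
    return meta, sql.strip()


def run_order(nodes):
    """Return target tables ordered so dependencies come first.

    `nodes` maps target table -> list of dependency tables. Dependencies that
    are not themselves nodes (external sources) are dropped from the result.
    static_order() raises graphlib.CycleError if the graph is not acyclic.
    """
    sorter = TopologicalSorter(nodes)
    return [table for table in sorter.static_order() if table in nodes]
```

Calling `parse_node(EXAMPLE_NODE)` yields the four fields as a dictionary plus the SQL text; feeding the `target_table` to `dependencies` mapping of every node into `run_order` gives one valid execution order.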

Architecture decomposition

| Concern | Substrate | Why |
|---|---|---|
| Execution | Workflows | Durable, retryable orchestration |
| Per-DAG state | Durable Objects | Strongly-consistent state machine per DAG; gates retries / idempotency |
| Definitions | R2 | Versioned, durable, cheap to store |
| Run history | D1 | Relational metadata for query/report on past runs |
| SQL execution | Trino | The Town Lake query engine |
| Documentation emission | DataHub | Per-node .meta.json written on every successful run |

The DO-state + Workflows-execution + R2-definitions + D1-history decomposition is canonicalised at patterns/elt-on-workflows-with-do-state — it is the standard shape for "customer-built pipeline as durable workflow on Cloudflare's Developer Platform" (sibling to patterns/ci-pipeline-as-customer-authored-durable-workflow from the Workflows reference architecture).
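As a rough illustration of what "gates retries / idempotency" could mean for the per-DAG state concern, the following Python sketch models one DAG's state as an in-memory object that refuses to start a node twice for the same run. The class, method names, and status values are hypothetical; in Transformer this role is played by a Durable Object whose interface the post does not describe. The `history` list stands in for run-history rows that would live in D1.

```python
class DagState:
    """In-memory stand-in for a per-DAG state machine (hypothetical)."""

    def __init__(self):
        self.status = {}   # (run_id, node) -> "running" | "succeeded" | "failed"
        self.history = []  # append-only log; stands in for run-history rows

    def try_start(self, run_id, node):
        """Return True only if the caller should execute the node now."""
        key = (run_id, node)
        if self.status.get(key) in ("running", "succeeded"):
            # Duplicate delivery, or a retry of work that already finished.
            return False
        self.status[key] = "running"
        return True

    def finish(self, run_id, node, ok):
        """Record the outcome; a failed node may be started again later."""
        state = "succeeded" if ok else "failed"
        self.status[(run_id, node)] = state
        self.history.append((run_id, node, state))
```

The check-then-set in `try_start` is only safe if every call for a given DAG goes through the same single object, which is the property the decomposition attributes to Durable Objects ("strongly-consistent state machine per DAG").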

The .meta.json emission — Skipper's Layer 3 grounded context

The architecturally distinctive feature, repeatedly stressed in the launch post:

"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly. That's the kind of nuance that separates a correct answer from a confidently wrong one."

This is the substrate of concepts/code-as-context-for-data-agents — the design lesson that the SQL that produces a table carries semantic information no column description can capture. The post quotes the canonical example: "A customer_type column with values contract, paygo, free looks identical in either context, but the SQL tells you that customer_type defaults to paygo when Salesforce data is missing. That kind of context never lives in column descriptions."

The architectural property is that .meta.json is regenerated per successful run — it stays in sync with the actual transformation logic for free, with no separate documentation maintenance. This is the structural argument against hand-written column comments as the primary semantic substrate.
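The post does not publish the .meta.json schema, so the following Python sketch only illustrates the idea of deriving documentation from the node definition itself rather than from hand-written comments. Every field name (`table`, `built_from`, `sql`, `generated_at_run`) and both functions are hypothetical.

```python
import json


def build_meta(target_table, dependencies, sql, run_id):
    """Build a documentation record from the same inputs the run used."""
    return {
        "table": target_table,
        "built_from": sorted(dependencies),
        "sql": sql,                  # the transformation logic, verbatim
        "generated_at_run": run_id,  # ties the record to one successful run
    }


def emit_meta(meta):
    """Serialize with sorted keys so identical input gives identical text."""
    return json.dumps(meta, indent=2, sort_keys=True)
```

Because the record is rebuilt from the node's own SQL and dependency list after each successful run, a change to the transformation logic shows up in the emitted document without anyone editing a description by hand.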

Self-serve data engineering as the future direction

The post is explicit about the broader product vision:

"We're investing heavily in the Transformer pipeline. The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."

Two structural commitments:

  • YAML + SQL is the entire developer surface — no special ELT DSL, no proprietary orchestration UI.
  • Surfacing in DataHub + Skipper is automatic — registering the table for governance + AI agent context is a side effect of a successful run, not a separate step.

Position in Town Lake

Transformer fans into the rest of the platform:

  • Output: writes Iceberg tables on R2 Data Catalog (cold/warm tier) via Trino INSERT/MERGE.
  • Metadata: DataHub gets schema + lineage + per-node .meta.json documentation.
  • Governance: new tables enter Skimmer's scan queue and become Lifeguard-allowlist-pending until reviewed (default-closed).
  • Agent context: Skipper's Layer 3 grounded context layer is populated automatically as a Transformer run side-effect.
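The "default-closed" governance behaviour in the list above can be pictured with a small Python sketch: a newly registered table is not queryable until someone approves it. The class and its methods are hypothetical and say nothing about how Skimmer or Lifeguard are actually implemented; in particular, keeping an existing approval when a table is registered again is an assumption made here for illustration.

```python
class Allowlist:
    """Hypothetical default-closed allowlist for newly created tables."""

    def __init__(self):
        self.state = {}  # table -> "pending" | "approved"

    def register(self, table):
        # New tables start as pending. setdefault leaves an already
        # approved table untouched (an assumption of this sketch).
        self.state.setdefault(table, "pending")

    def approve(self, table):
        if table not in self.state:
            raise KeyError(f"unknown table: {table}")
        self.state[table] = "approved"

    def can_query(self, table):
        # Default-closed: unknown and pending tables are both denied.
        return self.state.get(table) == "approved"
```

The design point is that access requires an explicit positive decision: absence from the list and "pending" are treated the same way.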

Caveats

The launch post discloses Transformer's architecture and goals but does not disclose:

  • Throughput / concurrency numbers (DAGs/minute, queries/minute).
  • Cost numbers per pipeline.
  • Failure-mode handling for partial DAG runs.
  • How idempotency is enforced for non-deterministic SQL (e.g., NOW()-based filters).
  • Schema-evolution propagation rules (when an upstream dim.X changes, what happens to downstream fct.Y).
  • The YAML frontmatter schema verbatim — the four named fields (target table, materialization mode, dependencies, schedule) are described but not formally specified.
