PATTERN Cited by 1 source
ELT on Workflows with DO state¶
A pattern for building an internal ELT (extract, load,
transform) engine on top of Cloudflare's
Developer Platform. Pipelines are YAML-frontmatter-defined SQL
DAGs; execution runs on
Workflows; per-DAG state lives in
Durable Objects; definitions are stored in
R2; run history goes in D1; SQL runs
on the underlying analytics engine (Trino in
the Town Lake instance); per-pipeline-node .meta.json
documentation is emitted to DataHub on every
successful run.
Cloudflare Transformer is the canonical wiki instance, from the 2026-05-28 Town Lake / Skipper launch post.
The four-substrate decomposition¶
The first four rows are the Developer Platform substrates (Workflows, Durable Objects, R2, D1); the last two rows are the external systems the pattern hands work off to.
| Concern | Substrate | Why this substrate |
|---|---|---|
| Execution | Workflows | Durable, retryable orchestration with checkpointing — see concepts/durable-execution |
| Per-DAG state | Durable Objects | Strongly-consistent state machine per DAG; gates retries / idempotency / single-writer guarantees |
| Definitions | R2 | Versioned, durable, cheap to store; YAML + SQL files |
| Run history | D1 | Relational metadata for query/report on past runs |
| SQL execution | Trino (or analogous) | The actual transformation work runs on the analytics engine, not the orchestrator |
| Documentation emission | DataHub | Per-node .meta.json written on every successful run — substrate for concepts/code-as-context-for-data-agents |
Why each substrate is the right choice¶
Workflows for execution¶
ELT pipelines are multi-step, durable, retryable — exactly the workload durable execution is designed for. Workflows give:
- Step-level checkpointing — failures don't restart the whole DAG.
- Native retry semantics with backoff.
- Schedule-based and event-based triggers.
The alternative would be a custom orchestrator with its own retry and checkpoint logic, a substantial engineering effort that Workflows already covers.
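The checkpoint-plus-retry behaviour described above can be illustrated with a small in-memory model. This is a sketch of the idea only, not the Workflows API: `CheckpointedRun`, its `do` method, and the `retries` / `base_delay` parameters are hypothetical names, and a real engine would persist checkpoints durably rather than in a dict.

```python
import time
from typing import Any, Callable, Dict, Optional


class CheckpointedRun:
    """Toy model of a durable run: each named step's result is cached after success."""

    def __init__(self, checkpoints: Optional[Dict[str, Any]] = None) -> None:
        # Assumption: a real engine persists this map durably; here it is in memory.
        self.checkpoints: Dict[str, Any] = {} if checkpoints is None else checkpoints

    def do(self, name: str, fn: Callable[[], Any], retries: int = 3,
           base_delay: float = 0.0) -> Any:
        """Run fn once per step name, retrying failures with exponential backoff."""
        if name in self.checkpoints:
            # Step already completed earlier: return the saved result, do not re-run.
            return self.checkpoints[name]
        attempt = 0
        while True:
            try:
                result = fn()
            except Exception:
                attempt += 1
                if attempt > retries:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))
                continue
            self.checkpoints[name] = result
            return result


calls = {"extract": 0, "transform": 0}


def extract() -> list:
    calls["extract"] += 1
    return ["row-1", "row-2"]


def flaky_transform() -> str:
    calls["transform"] += 1
    if calls["transform"] < 3:
        raise RuntimeError("transient engine error")
    return "ok"


run = CheckpointedRun()
run.do("extract", extract)
run.do("transform", flaky_transform, retries=3)

# Replaying against the same checkpoints skips every completed step.
replay = CheckpointedRun(run.checkpoints)
replay.do("extract", extract)
replay.do("transform", flaky_transform)
```

In this model the flaky step is attempted three times in the first run and never again on replay, which is the property that keeps a failure late in a DAG from restarting the earlier steps.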
Durable Objects for per-DAG state¶
Each DAG run needs a single-writer state machine: which step is running, which steps have completed, which inputs are ready. Durable Objects give exactly that — one DO per DAG run, with strongly-consistent state, no sharding required. The alternative would be a centralised state DB with per-row locks — much more operational overhead.
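A minimal sketch of such a per-run state machine follows. It models only the bookkeeping (which steps are pending, running, or done, and which are ready), under the assumption that a Durable Object would hold one instance and serialize all calls to it. `DagRunState`, `StepState`, and the node names are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class DagRunState:
    """Hypothetical state for one DAG run; assumes a single writer calls these methods."""

    def __init__(self, upstreams: Dict[str, List[str]]) -> None:
        # upstreams maps each node to the nodes it depends on (all must be keys).
        self.upstreams = upstreams
        self.states: Dict[str, StepState] = {n: StepState.PENDING for n in upstreams}

    def ready(self) -> List[str]:
        """Pending nodes whose upstreams have all completed."""
        return sorted(
            node for node, state in self.states.items()
            if state is StepState.PENDING
            and all(self.states[u] is StepState.DONE for u in self.upstreams[node])
        )

    def start(self, node: str) -> bool:
        """Claim a step; returns False when it is not claimable, so duplicate claims are harmless."""
        if node not in self.ready():
            return False
        self.states[node] = StepState.RUNNING
        return True

    def finish(self, node: str, ok: bool) -> None:
        if self.states[node] is not StepState.RUNNING:
            raise ValueError(f"{node} is not running")
        self.states[node] = StepState.DONE if ok else StepState.FAILED


state = DagRunState({
    "stg_orders": [],
    "stg_users": [],
    "orders_by_user": ["stg_orders", "stg_users"],
})
first_claim = state.start("stg_orders")
duplicate_claim = state.start("stg_orders")   # a retried claim is rejected
early_claim = state.start("orders_by_user")   # inputs are not ready yet
state.finish("stg_orders", ok=True)
state.start("stg_users")
state.finish("stg_users", ok=True)
ready_after_inputs = state.ready()
```

Because every transition goes through one object, no cross-row locking is needed to decide whether a step may start.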
R2 for definitions¶
Pipeline definitions are YAML + SQL files, stored together as a tree per pipeline. R2 is the obvious fit:
- Cheap to store.
- Versionable, e.g. by writing each revision under its own key prefix; the source post does not say how definitions are versioned.
- S3-compatible API for tooling.
- No per-file metadata DB needed.
D1 for run history¶
Run history is relational metadata — runs, steps, outcomes, durations, errors. SQL queries over the history ("all failed runs in the last week", "average duration of step X") are natural. D1 is the right fit: relational, queryable, cheap.
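The two example queries can be sketched against an in-memory SQLite database, which stands in for D1 here. The `runs` / `steps` schema, column names, and sample rows are assumptions made for illustration; the source post does not describe Transformer's actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    run_id     TEXT PRIMARY KEY,
    pipeline   TEXT NOT NULL,
    status     TEXT NOT NULL,
    started_at TEXT NOT NULL
);
CREATE TABLE steps (
    run_id      TEXT NOT NULL REFERENCES runs(run_id),
    step        TEXT NOT NULL,
    status      TEXT NOT NULL,
    duration_ms INTEGER NOT NULL,
    error       TEXT,
    PRIMARY KEY (run_id, step)
);
""")

# Sample rows; started_at is expressed relative to the current time.
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, datetime('now', ?))",
    [
        ("r1", "orders", "succeeded", "-1 days"),
        ("r2", "orders", "failed", "-2 days"),
        ("r3", "orders", "failed", "-30 days"),
    ],
)
conn.executemany(
    "INSERT INTO steps VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "stg_orders", "succeeded", 1200, None),
        ("r2", "stg_orders", "succeeded", 1800, None),
        ("r2", "orders_by_user", "failed", 300, "timeout"),
    ],
)

# "All failed runs in the last week"
failed_last_week = [
    row[0]
    for row in conn.execute(
        "SELECT run_id FROM runs "
        "WHERE status = 'failed' AND started_at >= datetime('now', '-7 days') "
        "ORDER BY run_id"
    )
]

# "Average duration of step X"
avg_stg_orders_ms = conn.execute(
    "SELECT AVG(duration_ms) FROM steps WHERE step = ?", ("stg_orders",)
).fetchone()[0]
```

Both questions reduce to a single `SELECT`, which is the argument for keeping run history in a relational store rather than in the orchestrator's own state.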
YAML frontmatter as the pipeline-author contract¶
"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule)."
Four named fields:
- target_table — what the SQL produces.
- materialization — full-replace / incremental / view / etc.
- dependencies — upstream tables this node depends on.
- schedule — cron-like trigger.
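To make the contract concrete, here is a sketch of what one node file might look like and how its frontmatter could be separated from the SQL. The `---` delimiters, the exact key spellings, the table names, and the tiny hand-rolled parser are all assumptions; the source only lists the four fields, and a real implementation would likely use a full YAML library.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, List[str]]

# Hypothetical node file: YAML frontmatter followed by the SQL body.
NODE_FILE = """\
---
target_table: analytics.orders_by_user
materialization: incremental
dependencies: [staging.orders, staging.users]
schedule: "0 * * * *"
---
SELECT u.user_id, COUNT(o.order_id) AS order_count
FROM staging.users AS u
LEFT JOIN staging.orders AS o ON o.user_id = u.user_id
GROUP BY u.user_id
"""


def split_frontmatter(text: str) -> Tuple[Dict[str, Value], str]:
    """Split a node file into (frontmatter, sql).

    Handles only flat `key: value` pairs and inline `[a, b]` lists,
    which is enough for the four fields named above.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("missing closing '---'") from None

    meta: Dict[str, Value] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            meta[key.strip()] = [
                item.strip() for item in raw[1:-1].split(",") if item.strip()
            ]
        else:
            meta[key.strip()] = raw.strip("\"'")
    sql = "\n".join(lines[end + 1:]).strip()
    return meta, sql


meta, sql = split_frontmatter(NODE_FILE)
```

Everything the engine needs to place the node in a DAG and schedule it is in `meta`; the rest of the file is plain SQL.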
The DAG topology is inferred from the dependencies — pipeline authors don't draw a graph, they list each node's upstreams. Transformer compiles the DAG and runs it.
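Inferring the topology from per-node upstream lists is a topological sort. The sketch below uses Python's standard-library `graphlib` to group nodes into batches that could run in parallel; the node names and the `execution_batches` helper are hypothetical, and the source does not say how Transformer compiles its DAG.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_batches(dependencies: Dict[str, List[str]]) -> List[List[str]]:
    """Group nodes into batches; every node's upstreams are in an earlier batch.

    Raises graphlib.CycleError if the declared dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorter.get_ready()
        batches.append(sorted(ready))
        sorter.done(*ready)
    return batches


# Each node lists only its own upstreams, as in the frontmatter.
pipeline = {
    "staging.orders": [],
    "staging.users": [],
    "analytics.orders_by_user": ["staging.orders", "staging.users"],
    "analytics.top_users": ["analytics.orders_by_user"],
}
batches = execution_batches(pipeline)

try:
    execution_batches({"a": ["b"], "b": ["a"]})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

A side benefit of compiling the graph this way is that a cyclic definition is rejected before any SQL runs.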
The .meta.json emission per node¶
The architecturally distinctive feature, repeatedly stressed in the source post:
"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run."
This makes the SQL itself the documentation source, which is the substrate of code as context for data agents. The structural property is that the documentation never drifts from the SQL, because it is regenerated on every successful run.
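The no-drift property can be sketched as a function that derives the document from the node definition and is called only on success. The field names in the document, the `sql_sha256` fingerprint, and the `emit` callback are assumptions for illustration; the source does not specify the `.meta.json` schema or how it is sent to DataHub.

```python
import hashlib
import json
from typing import Callable, Dict, List, Union

Frontmatter = Dict[str, Union[str, List[str]]]


def build_meta(frontmatter: Frontmatter, sql: str) -> str:
    """Derive a hypothetical .meta.json body purely from the node definition."""
    doc = {
        "target_table": frontmatter["target_table"],
        "materialization": frontmatter["materialization"],
        "dependencies": sorted(frontmatter["dependencies"]),
        "schedule": frontmatter["schedule"],
        # Fingerprint of the SQL, so any change to the query changes the document.
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }
    return json.dumps(doc, indent=2, sort_keys=True)


def on_run_finished(succeeded: bool, frontmatter: Frontmatter, sql: str,
                    emit: Callable[[str], None]) -> bool:
    """Emit documentation only for successful runs; returns whether it emitted."""
    if not succeeded:
        return False
    emit(build_meta(frontmatter, sql))
    return True


node = {
    "target_table": "analytics.orders_by_user",
    "materialization": "incremental",
    "dependencies": ["staging.users", "staging.orders"],
    "schedule": "0 * * * *",
}
emitted: List[str] = []  # stands in for the metadata destination
on_run_finished(False, node, "SELECT 1", emitted.append)
on_run_finished(True, node, "SELECT 1", emitted.append)
```

Since the document is a pure function of the definition, there is no separately maintained description that could fall out of date.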
Composes with default-closed governance¶
Tables produced by Transformer pipelines enter the Skimmer scan queue and stay Lifeguard-allowlist-pending until reviewed (default-closed). Transformer is the production path for the data, but the governance layer doesn't treat Transformer-produced tables specially: they go through the same review flow as imported tables.
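Default-closed means a table is unreadable until someone approves it, regardless of where it came from. The sketch below models just that rule; `DefaultClosedAllowlist` and its methods are hypothetical and do not describe the actual Skimmer or Lifeguard interfaces.

```python
from typing import List, Set


class DefaultClosedAllowlist:
    """Hypothetical model: new tables are pending, and only approved tables are queryable."""

    def __init__(self) -> None:
        self._approved: Set[str] = set()
        self._pending: List[str] = []

    def register(self, table: str) -> None:
        """Queue a newly produced or imported table for review (idempotent)."""
        if table not in self._approved and table not in self._pending:
            self._pending.append(table)

    def approve(self, table: str) -> None:
        if table not in self._pending:
            raise KeyError(f"{table} is not pending review")
        self._pending.remove(table)
        self._approved.add(table)

    def can_query(self, table: str) -> bool:
        # Unknown and pending tables are both denied: closed unless approved.
        return table in self._approved

    def pending(self) -> List[str]:
        return list(self._pending)


allowlist = DefaultClosedAllowlist()
allowlist.register("analytics.orders_by_user")   # produced by a pipeline
allowlist.register("imports.vendor_feed")        # imported table, same flow
before_review = allowlist.can_query("analytics.orders_by_user")
allowlist.approve("analytics.orders_by_user")
after_review = allowlist.can_query("analytics.orders_by_user")
```

Note that `register` takes no argument describing the table's origin, which mirrors the point that pipeline output gets no special treatment.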
Self-serve data engineering as the goal¶
The post names the broader product vision:
"The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."
The pattern's structural commitment is that YAML + SQL is the entire developer surface — no proprietary DSL, no custom orchestration UI, no separate metadata declaration step.
Sibling pattern in the CI domain¶
patterns/ci-pipeline-as-customer-authored-durable-workflow is the same architectural shape applied to CI pipelines — customer-authored YAML, durable Workflow execution, Durable Object state. The Cloudflare Developer Platform pattern of "customer-built pipeline as durable workflow with DO state" generalises across domains; Transformer is the data-engineering instance.
When this pattern fits¶
- Internal data platforms built on Cloudflare's Developer Platform.
- Workflows where SQL is the unit of computation — most ELT, much analytics, some operational data movement.
- Workflows that need durable execution — long-running, multi-step, with intermediate state.
- Workflows where the governance + agent integration matters — automatic DataHub emission is the load-bearing feature for Skipper-like agents.
When this pattern doesn't fit¶
- Sub-second latency — a Workflow + DO + R2 + Trino round trip takes several hundred milliseconds at minimum, so the pattern is not a fit for real-time use.
- Streaming — pure batch shape; for streaming use Pipelines / Stream Workers / Kafka-backed sinks.
- Non-SQL transformations — if the transformation needs Python / arbitrary code, Workflows alone is the right substrate; this pattern's distinguishing feature (YAML + SQL) doesn't apply.
Seen in¶
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Transformer's four-substrate decomposition + YAML-frontmatter pipeline contract + .meta.json emission as the worked example.
Related¶
- systems/cloudflare-transformer-elt — canonical wiki instance.
- systems/cloudflare-workflows — execution substrate.
- systems/cloudflare-durable-objects — per-DAG state.
- systems/cloudflare-d1 — run history.
- systems/cloudflare-r2 — definitions storage.
- systems/cloudflare-town-lake — the platform Transformer populates.
- systems/datahub — metadata destination.
- systems/trino — SQL execution engine.
- concepts/code-as-context-for-data-agents — the architectural insight Transformer's .meta.json is the substrate of.
- concepts/durable-execution — the Workflows-level concept this pattern depends on.
- patterns/ci-pipeline-as-customer-authored-durable-workflow — sibling pattern in the CI domain.