PATTERN Cited by 1 source

ELT on Workflows with DO state¶

A pattern for building an internal ELT (extract, load, transform) engine on top of Cloudflare's Developer Platform. Pipelines are YAML-frontmatter-defined SQL DAGs; execution runs on Workflows; per-DAG state lives in Durable Objects; definitions are stored in R2; run history goes in D1; SQL runs on the underlying analytics engine (Trino in the Town Lake instance); per-pipeline-node .meta.json documentation is emitted to DataHub on every successful run.

Cloudflare Transformer is the canonical wiki instance, from the 2026-05-28 Town Lake / Skipper launch post.

The four-substrate decomposition¶

Concern	Substrate	Why this substrate
Execution	Workflows	Durable, retryable orchestration with checkpointing — see concepts/durable-execution
Per-DAG state	Durable Objects	Strongly-consistent state machine per DAG; gates retries / idempotency / single-writer guarantees
Definitions	R2	Versioned, durable, cheap to store; YAML + SQL files
Run history	D1	Relational metadata for query/report on past runs
SQL execution	Trino (or analogous)	The actual transformation work runs on the analytics engine, not the orchestrator
Documentation emission	DataHub	Per-node `.meta.json` written on every successful run — substrate for concepts/code-as-context-for-data-agents

Workflows, Durable Objects, R2, and D1 are the four Cloudflare Developer Platform substrates the heading refers to; Trino and DataHub are the external systems the engine drives.

Why each substrate is the right choice¶

Workflows for execution¶

ELT pipelines are multi-step, durable, retryable — exactly the workload durable execution is designed for. Workflows give:

Step-level checkpointing — failures don't restart the whole DAG.
Native retry semantics with backoff.
Schedule-based and event-based triggers.
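
The checkpointing and retry behaviour listed above can be illustrated with a minimal sketch. This is not the Workflows API: `run_step`, `StepFailed`, and the in-memory `checkpoints` dict are hypothetical names used only to show the idea, and a real engine would persist checkpoints durably rather than in a Python dict.

```python
import time


class StepFailed(Exception):
    """Raised when a step exhausts its retry budget."""


def run_step(name, fn, checkpoints, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run one step with retries; skip it if a checkpoint already holds its result."""
    if name in checkpoints:
        return checkpoints[name]  # replay after a crash: do not redo finished work
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise StepFailed(f"{name} failed after {attempt} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        else:
            checkpoints[name] = result  # record the result before the next step
            return result


# Hypothetical two-step run; `checkpoints` stands in for durable storage.
checkpoints = {}
run_step("extract", lambda: "rows", checkpoints)
run_step("load", lambda: "loaded", checkpoints)
```

Because a completed step is served from its checkpoint, re-running the same sequence after a failure resumes at the first unfinished step instead of restarting the whole DAG.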

The alternative would be a custom orchestrator with its own retry and checkpoint logic, a substantial engineering effort that duplicates what Workflows already provides.

Durable Objects for per-DAG state¶

Each DAG run needs a single-writer state machine: which step is running, which steps have completed, which inputs are ready. Durable Objects give exactly that — one DO per DAG run, with strongly-consistent state, no sharding required. The alternative would be a centralised state DB with per-row locks — much more operational overhead.
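
As a rough illustration of the state such an object would hold, the sketch below models one DAG run as a small state machine. It is plain in-memory Python, not the Durable Objects API; the class name `DagRunState`, the status strings, and the assumption that every upstream is itself a node of the same DAG are all hypothetical.

```python
class DagRunState:
    """In-memory stand-in for the state held for one DAG run."""

    def __init__(self, dependencies):
        # dependencies maps each node to the upstream nodes it waits on.
        # Assumption: every upstream is itself a key of this mapping.
        self.dependencies = {node: set(ups) for node, ups in dependencies.items()}
        self.status = {node: "pending" for node in self.dependencies}

    def ready(self):
        """Pending nodes whose upstreams have all completed."""
        return sorted(
            node
            for node, ups in self.dependencies.items()
            if self.status[node] == "pending"
            and all(self.status[up] == "completed" for up in ups)
        )

    def start(self, node):
        if node not in self.ready():
            raise ValueError(f"{node} is not ready to run")
        self.status[node] = "running"

    def complete(self, node):
        if self.status[node] != "running":
            raise ValueError(f"{node} is not running")
        self.status[node] = "completed"

    def done(self):
        return all(s == "completed" for s in self.status.values())
```

Routing every transition through one object is what makes the single-writer property easy to reason about: a step cannot be started twice, and a downstream node only becomes ready once its upstreams are recorded as completed.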

R2 for definitions¶

Pipeline definitions are YAML + SQL files, stored together as a tree per pipeline. R2 is the obvious fit:

Cheap to store.
Versioned via R2 object versioning.
S3-compatible API for tooling.
No per-file metadata DB needed.

D1 for run history¶

Run history is relational metadata — runs, steps, outcomes, durations, errors. SQL queries over the history ("all failed runs in the last week", "average duration of step X") are natural. D1 is the right fit: relational, queryable, cheap.
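
The two example queries can be sketched against a hypothetical schema. D1 is built on SQLite, so the sketch uses Python's `sqlite3` module as a local stand-in; the `runs` and `steps` tables, their columns, and the helper names are assumptions for illustration, not the schema Transformer actually uses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE runs (
    run_id     TEXT PRIMARY KEY,
    pipeline   TEXT NOT NULL,
    status     TEXT NOT NULL,   -- 'succeeded' or 'failed'
    started_at TEXT NOT NULL    -- ISO 8601 UTC timestamp
);
CREATE TABLE steps (
    run_id      TEXT NOT NULL REFERENCES runs(run_id),
    step        TEXT NOT NULL,
    duration_ms INTEGER NOT NULL
);
"""


def failed_runs_since(conn, cutoff):
    """Run ids of failed runs started at or after `cutoff` (ISO 8601 string)."""
    rows = conn.execute(
        "SELECT run_id FROM runs "
        "WHERE status = 'failed' AND started_at >= ? ORDER BY started_at",
        (cutoff,),
    )
    return [row[0] for row in rows]


def average_step_duration_ms(conn, step):
    """Mean duration of a named step across all runs, or None if it never ran."""
    row = conn.execute(
        "SELECT AVG(duration_ms) FROM steps WHERE step = ?", (step,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Timestamps are stored as ISO 8601 strings in a single format, so a plain string comparison is enough for the "last week" filter.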

YAML frontmatter as the pipeline-author contract¶

"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule)."

Four named fields:

target_table — what the SQL produces.
materialization — full-replace / incremental / view / etc.
dependencies — upstream tables this node depends on.
schedule — cron-like trigger.
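
To make the contract concrete, here is a sketch of what one node file might look like and how its frontmatter could be separated from the SQL. The file layout (a `---` delimited block above the query), the table names, and the helper `split_frontmatter` are assumptions; the parser handles only flat `key: value` pairs and inline lists, and is not a real YAML parser.

```python
NODE_FILE = """
---
target_table: analytics.daily_requests
materialization: incremental
dependencies: [raw.http_requests]
schedule: "0 6 * * *"
---
SELECT date(ts) AS day, count(*) AS requests
FROM raw.http_requests
GROUP BY 1
"""

REQUIRED_FIELDS = ("target_table", "materialization", "dependencies", "schedule")


def split_frontmatter(text):
    """Split a node file into (frontmatter dict, SQL body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a frontmatter block")
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("frontmatter block is not closed")
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list such as [a, b]; an empty list is written [].
            value = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        else:
            value = value.strip('"')
        meta[key.strip()] = value
    missing = [field for field in REQUIRED_FIELDS if field not in meta]
    if missing:
        raise ValueError(f"missing frontmatter fields: {missing}")
    return meta, "\n".join(lines[end + 1:]).strip()


meta, sql = split_frontmatter(NODE_FILE)
```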

The DAG topology is inferred from the dependencies — pipeline authors don't draw a graph, they list each node's upstreams. Transformer compiles the DAG and runs it.
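
Inferring a run order from per-node upstream lists is a topological sort. The sketch below uses Python's standard `graphlib` module; the function name `execution_order` and the rule that upstreams not produced by the pipeline are treated as external sources are assumptions, since the source post does not describe how Transformer compiles the DAG.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes):
    """Order target tables so that each runs after its in-pipeline upstreams.

    `nodes` maps a target table to the list in its `dependencies` field.
    """
    # Upstreams that no node produces are external sources: nothing to schedule.
    graph = {
        target: {up for up in upstreams if up in nodes}
        for target, upstreams in nodes.items()
    }
    # static_order() raises CycleError if the dependencies are not acyclic.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-table pipeline; raw.events is an external source.
pipeline = {
    "mart.top_customers": ["staging.clean_events"],
    "staging.clean_events": ["raw.events"],
}
order = execution_order(pipeline)
```

A cycle in the declared dependencies surfaces as a `CycleError` at compile time, before any SQL runs.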

The `.meta.json` emission per node¶

The architecturally distinctive feature, repeatedly stressed in the source post:

"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run."

This makes the SQL itself the documentation source, which is the substrate of code as context for data agents. The structural property is that the documentation never drifts from the SQL, because it is regenerated on every successful run.
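
One way to picture documentation that is derived rather than hand-written is a function from the node definition to a JSON document. The source post does not give the `.meta.json` schema, so every field below, the `sql_sha256` fingerprint, and the name `build_node_meta` are hypothetical; the sketch also stops short of the DataHub call, whose API is not described in the source.

```python
import hashlib
import json


def build_node_meta(meta, sql):
    """Render a per-node metadata document purely from the node definition."""
    doc = {
        "target_table": meta["target_table"],
        "materialization": meta["materialization"],
        "dependencies": sorted(meta["dependencies"]),
        "schedule": meta["schedule"],
        # Fingerprint of the SQL that produced the table on this run.
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }
    return json.dumps(doc, indent=2, sort_keys=True)
```

Because the document is a pure function of the frontmatter and the SQL, regenerating it after each successful run keeps it in step with whatever the pipeline actually executed.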

Composes with default-closed governance¶

Tables produced by Transformer pipelines enter the Skimmer scan queue and stay pending on the Lifeguard allowlist until reviewed (default-closed). Transformer is the production path for the data, but the governance layer does not treat Transformer-produced tables specially: they go through the same review flow as imported tables.
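
The default-closed rule can be sketched as an allowlist in which a newly registered table is not queryable until someone approves it. This is an illustration of the policy only; the class `Allowlist` and its methods are hypothetical and do not reflect how Skimmer or Lifeguard are implemented.

```python
class Allowlist:
    """Default-closed access list: unknown and pending tables are both denied."""

    def __init__(self):
        self._status = {}

    def register(self, table):
        # New tables start as pending; re-registering never changes a decision.
        self._status.setdefault(table, "pending")

    def approve(self, table):
        if table not in self._status:
            raise KeyError(f"{table} was never registered for review")
        self._status[table] = "approved"

    def can_query(self, table):
        return self._status.get(table) == "approved"
```

The same `register` path serves pipeline-produced and imported tables alike, which mirrors the point that provenance does not shortcut review.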

Self-serve data engineering as the goal¶

The post names the broader product vision:

"The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."

The pattern's structural commitment is that YAML + SQL is the entire developer surface — no proprietary DSL, no custom orchestration UI, no separate metadata declaration step.

Sibling pattern in the CI domain¶

patterns/ci-pipeline-as-customer-authored-durable-workflow is the same architectural shape applied to CI pipelines — customer-authored YAML, durable Workflow execution, Durable Object state. The Cloudflare Developer Platform pattern of "customer-built pipeline as durable workflow with DO state" generalises across domains; Transformer is the data-engineering instance.

When this pattern fits¶

Internal data platforms built on Cloudflare's Developer Platform.
Workloads where SQL is the unit of computation: most ELT, much analytics, some operational data movement.
Workloads that need durable execution: long-running, multi-step, with intermediate state.
Workloads where governance and agent integration matter: automatic DataHub emission is the load-bearing feature for Skipper-like agents.

When this pattern doesn't fit¶

Sub-second latency: a Workflow + DO + R2 + Trino round-trip takes several hundred milliseconds at minimum, so the pattern is not a fit for real-time work.
Streaming: the pattern is purely batch-shaped; for streaming use Pipelines / Stream Workers / Kafka-backed sinks.
Non-SQL transformations: if the transformation needs Python or other arbitrary code, Workflows on its own is the better substrate, because this pattern's distinguishing feature (a YAML + SQL authoring surface) no longer applies.

Seen in¶

sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Transformer's four-substrate decomposition + YAML-frontmatter pipeline contract + .meta.json emission as the worked example.

systems/cloudflare-transformer-elt — canonical wiki instance.
systems/cloudflare-workflows — execution substrate.
systems/cloudflare-durable-objects — per-DAG state.
systems/cloudflare-d1 — run history.
systems/cloudflare-r2 — definitions storage.
systems/cloudflare-town-lake — the platform Transformer populates.
systems/datahub — metadata destination.
systems/trino — SQL execution engine.
concepts/code-as-context-for-data-agents — the architectural insight Transformer's .meta.json is the substrate of.
concepts/durable-execution — the Workflows-level concept this pattern depends on.
patterns/ci-pipeline-as-customer-authored-durable-workflow — sibling pattern at the CI domain.