PATTERN Cited by 1 source

ELT on Workflows with DO state

A pattern for building an internal ELT (extract, load, transform) engine on top of Cloudflare's Developer Platform. Pipelines are YAML-frontmatter-defined SQL DAGs; execution runs on Workflows; per-DAG state lives in Durable Objects; definitions are stored in R2; run history goes in D1; SQL runs on the underlying analytics engine (Trino in the Town Lake instance); per-pipeline-node .meta.json documentation is emitted to DataHub on every successful run.

Cloudflare Transformer is the canonical wiki instance, from the 2026-05-28 Town Lake / Skipper launch post.

The four-substrate decomposition

Four of the six rows below are Cloudflare Developer Platform products (Workflows, Durable Objects, R2, D1); Trino and DataHub sit outside the platform as the SQL engine and the documentation sink.

Concern | Substrate | Why this substrate
Execution | Workflows | Durable, retryable orchestration with checkpointing (see concepts/durable-execution)
Per-DAG state | Durable Objects | Strongly consistent state machine per DAG; gates retries, idempotency, and single-writer guarantees
Definitions | R2 | Versioned, durable, cheap to store; YAML + SQL files
Run history | D1 | Relational metadata for querying and reporting on past runs
SQL execution | Trino (or analogous) | The actual transformation work runs on the analytics engine, not the orchestrator
Documentation emission | DataHub | Per-node .meta.json written on every successful run; the substrate for concepts/code-as-context-for-data-agents

Why each substrate is the right choice

Workflows for execution

ELT pipelines are multi-step, durable, retryable — exactly the workload durable execution is designed for. Workflows give:

  • Step-level checkpointing — failures don't restart the whole DAG.
  • Native retry semantics with backoff.
  • Schedule-based and event-based triggers.

The alternative would be a custom orchestrator with its own retry and checkpoint logic, a substantial engineering effort that would duplicate what Workflows already provides.
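To make step-level checkpointing and retry with backoff concrete, here is a minimal sketch in plain Python. It is not the Workflows API: `run_step`, the in-memory `checkpoints` dict, and the example step names are all hypothetical stand-ins for what the platform persists on the pipeline's behalf.

```python
import time
from typing import Any, Callable, Dict


def run_step(checkpoints: Dict[str, Any], name: str, fn: Callable[[], Any],
             retries: int = 3, base_delay: float = 0.0) -> Any:
    """Run fn at most once per step name; replay the stored result afterwards."""
    if name in checkpoints:              # step already completed: skip re-execution
        return checkpoints[name]
    for attempt in range(retries):
        try:
            result = fn()
            checkpoints[name] = result   # record the result before moving on
            return result
        except Exception:
            if attempt == retries - 1:   # out of attempts: surface the failure
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff


# Hypothetical steps; the counters show how often each body really runs.
calls = {"extract": 0, "load": 0}


def extract():
    calls["extract"] += 1
    return ["row1", "row2"]


def flaky_load():
    calls["load"] += 1
    if calls["load"] == 1:
        raise RuntimeError("transient failure")
    return "loaded"


def run_dag(checkpoints: Dict[str, Any]):
    rows = run_step(checkpoints, "extract", extract)
    status = run_step(checkpoints, "load", flaky_load)
    return rows, status


store: Dict[str, Any] = {}
first = run_dag(store)    # "load" fails once, then succeeds on retry
second = run_dag(store)   # replay: both steps come from checkpoints
```

In this sketch the second `run_dag` call executes neither step body again, which is the property that keeps a failure late in a DAG from restarting the work before it.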

Durable Objects for per-DAG state

Each DAG run needs a single-writer state machine: which step is running, which steps have completed, which inputs are ready. Durable Objects give exactly that — one DO per DAG run, with strongly-consistent state, no sharding required. The alternative would be a centralised state DB with per-row locks — much more operational overhead.
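The state a single DAG run has to track can be sketched as a small in-memory state machine. This is an illustration only: the class name `DagRunState`, its methods, and the node names are assumptions, and a real Durable Object would hold the same fields in its own storage and serialise access to them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DagRunState:
    """Single-writer view of one DAG run: what is running, done, and ready."""
    deps: Dict[str, List[str]]                      # node -> upstream nodes
    running: Set[str] = field(default_factory=set)
    completed: Set[str] = field(default_factory=set)

    def ready(self) -> List[str]:
        """Nodes not yet started whose upstreams have all completed."""
        return sorted(
            node for node, upstreams in self.deps.items()
            if node not in self.running
            and node not in self.completed
            and all(up in self.completed for up in upstreams)
        )

    def start(self, node: str) -> bool:
        """Claim a node; returns False for duplicates or unmet inputs."""
        if node not in self.ready():
            return False
        self.running.add(node)
        return True

    def complete(self, node: str) -> None:
        if node not in self.running:
            raise ValueError(f"{node} is not running")
        self.running.remove(node)
        self.completed.add(node)


state = DagRunState(deps={
    "raw_events": [],
    "sessions": ["raw_events"],
    "daily_rollup": ["sessions"],
})
initial_ready = state.ready()
first_start = state.start("raw_events")        # claims the node
duplicate_start = state.start("raw_events")    # rejected: already running
early_start = state.start("daily_rollup")      # rejected: inputs not ready
state.complete("raw_events")
next_ready = state.ready()
```

Because every transition goes through one object, a retried or duplicated "start" request is simply refused rather than running a step twice.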

R2 for definitions

Pipeline definitions are YAML + SQL files, stored together as a tree per pipeline. R2 is the obvious fit:

  • Cheap to store.
  • Versioned via R2 object versioning.
  • S3-compatible API for tooling.
  • No per-file metadata DB needed.

D1 for run history

Run history is relational metadata — runs, steps, outcomes, durations, errors. SQL queries over the history ("all failed runs in the last week", "average duration of step X") are natural. D1 is the right fit: relational, queryable, cheap.
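The two example queries can be sketched against a hypothetical schema. D1 is built on SQLite, so Python's `sqlite3` module stands in for it here; the table names, columns, sample rows, and the fixed reference timestamp are assumptions for illustration, not the schema Transformer uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    run_id     TEXT PRIMARY KEY,
    pipeline   TEXT NOT NULL,
    status     TEXT NOT NULL,      -- 'succeeded' or 'failed'
    started_at TEXT NOT NULL       -- 'YYYY-MM-DD HH:MM:SS' (UTC)
);
CREATE TABLE steps (
    run_id     TEXT NOT NULL REFERENCES runs(run_id),
    step       TEXT NOT NULL,
    status     TEXT NOT NULL,
    duration_s REAL NOT NULL,
    error      TEXT
);
""")
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", [
    ("r1", "sessions", "failed",    "2026-05-10 03:00:00"),
    ("r2", "sessions", "failed",    "2026-05-26 03:00:00"),
    ("r3", "sessions", "succeeded", "2026-05-27 03:00:00"),
])
conn.executemany("INSERT INTO steps VALUES (?, ?, ?, ?, ?)", [
    ("r1", "build_sessions", "failed",    10.0, "timeout"),
    ("r2", "build_sessions", "failed",    20.0, "timeout"),
    ("r3", "build_sessions", "succeeded", 30.0, None),
])

# A fixed "now" keeps the example deterministic.
now = "2026-05-28 00:00:00"

# "All failed runs in the last week"
failed_last_week = [row[0] for row in conn.execute(
    "SELECT run_id FROM runs "
    "WHERE status = 'failed' AND started_at >= datetime(?, '-7 days') "
    "ORDER BY started_at",
    (now,),
)]

# "Average duration of step X"
avg_duration = conn.execute(
    "SELECT AVG(duration_s) FROM steps WHERE step = ?",
    ("build_sessions",),
).fetchone()[0]
```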

YAML frontmatter as the pipeline-author contract

"Users define a Directed Acyclic Graph (DAG) of SQL transformations with YAML frontmatter (target table, materialization mode, dependencies, schedule)."

Four named fields:

  • target_table — what the SQL produces.
  • materialization — full-replace / incremental / view / etc.
  • dependencies — upstream tables this node depends on.
  • schedule — cron-like trigger.
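As an illustration of what one node file might look like and how its four fields could be read, here is a minimal sketch. The `---` delimiters, the exact key spellings, the table names, and the `split_frontmatter` helper are assumptions; the source names the fields but does not show a file. The parser handles only flat `key: value` pairs and inline `[a, b]` lists, and is not a general YAML parser.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, List[str]]

# Hypothetical node file: frontmatter block followed by the SQL body.
NODE_FILE = """---
target_table: analytics.daily_sessions
materialization: incremental
dependencies: [analytics.sessions, analytics.accounts]
schedule: "0 3 * * *"
---
SELECT account_id, COUNT(*) AS sessions
FROM analytics.sessions
GROUP BY account_id
"""


def split_frontmatter(text: str) -> Tuple[Dict[str, Value], str]:
    """Split a node file into (frontmatter fields, SQL body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter delimiter")
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("missing closing frontmatter delimiter")

    meta: Dict[str, Value] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, raw = line.partition(":")   # split on the first colon only
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            meta[key.strip()] = [v.strip() for v in raw[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = raw.strip('"')
    sql = "\n".join(lines[end + 1:]).strip()
    return meta, sql


meta, sql = split_frontmatter(NODE_FILE)
```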

The DAG topology is inferred from the dependencies — pipeline authors don't draw a graph, they list each node's upstreams. Transformer compiles the DAG and runs it.
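Inferring the topology from per-node dependency lists amounts to a topological sort. The sketch below uses Python's standard `graphlib` module; the `execution_order` helper and the table names are hypothetical, and treating upstreams that no node produces as already-available external tables is an assumption rather than documented Transformer behaviour.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_order(nodes: Dict[str, List[str]]) -> List[str]:
    """Order target tables so every node runs after the nodes it depends on.

    `nodes` maps each target table to its declared upstream tables.
    """
    produced = set(nodes)
    # Keep only edges between nodes this pipeline produces; other upstreams
    # are assumed to exist already (for example, imported source tables).
    graph = {
        target: [dep for dep in deps if dep in produced]
        for target, deps in nodes.items()
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


order = execution_order({
    "analytics.daily_sessions": ["analytics.sessions", "staging.events"],
    "analytics.sessions": ["staging.events"],
    "staging.events": ["raw.events"],   # raw.events is external
})
```

Rejecting cycles at compile time is what lets authors list only upstreams and still get a graph that is guaranteed to be acyclic.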

The .meta.json emission per node

The architecturally distinctive feature, repeatedly stressed in the source post:

"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run."

This makes the SQL itself the documentation source, which is the substrate of code as context for data agents. The structural property is that the documentation never drifts from the SQL, because it is regenerated on every successful run.
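The no-drift property follows from deriving the document from the node itself on each run. A minimal sketch, assuming a hypothetical `.meta.json` shape: the source does not publish the real schema, so every field name here, including the SQL hash, is illustrative, and the upload to DataHub is deliberately left out.

```python
import hashlib
import json
from typing import Dict, List, Union

Value = Union[str, List[str]]


def build_meta_json(meta: Dict[str, Value], sql: str) -> str:
    """Derive a per-node documentation payload purely from the node itself."""
    doc = {
        "target_table": meta["target_table"],
        "materialization": meta["materialization"],
        "dependencies": sorted(meta["dependencies"]),
        "schedule": meta["schedule"],
        # Fingerprint of the SQL that produced the table (hypothetical field).
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }
    return json.dumps(doc, indent=2, sort_keys=True)


node: Dict[str, Value] = {
    "target_table": "analytics.daily_sessions",
    "materialization": "incremental",
    "dependencies": ["analytics.sessions", "analytics.accounts"],
    "schedule": "0 3 * * *",
}
sql = "SELECT account_id, COUNT(*) AS sessions FROM analytics.sessions GROUP BY account_id"

meta_json = build_meta_json(node, sql)
```

Since the payload is a pure function of the frontmatter and the SQL, regenerating it after each successful run yields documentation that changes exactly when the node changes.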

Composes with default-closed governance

Tables produced by Transformer pipelines enter the Skimmer scan queue and remain Lifeguard-allowlist-pending until reviewed (default-closed). Transformer is the production path for the data, but the governance layer does not treat Transformer-produced tables specially: they go through the same review flow as imported tables.
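The default-closed rule can be sketched as a queue plus an allowlist. Skimmer and Lifeguard are the names used in the source, but the `Governance` class, its methods, and the table names below are hypothetical and only model the behaviour described: a new table is not queryable until someone approves it, whatever its origin.

```python
from typing import List, Set


class Governance:
    """Toy model of a scan queue feeding a default-closed allowlist."""

    def __init__(self) -> None:
        self.scan_queue: List[str] = []   # tables awaiting review
        self.allowlist: Set[str] = set()  # tables approved for querying

    def register(self, table: str) -> None:
        """Called for every new table, regardless of how it was produced."""
        if table not in self.allowlist and table not in self.scan_queue:
            self.scan_queue.append(table)

    def approve(self, table: str) -> None:
        if table not in self.scan_queue:
            raise KeyError(f"{table} is not pending review")
        self.scan_queue.remove(table)
        self.allowlist.add(table)

    def can_query(self, table: str) -> bool:
        # Default-closed: anything not explicitly approved is denied.
        return table in self.allowlist


gov = Governance()
gov.register("analytics.daily_sessions")   # produced by a pipeline
gov.register("imports.billing_export")     # imported table, same flow
before = gov.can_query("analytics.daily_sessions")
gov.approve("analytics.daily_sessions")
after = gov.can_query("analytics.daily_sessions")
```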

Self-serve data engineering as the goal

The post names the broader product vision:

"The goal is for any team at Cloudflare to be able to build a curated dataset with a few SQL files and a .meta.json description, deploy it as a Workflow, get it scheduled and monitored automatically, and have it surface in DataHub and Skipper without any additional work. The idea is self-serve data engineering, with the same shape as self-serve software engineering."

The pattern's structural commitment is that YAML + SQL is the entire developer surface — no proprietary DSL, no custom orchestration UI, no separate metadata declaration step.

Sibling pattern in the CI domain

patterns/ci-pipeline-as-customer-authored-durable-workflow is the same architectural shape applied to CI pipelines — customer-authored YAML, durable Workflow execution, Durable Object state. The Cloudflare Developer Platform pattern of "customer-built pipeline as durable workflow with DO state" generalises across domains; Transformer is the data-engineering instance.

When this pattern fits

  • Internal data platforms built on Cloudflare's Developer Platform.
  • Workflows where SQL is the unit of computation — most ELT, much analytics, some operational data movement.
  • Workflows that need durable execution — long-running, multi-step, with intermediate state.
  • Workflows where the governance + agent integration matters — automatic DataHub emission is the load-bearing feature for Skipper-like agents.

When this pattern doesn't fit

  • Sub-second latency: a Workflow + DO + R2 + Trino round trip takes several hundred milliseconds at minimum, so the pattern is not a fit for real-time use.
  • Streaming: the pattern is purely batch-shaped; for streaming use Pipelines / Stream Workers / Kafka-backed sinks.
  • Non-SQL transformations: if the transformation needs Python or arbitrary code, Workflows alone is the right substrate; this pattern's distinguishing feature (YAML + SQL) doesn't apply.
