CONCEPT Cited by 1 source

Code as context for data agents

Code as context is the architectural insight that, for AI data agents (LLM agents that turn natural-language questions into SQL), the transformation code that produces a table carries semantic information that no column description captures. The ELT/SQL pipeline itself, not the data catalog, should therefore be the primary substrate for the agent's table-level grounded context.

This is one of the four explicit design lessons named in the Cloudflare Town Lake / Skipper launch post (alongside "less prompting is more", "tool overlap is poison", and "memory matters more than expected").

The canonical example

"Code, not metadata, captures meaning. The biggest accuracy wins came when we started ingesting the actual SQL that produces a table, not just its schema. A customer_type column with values contract, paygo, free looks identical in either context, but the SQL tells you that customer_type defaults to paygo when Salesforce data is missing. That kind of context never lives in column descriptions."

Two distinct columns can have identical schemas and identical descriptions but radically different operational semantics, depending on the SQL that produces them. The data-cleaning logic (if Salesforce data is missing, default to paygo) exists nowhere in the catalog metadata. Without code-as-context, an agent answering "how many paygo customers do we have" would count every record with missing Salesforce data as paygo and systematically overstate the total.
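The overcount can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the table names (raw_accounts, salesforce, dim_customers) and the rows are hypothetical, and only the defaulting rule is taken from the quote above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source tables; accounts 3 and 4 have no Salesforce record.
    CREATE TABLE raw_accounts (account_id INTEGER PRIMARY KEY);
    CREATE TABLE salesforce (account_id INTEGER, customer_type TEXT);
    INSERT INTO raw_accounts VALUES (1), (2), (3), (4);
    INSERT INTO salesforce VALUES (1, 'contract'), (2, 'paygo');

    -- The transformation: default to 'paygo' when Salesforce data is missing.
    CREATE TABLE dim_customers AS
    SELECT a.account_id,
           COALESCE(s.customer_type, 'paygo') AS customer_type
    FROM raw_accounts a
    LEFT JOIN salesforce s USING (account_id);
""")

# Query a schema-only agent would plausibly write.
naive_paygo = conn.execute(
    "SELECT COUNT(*) FROM dim_customers WHERE customer_type = 'paygo'"
).fetchone()[0]

# Query an agent can write once it has read the transformation SQL:
# count only accounts whose paygo status is confirmed by Salesforce.
confirmed_paygo = conn.execute("""
    SELECT COUNT(*)
    FROM dim_customers d
    JOIN salesforce s USING (account_id)
    WHERE d.customer_type = 'paygo'
""").fetchone()[0]

print(naive_paygo, confirmed_paygo)
```

Both queries are valid against the same schema; only the transformation SQL reveals that they answer different questions.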

Implementation: per-pipeline-node .meta.json

Cloudflare's Transformer ELT engine makes the pattern operational:

"The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run."

The structural property: documentation regeneration is automatic on every successful run. There is no separate documentation maintenance step; the metadata stays in sync with the actual transformation logic for free. This is the anti-staleness property that makes code-as-context tractable — without it, the documentation drifts from the code and the agent slowly degrades.

The launch post gives a second example that shows what the .meta.json exposes to the agent:

"When Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly."

What the .meta.json carries:

  • Upstream dependencies (dim.accounts, dim.customers, seed.product_classification).
  • Per-column derivation logic (alloc_amount = billed_amount / 12 for annual; billed_amount for monthly).
  • Implied by the architecture but not stated in the post: filter conditions, default values, deduplication rules, and any other semantic derivation that lives in the SQL.
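To make the list above concrete, here is a sketch of what such a document might look like and how it could be flattened into text for an agent. The post does not publish the real .meta.json schema, so every field name below (node, kind, upstream, columns) and the render_context helper are assumptions; only the table names and the alloc_amount derivation come from the quote.

```python
import json

# Hypothetical shape: the real .meta.json schema is not given in the source.
meta = {
    "node": "fct.billings_allocated",
    "kind": "pre-joined fact table",
    "upstream": ["dim.accounts", "dim.customers", "seed.product_classification"],
    "columns": {
        "alloc_amount": "billed_amount / 12 for annual; billed_amount for monthly",
    },
}

def render_context(meta: dict) -> str:
    """Flatten one node's metadata into plain text an agent can read."""
    lines = [
        f"{meta['node']} ({meta['kind']})",
        "built from: " + ", ".join(meta["upstream"]),
    ]
    for column, derivation in meta["columns"].items():
        lines.append(f"{column} = {derivation}")
    return "\n".join(lines)

print(json.dumps(meta, indent=2))
print(render_context(meta))
```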

Why this beats hand-written column descriptions

Three structural reasons:

  1. The SQL is the truth: humans write column descriptions based on what they think the SQL does, then the SQL evolves without the description being updated. Code-as-context regenerates from the SQL itself.
  2. The SQL captures conditional logic: derivations like "defaults to X when Y is null" are clauses in the SQL, not sentences a human typically types into a column description.
  3. The SQL captures cross-column relationships: "the alloc_amount on this row depends on billing_period from the same row" is the kind of nuance that lives in the SQL structure, not in per-column metadata.
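Points 2 and 3 can be seen in a single expression. The sketch below assumes the alloc_amount rule is written as a CASE on billing_period; the billings table, its rows, and the exact SQL are hypothetical, since the post describes the rule only in words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical input table with one annual and one monthly billing.
    CREATE TABLE billings (id INTEGER, billing_period TEXT, billed_amount REAL);
    INSERT INTO billings VALUES (1, 'annual', 1200.0), (2, 'monthly', 50.0);
""")

# Assumed form of the derivation: alloc_amount depends on billing_period
# from the same row, which no per-column description would normally record.
ALLOC_SQL = """
    SELECT id,
           CASE WHEN billing_period = 'annual'  THEN billed_amount / 12
                WHEN billing_period = 'monthly' THEN billed_amount
           END AS alloc_amount
    FROM billings
    ORDER BY id
"""

rows = conn.execute(ALLOC_SQL).fetchall()
print(rows)
```

A description such as "allocated amount" would fit both rows equally well; the CASE expression is what tells the agent the annual row has been divided by 12.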

Operational pattern: regenerate on every successful run

The architectural commitment is that documentation emission is a side effect of pipeline execution, not a separate workflow. This has two implications:

  • No drift — if the SQL changes, the next run regenerates the .meta.json fresh. The documentation can never be more than one run behind the code.
  • Failure mode: if the run fails, the docs aren't updated — but the table also isn't refreshed, so they remain consistent with the most recent successfully-emitted version of the table. This is the desired safety property.
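The two properties above follow from writing the documentation in the same code path as the table refresh. A minimal sketch, assuming a hypothetical run_node function and a local directory standing in for DataHub (the Transformer engine's real interface is not described in the source):

```python
import json
import tempfile
from pathlib import Path

def run_node(name, sql, execute, docs_dir):
    """Run one pipeline node; write its docs only if the run succeeds."""
    execute(sql)  # raises on failure, so the write below is skipped
    doc_path = Path(docs_dir) / f"{name}.meta.json"
    doc_path.write_text(json.dumps({"node": name, "sql": sql}))
    return doc_path

def failing_execute(sql):
    """Stand-in for a query engine call that fails."""
    raise RuntimeError("query failed")

docs_dir = tempfile.mkdtemp()
v1 = "SELECT 1 AS x"
v2 = "SELECT 2 AS x"

# First run succeeds: table and docs both reflect v1.
path = run_node("fct.example", v1, lambda sql: None, docs_dir)

# Second run fails: the table is not refreshed and the docs are not rewritten.
try:
    run_node("fct.example", v2, failing_execute, docs_dir)
except RuntimeError:
    pass

documented_sql = json.loads(path.read_text())["sql"]
print(documented_sql)
```

After the failed run the stored document still describes v1, which is also the SQL that produced the table currently in place.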

Position in Skipper's grounded context

Code-as-context is Layer 3 of Skipper's five-layer grounded context (see concepts/layered-grounded-context-for-data-agent):

Layer  Source
1      Schema + usage metadata (DataHub)
2      Human annotations (table descriptions, glossary, curated tag)
3      Code-derived knowledge (Transformer .meta.json)
4      Curated data-model pages (MCP resources)
5      Runtime introspection (DESCRIBE / DISTINCT / COUNT to Trino)

The launch post's framing is that Layer 3 "separates a correct answer from a confidently wrong one", which makes it the layer credited with the biggest accuracy wins.

Sibling pattern: SQL is the API contract

The implicit broader claim is that the SQL pipeline that materialises a table is the most truthful representation of what the table contains. This generalises beyond data agents — any downstream consumer (BI dashboard, ML feature engineer, audit reviewer) benefits from being able to read the production SQL, not just the schema.
