Skip to content

CONCEPT Cited by 1 source

Layered grounded context for data agent

A design pattern for AI data agents (LLM agents that turn natural-language questions into SQL against a real data warehouse) in which multiple distinct layers of grounded context are made available to the model at retrieval time — not pre-stuffed into the prompt, but pulled on demand as the agent reasons about the user's question.

The architectural argument: an LLM given just a SQL prompt and a list of table names "can hallucinate joins, misuse columns, and confidently produce a number that is completely wrong." The fix is layered context — multiple distinct sources, each with different cost / freshness / specificity properties, combined to ground the model.

Canonical wiki instance: Skipper's five layers

From the Cloudflare Town Lake / Skipper launch post:

Layer Source Lifecycle Cost
1. Schema + usage metadata DataHub (every column, type, primary key, foreign key; tables historically joined together) Continuously updated Cheap (tool call)
2. Human annotations DataHub table descriptions, glossary terms, curated tags Owner-team-maintained Cheap
3. Code-derived knowledge Per-node .meta.json from Transformer pipeline Per-run emission, automatic Cheap (already in DataHub)
4. Curated data-model pages Hand-written markdown surfaced as MCP resources Occasional manual updates Cheap
5. Runtime introspection DESCRIBE table, SELECT DISTINCT col LIMIT 20, SELECT COUNT(*) issued live to Trino Per-query Expensive — used as safety net

The ordering is cheapest-first, expensive-last. Skipper "uses [runtime introspection] sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."

The structural argument for each layer

Layer 1 — schema + usage metadata

What the catalog already knows about the tables: column names, types, keys, FKs, historical join patterns from query logs. The historical-join-patterns piece is the load-bearing non-obvious affordance — "DataHub knows... which tables are commonly joined together based on historical query patterns" — it lets the agent learn join paths from observed behaviour, not just declared FKs.

Layer 2 — human annotations

The contract is "the team that owns the table writes the description" — distributed authority, owner-team-maintained. Glossary terms unify column names across tables (e.g., user_id across 50 tables maps to the same business concept). The curated tag is the structural separation between validated production tables and scratch space.

Layer 3 — code-derived knowledge

The architecturally distinctive layer. From the post:

"Some of the most valuable context is not in any catalog: it's in the SQL that produces the table. The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly."

Canonicalised at concepts/code-as-context-for-data-agentsthe SQL that produces a table carries semantic information no column description ever captures. The structural property: the emission is automatic per Transformer run, so the documentation stays in sync with the actual transformation logic for free.

Layer 4 — curated data-model pages

"We maintain a small set of 'data model' pages: short, human- written documents that describe how to think about billing, customers, accounts, and zones."

These are policy-and-orientation documents:

"Prefer tables tagged 'curated'. Avoid scratch_r2 and tables tagged 'internal'. Search with data model terms (e.g., 'billing product revenue') not natural language."

Surfaced as MCP resources the agent can pull when the question matches.

Layer 5 — runtime introspection (safety net)

"When everything else fails, Skipper can issue live queries to Trino: DESCRIBE table, SELECT DISTINCT col LIMIT 20, SELECT COUNT(*). It uses these sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."

The cost is the actual query execution; the value is ground truth for everything the catalog might be lying about — schema drift not yet propagated, value distribution that doesn't match documentation, row counts that contradict the table description.

Why layers and not a single big context

Three structural arguments:

  1. Cost asymmetry — runtime introspection is orders of magnitude more expensive than catalog lookups; the model should pay for it only when needed. Pre-stuffing everything into the prompt wastes tokens on context that's irrelevant to the current question.
  2. Freshness asymmetry — schema metadata is continuously updated; human annotations change weekly; data-model pages change rarely; runtime data is always fresh. Layering lets each source have its own update cadence.
  3. Specificity asymmetry — schema is generic, code-derived knowledge is table-specific, runtime introspection is query-specific. Different layers answer different questions.

Sibling concepts elsewhere in the wiki

The layered-context shape recurs across data agents:

The Cloudflare framing distinctively names all five layers explicitly and characterises each one's lifecycle / cost / specificity. It's the most articulated layered-context recipe in the corpus.

Seen in

Last updated · 542 distilled / 1,571 read