CONCEPT Cited by 1 source

Layered grounded context for data agent¶

A design pattern for AI data agents (LLM agents that turn natural-language questions into SQL against a real data warehouse) in which multiple distinct layers of grounded context are made available to the model at retrieval time — not pre-stuffed into the prompt, but pulled on demand as the agent reasons about the user's question.

The architectural argument: an LLM given just a SQL prompt and a list of table names "can hallucinate joins, misuse columns, and confidently produce a number that is completely wrong." The fix is layered context — multiple distinct sources, each with different cost / freshness / specificity properties, combined to ground the model.

Canonical wiki instance: Skipper's five layers¶

From the Cloudflare Town Lake / Skipper launch post:

Layer	Source	Lifecycle	Cost
1. Schema + usage metadata	DataHub (every column, type, primary key, foreign key; tables historically joined together)	Continuously updated	Cheap (tool call)
2. Human annotations	DataHub table descriptions, glossary terms, `curated` tags	Owner-team-maintained	Cheap
3. Code-derived knowledge	Per-node `.meta.json` from Transformer pipeline	Per-run emission, automatic	Cheap (already in DataHub)
4. Curated data-model pages	Hand-written markdown surfaced as MCP resources	Occasional manual updates	Cheap
5. Runtime introspection	`DESCRIBE table`, `SELECT DISTINCT col LIMIT 20`, `SELECT COUNT(*)` issued live to Trino	Per-query	Expensive — used as safety net

The ordering is cheapest-first, expensive-last. Skipper "uses [runtime introspection] sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."

The structural argument for each layer¶

Layer 1 — schema + usage metadata¶

What the catalog already knows about the tables: column names, types, keys, FKs, historical join patterns from query logs. The historical-join-patterns piece is the load-bearing non-obvious affordance — "DataHub knows... which tables are commonly joined together based on historical query patterns" — it lets the agent learn join paths from observed behaviour, not just declared FKs.

Layer 2 — human annotations¶

The contract is "the team that owns the table writes the description" — distributed authority, owner-team-maintained. Glossary terms unify column names across tables (e.g., user_id across 50 tables maps to the same business concept). The curated tag is the structural separation between validated production tables and scratch space.

Layer 3 — code-derived knowledge¶

The architecturally distinctive layer. From the post:

"Some of the most valuable context is not in any catalog: it's in the SQL that produces the table. The Transformer pipeline emits per-node .meta.json documentation to DataHub on every successful run. So when Skipper looks at fct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built from dim.accounts, dim.customers, and seed.product_classification, with its alloc_amount column computed as billed_amount / 12 for annual; billed_amount for monthly."

Canonicalised at concepts/code-as-context-for-data-agents — the SQL that produces a table carries semantic information no column description ever captures. The structural property: the emission is automatic per Transformer run, so the documentation stays in sync with the actual transformation logic for free.

Layer 4 — curated data-model pages¶

"We maintain a small set of 'data model' pages: short, human- written documents that describe how to think about billing, customers, accounts, and zones."

These are policy-and-orientation documents:

"Prefer tables tagged 'curated'. Avoid scratch_r2 and tables tagged 'internal'. Search with data model terms (e.g., 'billing product revenue') not natural language."

Surfaced as MCP resources the agent can pull when the question matches.

Layer 5 — runtime introspection (safety net)¶

"When everything else fails, Skipper can issue live queries to Trino: DESCRIBE table, SELECT DISTINCT col LIMIT 20, SELECT COUNT(*). It uses these sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."

The cost is the actual query execution; the value is ground truth for everything the catalog might be lying about — schema drift not yet propagated, value distribution that doesn't match documentation, row counts that contradict the table description.

Why layers and not a single big context¶

Three structural arguments:

Cost asymmetry — runtime introspection is orders of magnitude more expensive than catalog lookups; the model should pay for it only when needed. Pre-stuffing everything into the prompt wastes tokens on context that's irrelevant to the current question.
Freshness asymmetry — schema metadata is continuously updated; human annotations change weekly; data-model pages change rarely; runtime data is always fresh. Layering lets each source have its own update cadence.
Specificity asymmetry — schema is generic, code-derived knowledge is table-specific, runtime introspection is query-specific. Different layers answer different questions.

Sibling concepts elsewhere in the wiki¶

The layered-context shape recurs across data agents:

Databricks Genie uses semantic-enterprise-context + verifiable-test-gap-data-queries (concepts/semantic-enterprise-context, concepts/verifiable-test-gap-data-queries) — Skipper Layer 4 + 5 analogues.
Grafana Assistant uses per-service prebuilt context with weekly refresh (concepts/agent-infrastructure-memory, patterns/precomputed-agent-context-files) — Skipper Layer 1+2+3 prebuilt analogue.
Pinterest Analytics Agent uses PinCat-driven schema grounding + governance-aware ranking — Skipper Layer 1 + 4 analogue.

The Cloudflare framing distinctively names all five layers explicitly and characterises each one's lifecycle / cost / specificity. It's the most articulated layered-context recipe in the corpus.

Seen in¶

sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Skipper's five layers as the structural shape; Layer 3 (code-derived) as the architecturally distinctive contribution.

systems/cloudflare-skipper — canonical wiki instance.
systems/cloudflare-town-lake — the platform.
systems/cloudflare-transformer-elt — emits Layer 3 context.
systems/datahub — Layer 1, 2, 3 substrate.
systems/trino — Layer 5 substrate.
systems/databricks-genie — sibling analytical agent.
systems/grafana-assistant — sibling infrastructure agent.
concepts/code-as-context-for-data-agents — Layer 3 isolated.
concepts/agent-infrastructure-memory — sibling pattern.
concepts/semantic-enterprise-context — sibling concept.
concepts/specialized-knowledge-search — sibling concept.
patterns/code-mode-mcp-for-data-agent — the MCP integration that surfaces these layers.
patterns/semantic-context-grounded-search-index — sibling pattern.