CONCEPT Cited by 1 source
Layered grounded context for data agent¶
A design pattern for AI data agents (LLM agents that turn natural-language questions into SQL against a real data warehouse) in which multiple distinct layers of grounded context are made available to the model at retrieval time — not pre-stuffed into the prompt, but pulled on demand as the agent reasons about the user's question.
The architectural argument: an LLM given just a SQL prompt and a list of table names "can hallucinate joins, misuse columns, and confidently produce a number that is completely wrong." The fix is layered context — multiple distinct sources, each with different cost / freshness / specificity properties, combined to ground the model.
Canonical wiki instance: Skipper's five layers¶
From the Cloudflare Town Lake / Skipper launch post:
| Layer | Source | Lifecycle | Cost |
|---|---|---|---|
| 1. Schema + usage metadata | DataHub (every column, type, primary key, foreign key; tables historically joined together) | Continuously updated | Cheap (tool call) |
| 2. Human annotations | DataHub table descriptions, glossary terms, curated tags |
Owner-team-maintained | Cheap |
| 3. Code-derived knowledge | Per-node .meta.json from Transformer pipeline |
Per-run emission, automatic | Cheap (already in DataHub) |
| 4. Curated data-model pages | Hand-written markdown surfaced as MCP resources | Occasional manual updates | Cheap |
| 5. Runtime introspection | DESCRIBE table, SELECT DISTINCT col LIMIT 20, SELECT COUNT(*) issued live to Trino |
Per-query | Expensive — used as safety net |
The ordering is cheapest-first, expensive-last. Skipper "uses [runtime introspection] sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."
The structural argument for each layer¶
Layer 1 — schema + usage metadata¶
What the catalog already knows about the tables: column names, types, keys, FKs, historical join patterns from query logs. The historical-join-patterns piece is the load-bearing non-obvious affordance — "DataHub knows... which tables are commonly joined together based on historical query patterns" — it lets the agent learn join paths from observed behaviour, not just declared FKs.
Layer 2 — human annotations¶
The contract is "the team that owns the table writes the
description" — distributed authority, owner-team-maintained.
Glossary terms unify column names across tables (e.g., user_id
across 50 tables maps to the same business concept). The
curated tag is the structural separation between validated
production tables and scratch space.
Layer 3 — code-derived knowledge¶
The architecturally distinctive layer. From the post:
"Some of the most valuable context is not in any catalog: it's in the SQL that produces the table. The Transformer pipeline emits per-node
.meta.jsondocumentation to DataHub on every successful run. So when Skipper looks atfct.billings_allocated, it doesn't just see the schema; it sees that this is a pre-joined fact table built fromdim.accounts,dim.customers, andseed.product_classification, with itsalloc_amountcolumn computed asbilled_amount / 12 for annual; billed_amount for monthly."
Canonicalised at concepts/code-as-context-for-data-agents — the SQL that produces a table carries semantic information no column description ever captures. The structural property: the emission is automatic per Transformer run, so the documentation stays in sync with the actual transformation logic for free.
Layer 4 — curated data-model pages¶
"We maintain a small set of 'data model' pages: short, human- written documents that describe how to think about billing, customers, accounts, and zones."
These are policy-and-orientation documents:
"Prefer tables tagged 'curated'. Avoid
scratch_r2and tables tagged 'internal'. Search with data model terms (e.g., 'billing product revenue') not natural language."
Surfaced as MCP resources the agent can pull when the question matches.
Layer 5 — runtime introspection (safety net)¶
"When everything else fails, Skipper can issue live queries to Trino:
DESCRIBE table,SELECT DISTINCT col LIMIT 20,SELECT COUNT(*). It uses these sparingly as runtime context is expensive, but it's the safety net that makes the rest of the system robust."
The cost is the actual query execution; the value is ground truth for everything the catalog might be lying about — schema drift not yet propagated, value distribution that doesn't match documentation, row counts that contradict the table description.
Why layers and not a single big context¶
Three structural arguments:
- Cost asymmetry — runtime introspection is orders of magnitude more expensive than catalog lookups; the model should pay for it only when needed. Pre-stuffing everything into the prompt wastes tokens on context that's irrelevant to the current question.
- Freshness asymmetry — schema metadata is continuously updated; human annotations change weekly; data-model pages change rarely; runtime data is always fresh. Layering lets each source have its own update cadence.
- Specificity asymmetry — schema is generic, code-derived knowledge is table-specific, runtime introspection is query-specific. Different layers answer different questions.
Sibling concepts elsewhere in the wiki¶
The layered-context shape recurs across data agents:
- Databricks Genie uses semantic-enterprise-context + verifiable-test-gap-data-queries (concepts/semantic-enterprise-context, concepts/verifiable-test-gap-data-queries) — Skipper Layer 4 + 5 analogues.
- Grafana Assistant uses per-service prebuilt context with weekly refresh (concepts/agent-infrastructure-memory, patterns/precomputed-agent-context-files) — Skipper Layer 1+2+3 prebuilt analogue.
- Pinterest Analytics Agent uses PinCat-driven schema grounding + governance-aware ranking — Skipper Layer 1 + 4 analogue.
The Cloudflare framing distinctively names all five layers explicitly and characterises each one's lifecycle / cost / specificity. It's the most articulated layered-context recipe in the corpus.
Seen in¶
- sources/2026-05-28-cloudflare-how-we-built-cloudflares-data-platform-and-an-ai-agent-on-top-of-it — canonical wiki instance. Skipper's five layers as the structural shape; Layer 3 (code-derived) as the architecturally distinctive contribution.
Related¶
- systems/cloudflare-skipper — canonical wiki instance.
- systems/cloudflare-town-lake — the platform.
- systems/cloudflare-transformer-elt — emits Layer 3 context.
- systems/datahub — Layer 1, 2, 3 substrate.
- systems/trino — Layer 5 substrate.
- systems/databricks-genie — sibling analytical agent.
- systems/grafana-assistant — sibling infrastructure agent.
- concepts/code-as-context-for-data-agents — Layer 3 isolated.
- concepts/agent-infrastructure-memory — sibling pattern.
- concepts/semantic-enterprise-context — sibling concept.
- concepts/specialized-knowledge-search — sibling concept.
- patterns/code-mode-mcp-for-data-agent — the MCP integration that surfaces these layers.
- patterns/semantic-context-grounded-search-index — sibling pattern.