PATTERN Cited by 1 source
Semantic-context-grounded search index
Semantic-context-grounded search index is the discovery pattern of building search indices over a workspace's existing data assets (tables + dashboards + notebooks + documents + files), using the rich relationships between those assets — not just text similarity — as the ranking substrate. Multiple indices run in parallel with rich metadata signals; the agent then picks among the top candidates with source-of-truth disambiguation reasoning.
The pattern operationalises concepts/specialized-knowledge-search (the Genie technique) and exploits concepts/semantic-enterprise-context as its substrate. Databricks disclosed it for Genie in the 2026-05-08 post, reporting "up to 40% improvement" on table-discovery benchmarks over conventional search.
The pattern
Workspace assets
(tables, dashboards, notebooks, documents, files)
                           │
                           ▼
┌────────────────────────────────────────────────────────────┐
│ Index construction (offline + incremental)                 │
│ ──────────────────────────────────────────                 │
│ Index 1: Table schema + names      (embedding + sparse)    │
│ Index 2: Dashboard text + queries  (embedding)             │
│ Index 3: Document text             (embedding)             │
│ Index 4: Lineage graph             (graph DB)              │
│ Index 5: Catalog metadata          (structured filter)     │
└────────────────────────────────────────────────────────────┘
                           │
                           ▼
User query → Search sub-agents (parallel) → Candidate set
                           │
                           ▼
            Source-of-truth disambiguation
                           │
                           ▼
                 Authoritative subset
Three properties make this a distinct pattern:
- Index source is the workspace itself — assets become the corpus.
- Multiple indices in parallel, each tuned to a different signal.
- Rich metadata signals layered on top — recency, ownership, tier, popularity, lineage authority.
Why this beats generic vector search
A generic "vector search across all workspace text" approach treats each asset as an isolated document and ignores:
| Signal type | Example | Generic vector search | Semantic-context-grounded |
|---|---|---|---|
| Schema | "This column is a foreign key to Customers" | Lost (just text) | Preserved |
| Lineage | "This dashboard reads from this table" | Lost | First-class signal |
| Authority | "This table is production-tier owned by finance" | Lost | First-class filter |
| Cross-modal | Doc references column name → matches schema | Lost (different text) | Resolved via catalogue |
| Recency / popularity | "This table was queried 50K times last week" | Lost | First-class signal |
The semantic-context-grounded pattern keeps all of these as explicit ranking dimensions.
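To make the cross-modal row concrete, the sketch below resolves column names mentioned in a document against a catalog, so the document becomes evidence for specific tables. The `CATALOG` contents, the function names, and the simple tokenizer are all assumptions made for this example; the source does not describe how Genie performs this resolution.

```python
import re
from collections import defaultdict

# Hypothetical catalog: table name -> column names (illustrative data only).
CATALOG = {
    "finance.orders": ["order_id", "customer_id", "net_revenue", "order_date"],
    "finance.customers": ["customer_id", "region"],
    "sandbox.tmp_orders": ["order_id", "net_revenue"],
}

def build_column_lookup(catalog):
    """Invert the catalog so each column name maps to the tables that define it."""
    lookup = defaultdict(set)
    for table, columns in catalog.items():
        for column in columns:
            lookup[column].add(table)
    return lookup

def resolve_doc_references(doc_text, lookup):
    """Return {table: sorted matched columns} for identifiers mentioned in a doc."""
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", doc_text.lower()))
    matches = defaultdict(set)
    for token in tokens:
        if token in lookup:  # membership test does not create defaultdict keys
            for table in lookup[token]:
                matches[table].add(token)
    return {table: sorted(cols) for table, cols in matches.items()}

lookup = build_column_lookup(CATALOG)
doc = "Revenue is computed by summing net_revenue grouped by order_date."
resolved = resolve_doc_references(doc, lookup)
```

In this toy data the document links to `finance.orders` on two columns and to `sandbox.tmp_orders` on one, a relationship that text similarity between the document and the schema may not surface on its own.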
Index types: the multi-index decomposition
Databricks has not disclosed which indices Genie uses. A plausible decomposition, and a reasonable starting point for implementations, is:
| Index | Backing tech | What it answers |
|---|---|---|
| Schema search | Embedding + sparse over table names + column names + schemas | "Which tables have a revenue column with a date dimension?" |
| Dashboard search | Embedding over dashboard titles + descriptions + visible text | "Which dashboards talk about Q4 revenue?" |
| Document search | Embedding over wiki / Drive / SharePoint docs | "Which docs explain how revenue is computed?" |
| Lineage search | Graph DB (or materialised lineage table) | "Which upstream tables feed this dashboard's revenue number?" |
| Notebook search | Embedding over notebook code + prose | "Has anyone analysed this metric before?" |
| Catalog filter | Structured filter over Unity Catalog metadata | "Show only production-tier tables owned by finance" |
A user query fans out across these indices; each returns its own top-K candidates; a fusion step then combines them into a single candidate set.
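The fusion method is not disclosed in the source. One common choice for merging ranked lists from heterogeneous indices is reciprocal rank fusion (RRF), sketched here as an assumption rather than as Genie's method. The asset ids, the `k=60` constant, and the function name are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Fuse per-index rankings: score(asset) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, asset_id in enumerate(ranking, start=1):
            scores[asset_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by asset id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [asset_id for asset_id, _ in ordered[:top_n]]

# Hypothetical top-K output from three of the indices for one query.
per_index_results = {
    "schema": ["finance.orders", "sandbox.tmp_orders", "finance.customers"],
    "dashboard": ["dash.q4_revenue", "finance.orders"],
    "document": ["doc.revenue_definition", "finance.orders"],
}
fused = reciprocal_rank_fusion(per_index_results)
```

RRF uses only ranks, not raw scores, so it avoids having to calibrate embedding similarities against sparse or graph scores. An asset that appears in several indices, such as `finance.orders` here, rises to the top.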
Metadata signals overlay
On top of similarity ranking, the pattern layers rich metadata signals to break ties and filter candidates:
- Recency — modification time, query timestamp.
- Ownership — table / dashboard owner; team affiliation.
- Tier / governance label — production vs experimental.
- Popularity — query count, dashboard view count.
- Freshness — data freshness (ETL staleness).
- Authority markers — "single source of truth" designation.
These are filters and rerankers, not similarity scores — they bound the candidate set to plausibly-authoritative subsets before the agent's source-of-truth disambiguation reasoning runs.
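A minimal sketch of "filter first, then rerank" follows. The `Candidate` fields, the tier names, the 90-day cutoff, and the ordering of tie-breakers are assumptions chosen for illustration; the source names the signal families but not how they are weighted.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    asset_id: str
    similarity: float        # score from a similarity index, assumed in [0, 1]
    tier: str                # governance label, e.g. "production" or "experimental"
    weekly_queries: int      # popularity signal
    days_since_update: int   # recency signal
    is_source_of_truth: bool = False  # authority marker

def filter_and_rerank(candidates, allowed_tiers=("production",), max_age_days=90):
    """Drop candidates that fail hard filters, then order by metadata tie-breakers."""
    kept = [
        c for c in candidates
        if c.tier in allowed_tiers and c.days_since_update <= max_age_days
    ]
    # Round similarity into coarse buckets so metadata decides near-ties.
    return sorted(
        kept,
        key=lambda c: (
            -round(c.similarity, 1),
            not c.is_source_of_truth,
            -c.weekly_queries,
            c.days_since_update,
        ),
    )

candidates = [
    Candidate("sandbox.tmp_orders", 0.90, "experimental", 40, 2),
    Candidate("finance.orders_v1", 0.84, "production", 120, 30),
    Candidate("finance.orders", 0.82, "production", 50_000, 1, is_source_of_truth=True),
    Candidate("finance.orders_archive", 0.81, "production", 5, 400),
]
ranked = filter_and_rerank(candidates)
```

The experimental table has the highest similarity but is filtered out, the archive table fails the recency bound, and the authority marker breaks the near-tie between the two remaining production tables.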
Disclosed result (Genie)
Figure 4 of the source post, "Comparison of Specialized Knowledge Search for Table Search performance", reports an "up to 40%" improvement on Genie's internal table-discovery benchmarks. Details of the benchmark composition (query distribution, dataset size, comparison method) are not disclosed.
Composition with the agent's discovery phase
The semantic-context-grounded search index is what phase 1 (discovery) of the data-agent trajectory uses. Each search sub-agent invokes (some subset of) the indices; the parallel multi-agent discovery phase fans out across indices and aggregates the candidate set.
Phase 1 (Discovery):
├─→ Search sub-agent A → Schema index → top-K tables
├─→ Search sub-agent B → Dashboard index → top-K dashboards
├─→ Search sub-agent C → Document index → top-K docs
└─→ Search sub-agent D → Lineage graph → upstream/downstream of candidates
                    │
                    ▼
          Combined candidate set
                    │
                    ▼
      Source-of-truth disambiguation
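The fan-out in the diagram can be sketched with a thread pool. The three search functions and the `UPSTREAM` lineage map are hypothetical stand-ins with canned results; in a real system each would call an index, and the sub-agents would be LLM-driven rather than plain functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for index-backed searches; real ones would query an index.
def search_schema(query):
    return ["finance.orders", "sandbox.tmp_orders"]

def search_dashboards(query):
    return ["dash.q4_revenue"]

def search_documents(query):
    return ["doc.revenue_definition"]

# Hypothetical lineage map: asset -> upstream assets it reads from.
UPSTREAM = {"dash.q4_revenue": ["finance.orders", "finance.fx_rates"]}

def run_discovery(query, sub_agents):
    """Run every search sub-agent concurrently and collect results per agent."""
    with ThreadPoolExecutor(max_workers=len(sub_agents)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sub_agents.items()}
        return {name: future.result() for name, future in futures.items()}

def combine(results, upstream):
    """Merge per-agent results, add upstream assets, and de-duplicate in order."""
    combined = []
    for hits in results.values():
        for asset_id in hits:
            for candidate in [asset_id, *upstream.get(asset_id, [])]:
                if candidate not in combined:
                    combined.append(candidate)
    return combined

sub_agents = {
    "schema": search_schema,
    "dashboard": search_dashboards,
    "document": search_documents,
}
results = run_discovery("Q4 revenue by region", sub_agents)
candidate_set = combine(results, UPSTREAM)
```

The lineage step runs after the other searches because it expands assets that are already candidates, which matches the "upstream/downstream of candidates" role of sub-agent D in the diagram.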
Building and maintaining the index
Databricks does not disclose this in detail, but a plausible operational shape is:
| Stage | Trigger |
|---|---|
| Initial build | First connect to a workspace |
| Incremental update | Catalog change events (new table, new dashboard, schema evolution, ownership change) |
| Periodic full rebuild | Drift / consistency reasons (weekly, monthly) |
| Usage signal refresh | Query telemetry → popularity / recency signals updated continuously |
For a fast-changing workspace, the index needs to keep up with catalog changes — stale indices return stale candidates.
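The stages in the table can be sketched as three operations on a toy in-memory index. The `CatalogEvent` shape, the `"upsert"`/`"delete"` vocabulary, and the `AssetIndex` class are assumptions for illustration and do not correspond to any Databricks API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEvent:
    kind: str          # "upsert" or "delete" (assumed event vocabulary)
    asset_id: str
    metadata: dict = field(default_factory=dict)

class AssetIndex:
    """Toy in-memory index keyed by asset id; stands in for a real search index."""

    def __init__(self):
        self.entries = {}

    def apply(self, event):
        """Incremental update: touch only the asset named in the event."""
        if event.kind == "delete":
            self.entries.pop(event.asset_id, None)
        elif event.kind == "upsert":
            self.entries.setdefault(event.asset_id, {}).update(event.metadata)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")

    def rebuild(self, catalog_snapshot):
        """Periodic full rebuild: replace all entries from a catalog snapshot."""
        self.entries = {aid: dict(meta) for aid, meta in catalog_snapshot.items()}

    def record_usage(self, asset_id, query_count):
        """Usage signal refresh: add query telemetry to the popularity counter."""
        entry = self.entries.get(asset_id)
        if entry is not None:
            entry["weekly_queries"] = entry.get("weekly_queries", 0) + query_count

index = AssetIndex()
index.apply(CatalogEvent("upsert", "finance.orders", {"owner": "finance", "tier": "production"}))
index.apply(CatalogEvent("upsert", "sandbox.tmp_orders", {"owner": "alice", "tier": "experimental"}))
index.apply(CatalogEvent("upsert", "finance.orders", {"owner": "finance-data"}))  # ownership change
index.apply(CatalogEvent("delete", "sandbox.tmp_orders"))
index.record_usage("finance.orders", 50_000)
```

Keeping the full rebuild alongside event-driven updates gives a way to recover from missed or out-of-order events, which is the drift concern the table mentions.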
When this fits / doesn't
Fits:
- Workspace has a governed catalogue (Unity Catalog, DataHub, Amundsen) — relationships are queryable.
- Heterogeneous asset types (tables + dashboards + docs).
- Asset count high enough that a single similarity index produces too much noise.
- Operational discipline for keeping metadata current.
Doesn't fit:
- Ungoverned data swamps — the "rich semantic context" the pattern exploits doesn't exist; falls back to text similarity at best.
- Small workspaces — multi-index complexity isn't justified.
- Workspaces dominated by one asset type (just tables, just docs) — the multi-index advantage shrinks.
- Real-time-everything workspaces where the catalog can't keep up.
Anti-patterns
- Single big embedding index over all assets — loses schema awareness, lineage, structured metadata.
- Treating all relationships equally — a doc referencing a table is not the same signal as a dashboard reading from a table; the pattern needs typed edges.
- No metadata filters — top-K from similarity alone surfaces deprecated / experimental / abandoned assets without filters.
- Stale index — workspace evolves but index doesn't; agent retrieves outdated assets.
- Index without telemetry — can't tune which signals matter; blind ranking.
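The typed-edges point above can be sketched as a weighted in-degree over a relationship graph. The edge types, the weights in `EDGE_WEIGHTS`, and the example edges are hypothetical; real weights would need tuning against telemetry.

```python
from collections import defaultdict

# Assumed edge types and weights: a dashboard reading a table counts for more
# than a document or notebook merely mentioning it.
EDGE_WEIGHTS = {"reads_from": 1.0, "derived_from": 0.8, "mentions": 0.2}

# Hypothetical relationship edges: (source asset, edge type, target asset).
EDGES = [
    ("dash.q4_revenue", "reads_from", "finance.orders"),
    ("doc.revenue_definition", "mentions", "finance.orders"),
    ("doc.old_notes", "mentions", "sandbox.tmp_orders"),
    ("notebook.adhoc", "mentions", "sandbox.tmp_orders"),
]

def authority_scores(edges, weights):
    """Sum incoming edge weights per target; unknown edge types contribute 0."""
    scores = defaultdict(float)
    for _source, edge_type, target in edges:
        scores[target] += weights.get(edge_type, 0.0)
    return dict(scores)

def untyped_in_degree(edges):
    """Baseline that treats every relationship equally."""
    counts = defaultdict(int)
    for _source, _edge_type, target in edges:
        counts[target] += 1
    return dict(counts)

typed = authority_scores(EDGES, EDGE_WEIGHTS)
untyped = untyped_in_degree(EDGES)
```

With untyped edges both tables have two incoming relationships and look equally authoritative; with typed weights the table that a dashboard actually reads from scores higher.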
Relationship to related patterns
- patterns/four-phase-data-agent-trajectory — phase 1 consumes this pattern's output.
- patterns/parallel-trajectory-sampling-and-aggregation — each trajectory's discovery phase uses this index.
- patterns/llm-per-subagent-with-optimized-prompts — search sub-agents (using this index) get their own (LLM, prompt) optimisation.
Seen in
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of semantic-context-grounded search index as a named pattern. Genie's "Specialized Knowledge Search" — index built from existing workspace assets' rich semantic context, multiple indices in parallel, rich metadata signals, "up to 40%" table-discovery benchmark improvement (Figure 4). Positioned as the architectural response to the scale-of-data-discovery challenge unique to data agents.