PATTERN Cited by 1 source

Semantic-context-grounded search index¶

Semantic-context-grounded search index is the discovery pattern of building search indices over a workspace's existing data assets (tables + dashboards + notebooks + documents + files), using the rich relationships between those assets — not just text similarity — as the ranking substrate. Multiple indices run in parallel with rich metadata signals; the agent then picks among the top candidates with source-of-truth disambiguation reasoning.

The pattern operationalises concepts/specialized-knowledge-search (the Genie technique) and exploits concepts/semantic-enterprise-context as its substrate. Databricks disclosed it for Genie in the 2026-05-08 post, reporting "up to 40% improvement" on table-discovery benchmarks over conventional search.

The pattern¶

   Workspace assets
   (tables, dashboards, notebooks, documents, files)
                         │
                         ▼
   ┌─────────────────────────────────────────────────────────┐
   │  Index construction (offline + incremental)              │
   │  ─────────────────────────────────────────────────       │
   │  Index 1: Table schema + names (embedding + sparse)      │
   │  Index 2: Dashboard text + queries (embedding)           │
   │  Index 3: Document text (embedding)                      │
   │  Index 4: Lineage graph (graph DB)                       │
   │  Index 5: Catalog metadata (structured filter)           │
   └─────────────────────────────────────────────────────────┘
                         │
                         ▼
   User query → Search sub-agents (parallel) → Candidate set
                         │
                         ▼
                Source-of-truth disambiguation
                         │
                         ▼
                  Authoritative subset

Three properties make this a distinct pattern:

Index source is the workspace itself — assets become the corpus.
Multiple indices in parallel, each tuned to a different signal.
Rich metadata signals layered on top — recency, ownership, tier, popularity, lineage authority.

Why this beats generic vector search¶

A generic "vector search across all workspace text" approach treats each asset as an isolated document and ignores:

Signal type	Example	Generic vector search	Semantic-context-grounded
Schema	"This column is a foreign key to Customers"	Lost (just text)	Preserved
Lineage	"This dashboard reads from this table"	Lost	First-class signal
Authority	"This table is production-tier owned by finance"	Lost	First-class filter
Cross-modal	Doc references column name → matches schema	Lost (different text)	Resolved via catalogue
Recency / popularity	"This table was queried 50K times last week"	Lost	First-class signal

The semantic-context-grounded pattern keeps all of these as explicit ranking dimensions.
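The cross-modal row can be illustrated with a small sketch: instead of hoping a document embedding lands near a schema embedding, identifiers mentioned in the document are resolved against column names held in the catalogue. The catalogue contents, asset names and the `resolve_mentions` helper are all hypothetical, and a real system would likely use more robust entity linking than token matching.

```python
import re
from typing import Dict, List, Set

# Hypothetical catalogue extract: table name -> set of column names.
CATALOG: Dict[str, Set[str]] = {
    "tbl.revenue_daily": {"order_date", "net_revenue", "currency"},
    "tbl.customers": {"customer_id", "region"},
}


def resolve_mentions(doc_text: str, catalog: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map a document to the tables whose column names it mentions."""
    # Identifier-like tokens, lower-cased so matching is case-insensitive.
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", doc_text.lower()))
    return {
        table: sorted(columns & tokens)
        for table, columns in catalog.items()
        if columns & tokens
    }


doc = "Net revenue is computed from net_revenue, converted using currency."
links = resolve_mentions(doc, CATALOG)
print(links)
```

Each resolved link can then be stored as an explicit document-to-table relationship rather than being left implicit in text similarity.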

Index types — the multi-index decomposition¶

Databricks has not disclosed which indices Genie uses. The decomposition below is a plausible one and a reasonable starting point for implementations:

Index	Backing tech	What it answers
Schema search	Embedding + sparse over table names + column names + schemas	"Which tables have a `revenue` column with a `date` dimension?"
Dashboard search	Embedding over dashboard titles + descriptions + visible text	"Which dashboards talk about Q4 revenue?"
Document search	Embedding over wiki / Drive / SharePoint docs	"Which docs explain how revenue is computed?"
Lineage search	Graph DB (or materialised lineage table)	"Which upstream tables feed this dashboard's revenue number?"
Notebook search	Embedding over notebook code + prose	"Has anyone analysed this metric before?"
Catalog filter	Structured filter over Unity Catalog metadata	"Show only production-tier tables owned by finance"

A user query fans out across these indices; each returns its own top-K candidates, and a fusion step combines them into a single candidate set.
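The fan-out and fusion step might look like the sketch below. The source does not say which fusion method Genie uses; reciprocal rank fusion is swapped in here as one common choice, and the `Index` interface, the toy indices and the asset IDs are assumptions made for illustration. Each index is assumed to return asset IDs that are already resolved to catalogue entries.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical index interface: query text -> ranked asset IDs (best first).
# Real indices would wrap an embedding store, a sparse retriever, a graph query, etc.
Index = Callable[[str], List[str]]


def fan_out(query: str, indices: Dict[str, Index], k: int = 5) -> Dict[str, List[str]]:
    """Query every index and keep each one's top-k candidates."""
    return {name: search(query)[:k] for name, search in indices.items()}


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], c: int = 60) -> List[str]:
    """Fuse ranked lists: an asset scores sum(1 / (c + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, asset_id in enumerate(ranked, start=1):
            scores[asset_id] += 1.0 / (c + rank)
    # Highest fused score first; break exact ties by ID so output is deterministic.
    return sorted(scores, key=lambda a: (-scores[a], a))


# Toy stand-ins for the schema, dashboard and document indices (assumed data).
toy_indices: Dict[str, Index] = {
    "schema": lambda q: ["tbl.revenue_daily", "tbl.orders", "tbl.customers"],
    "dashboard": lambda q: ["dash.q4_revenue", "tbl.revenue_daily"],
    "document": lambda q: ["doc.revenue_definition", "tbl.revenue_daily", "tbl.orders"],
}

candidates = reciprocal_rank_fusion(fan_out("q4 revenue", toy_indices))
print(candidates)
```

Rank-based fusion is convenient here because scores from an embedding index, a sparse retriever and a graph query are not on comparable scales, whereas ranks are.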

Metadata signals overlay¶

On top of similarity ranking, the pattern layers rich metadata signals to break ties and filter candidates:

Recency — modification time, query timestamp.
Ownership — table / dashboard owner; team affiliation.
Tier / governance label — production vs experimental.
Popularity — query count, dashboard view count.
Freshness — data freshness (ETL staleness).
Authority markers — "single source of truth" designation.

These are filters and rerankers, not similarity scores — they bound the candidate set to plausibly-authoritative subsets before the agent's source-of-truth disambiguation reasoning runs.
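A minimal sketch of this overlay, assuming a hypothetical `Candidate` record and made-up thresholds: hard filters drop non-authoritative assets first, then metadata breaks ties among candidates whose similarity falls in the same coarse bucket. The field names, the bucket width and the tie-break order are illustrative choices, not disclosed details.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    asset_id: str
    similarity: float        # score from the similarity indices, assumed in [0, 1]
    tier: str                # governance label: "production" or "experimental"
    deprecated: bool
    queries_last_week: int   # popularity signal
    days_since_update: int   # recency signal
    source_of_truth: bool    # authority marker


def filter_candidates(cands: List[Candidate]) -> List[Candidate]:
    """Hard filter: keep only production-tier, non-deprecated assets."""
    return [c for c in cands if c.tier == "production" and not c.deprecated]


def rerank(cands: List[Candidate]) -> List[Candidate]:
    """Order by coarse similarity bucket, then let metadata break ties."""
    def key(c: Candidate) -> Tuple[int, bool, int, int]:
        bucket = int(c.similarity * 10)  # 0.84 and 0.86 share bucket 8
        return (bucket, c.source_of_truth, c.queries_last_week, -c.days_since_update)
    return sorted(cands, key=key, reverse=True)


pool = [
    Candidate("tbl.revenue_scratch", 0.91, "experimental", False, 30, 3, False),
    Candidate("tbl.revenue_old", 0.88, "production", True, 5, 400, False),
    Candidate("tbl.revenue_daily", 0.84, "production", False, 50_000, 1, True),
    Candidate("tbl.revenue_copy", 0.86, "production", False, 120, 40, False),
    Candidate("tbl.orders", 0.62, "production", False, 9_000, 2, False),
]

ranked = [c.asset_id for c in rerank(filter_candidates(pool))]
print(ranked)
```

In this toy data the two highest-similarity assets never reach the agent, and the designated source-of-truth table outranks a near-duplicate copy with slightly higher raw similarity.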

Disclosed result (Genie)¶

Figure 4 of the source post ("Comparison of Specialized Knowledge Search for Table Search performance") reports "up to 40%" improvement on Genie's internal table-discovery benchmarks. Details of the benchmark composition (query distribution, dataset size, comparison method) are not disclosed.

Composition with the agent's discovery phase¶

The semantic-context-grounded search index is what phase 1 (discovery) of the data-agent trajectory uses. Each search sub-agent invokes (some subset of) the indices; the parallel multi-agent discovery phase fans out across indices and aggregates the candidate set.

Phase 1 (Discovery):
  ├─→ Search sub-agent A → Schema index → top-K tables
  ├─→ Search sub-agent B → Dashboard index → top-K dashboards
  ├─→ Search sub-agent C → Document index → top-K docs
  └─→ Search sub-agent D → Lineage graph → upstream/downstream of candidates
                            │
                            ▼
                    Combined candidate set
                            │
                            ▼
            Source-of-truth disambiguation
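The fan-out in the diagram above can be sketched with a thread pool. The sub-agents here are stub functions returning fixed results; in a real system each would be an LLM-driven agent calling one or more indices. All names and return values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical sub-agent interface: query text -> asset IDs it found.
SubAgent = Callable[[str], List[str]]


def schema_agent(query: str) -> List[str]:
    return ["tbl.revenue_daily", "tbl.orders"]


def dashboard_agent(query: str) -> List[str]:
    return ["dash.q4_revenue"]


def document_agent(query: str) -> List[str]:
    return ["doc.revenue_definition"]


def lineage_agent(query: str) -> List[str]:
    # Stub: upstream tables a lineage lookup might surface.
    return ["tbl.orders", "tbl.fx_rates"]


def discover(query: str, agents: Dict[str, SubAgent]) -> Dict[str, List[str]]:
    """Run all sub-agents concurrently; return asset ID -> agents that found it."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, query) for name, agent in agents.items()}
        # Collect in submission order so the combined set is deterministic.
        results = {name: future.result() for name, future in futures.items()}
    combined: Dict[str, List[str]] = {}
    for name, assets in results.items():
        for asset_id in assets:
            combined.setdefault(asset_id, []).append(name)
    return combined


agents: Dict[str, SubAgent] = {
    "schema": schema_agent,
    "dashboard": dashboard_agent,
    "document": document_agent,
    "lineage": lineage_agent,
}
combined = discover("q4 revenue", agents)
print(combined)
```

Keeping the provenance (which sub-agents surfaced each asset) gives the disambiguation step something to reason over beyond a flat list of names.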

Building and maintaining the index¶

Databricks has not disclosed this in detail, but a plausible operational shape is:

Stage	Trigger
Initial build	First connect to a workspace
Incremental update	Catalog change events (new table, new dashboard, schema evolution, ownership change)
Periodic full rebuild	Drift / consistency reasons (weekly, monthly)
Usage signal refresh	Query telemetry → popularity / recency signals updated continuously

For a fast-changing workspace, the index needs to keep up with catalog changes — stale indices return stale candidates.
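The incremental-update stage could be sketched as an event handler that applies catalog change events to index entries one at a time. The event kinds, the `CatalogEvent` and `AssetIndex` types, and the payload fields are assumptions for illustration; no real catalogue's event schema is implied, and a real implementation would also recompute embeddings for changed entries.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CatalogEvent:
    kind: str       # assumed kinds: "created", "schema_changed", "owner_changed", "dropped"
    asset_id: str
    payload: Dict[str, str] = field(default_factory=dict)


class AssetIndex:
    """Toy index: asset ID -> metadata fields."""

    def __init__(self) -> None:
        self.entries: Dict[str, Dict[str, str]] = {}

    def apply(self, event: CatalogEvent) -> None:
        if event.kind == "dropped":
            # Remove the entry so dropped assets stop appearing as candidates.
            self.entries.pop(event.asset_id, None)
        elif event.kind in {"created", "schema_changed", "owner_changed"}:
            # Upsert only the changed fields; untouched fields are kept.
            self.entries.setdefault(event.asset_id, {}).update(event.payload)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")


index = AssetIndex()
for event in [
    CatalogEvent("created", "tbl.revenue_daily", {"columns": "date, revenue", "owner": "finance"}),
    CatalogEvent("created", "tbl.tmp_export", {"columns": "id", "owner": "sandbox"}),
    CatalogEvent("schema_changed", "tbl.revenue_daily", {"columns": "date, revenue, currency"}),
    CatalogEvent("dropped", "tbl.tmp_export"),
]:
    index.apply(event)

print(index.entries)
```

A periodic full rebuild remains useful alongside this, since missed or out-of-order events would otherwise leave the index silently inconsistent.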

When this fits / doesn't¶

Fits:

Workspace has a governed catalogue (Unity Catalog, DataHub, Amundsen) — relationships are queryable.
Heterogeneous asset types (tables + dashboards + docs).
Asset count high enough that a single similarity index produces too much noise.
Operational discipline for keeping metadata current.

Doesn't fit:

Ungoverned data swamps — the "rich semantic context" the pattern exploits doesn't exist; falls back to text similarity at best.
Small workspaces — multi-index complexity isn't justified.
Workspaces dominated by one asset type (just tables, just docs) — the multi-index advantage shrinks.
Real-time-everything workspaces where the catalog can't keep up.

Anti-patterns¶

Single big embedding index over all assets — loses schema awareness, lineage, structured metadata.
Treating all relationships equally — a doc referencing a table is not the same signal as a dashboard reading from a table; the pattern needs typed edges.
No metadata filters — top-K from similarity alone surfaces deprecated / experimental / abandoned assets without filters.
Stale index — workspace evolves but index doesn't; agent retrieves outdated assets.
Index without telemetry — can't tune which signals matter; blind ranking.
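The typed-edges point in the list above can be made concrete with a small sketch in which each relationship carries a type and each type contributes a different weight to a table's authority score. The edge types, weights and asset names are hypothetical; in practice the weights would be tuned from telemetry rather than fixed by hand.

```python
from typing import Dict, List, Tuple

# (source asset, edge type, target asset); assumed example relationships.
Edge = Tuple[str, str, str]

EDGES: List[Edge] = [
    ("dash.q4_revenue", "reads_from", "tbl.revenue_daily"),
    ("doc.revenue_definition", "mentions", "tbl.revenue_daily"),
    ("doc.old_notes", "mentions", "tbl.revenue_copy"),
    ("tbl.revenue_daily", "derived_from", "tbl.orders"),
]

# Assumed weights: a dashboard reading a table is stronger evidence of
# authority than a document merely mentioning it.
EDGE_WEIGHT: Dict[str, float] = {
    "reads_from": 1.0,
    "derived_from": 0.8,
    "mentions": 0.3,
}


def authority_scores(edges: List[Edge], weights: Dict[str, float]) -> Dict[str, float]:
    """Sum type-specific weights over the incoming edges of each target asset."""
    scores: Dict[str, float] = {}
    for _source, edge_type, target in edges:
        scores[target] = scores.get(target, 0.0) + weights[edge_type]
    return scores


scores = authority_scores(EDGES, EDGE_WEIGHT)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

With untyped edges, `tbl.revenue_copy` and `tbl.orders` would each count one incoming relationship and look equally authoritative; typing the edges separates them.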

patterns/four-phase-data-agent-trajectory — phase 1 consumes this pattern's output.
patterns/parallel-trajectory-sampling-and-aggregation — each trajectory's discovery phase uses this index.
patterns/llm-per-subagent-with-optimized-prompts — search sub-agents (using this index) get their own (LLM, prompt) optimisation.

Seen in¶

sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of semantic-context-grounded search index as a named pattern. Genie's "Specialized Knowledge Search" — index built from existing workspace assets' rich semantic context, multiple indices in parallel, rich metadata signals, "up to 40%" table-discovery benchmark improvement (Figure 4). Positioned as the architectural response to the scale-of-data-discovery challenge unique to data agents.