PATTERN Cited by 1 source

Semantic-context-grounded search index

Semantic-context-grounded search index is the discovery pattern of building search indices over a workspace's existing data assets (tables + dashboards + notebooks + documents + files), using the rich relationships between those assets — not just text similarity — as the ranking substrate. Multiple indices run in parallel with rich metadata signals; the agent then picks among the top candidates with source-of-truth disambiguation reasoning.

The pattern operationalises concepts/specialized-knowledge-search (the Genie technique) and exploits concepts/semantic-enterprise-context as its substrate. Databricks disclosed it for Genie in the 2026-05-08 post, reporting "up to 40%" improvement on table-discovery benchmarks over conventional search.

The pattern

   Workspace assets
   (tables, dashboards, notebooks, documents, files)
                              ▼
   ┌─────────────────────────────────────────────────────────┐
   │  Index construction (offline + incremental)              │
   │  ─────────────────────────────────────────────────       │
   │  Index 1: Table schema + names (embedding + sparse)      │
   │  Index 2: Dashboard text + queries (embedding)           │
   │  Index 3: Document text (embedding)                      │
   │  Index 4: Lineage graph (graph DB)                       │
   │  Index 5: Catalog metadata (structured filter)           │
   └─────────────────────────────────────────────────────────┘
                              ▼
   User query → Search sub-agents (parallel) → Candidate set
                                                     ▼
                                      Source-of-truth disambiguation
                                                     ▼
                                           Authoritative subset

Three properties make this a distinct pattern:

  1. Index source is the workspace itself — assets become the corpus.
  2. Multiple indices in parallel, each tuned to a different signal.
  3. Rich metadata signals layered on top — recency, ownership, tier, popularity, lineage authority.

A generic "vector search across all workspace text" approach treats each asset as an isolated document and ignores:

Signal type          | Example                                          | Generic vector search | Semantic-context-grounded
---------------------+--------------------------------------------------+-----------------------+--------------------------
Schema               | "This column is a foreign key to Customers"      | Lost (just text)      | Preserved
Lineage              | "This dashboard reads from this table"           | Lost                  | First-class signal
Authority            | "This table is production-tier owned by finance" | Lost                  | First-class filter
Cross-modal          | Doc references column name → matches schema      | Lost (different text) | Resolved via catalogue
Recency / popularity | "This table was queried 50K times last week"     | Lost                  | First-class signal

The semantic-context-grounded pattern keeps all of these as explicit ranking dimensions.

Index types — the multi-index decomposition

Databricks does not disclose which indices Genie uses; the following is a plausible decomposition (and a reasonable starting point for implementations):

Index            | Backing tech                                                  | What it answers
-----------------+---------------------------------------------------------------+--------------------------------------------------------------
Schema search    | Embedding + sparse over table names + column names + schemas  | "Which tables have a revenue column with a date dimension?"
Dashboard search | Embedding over dashboard titles + descriptions + visible text | "Which dashboards talk about Q4 revenue?"
Document search  | Embedding over wiki / Drive / SharePoint docs                 | "Which docs explain how revenue is computed?"
Lineage search   | Graph DB (or materialised lineage table)                      | "Which upstream tables feed this dashboard's revenue number?"
Notebook search  | Embedding over notebook code + prose                          | "Has anyone analysed this metric before?"
Catalog filter   | Structured filter over Unity Catalog metadata                 | "Show only production-tier tables owned by finance"

A user query fans out across these indices; each returns its own top-K candidates; a fusion step combines them into a single ranked candidate set.
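The source does not say how Genie fuses per-index results. One common choice is reciprocal rank fusion (RRF), sketched here in Python. The index names, asset identifiers, and the constant k=60 are illustrative assumptions, not disclosed details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Fuse several ranked candidate lists into one.

    ranked_lists maps an index name to its top-K asset ids, best first.
    Each asset scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for _index_name, asset_ids in ranked_lists.items():
        for rank, asset_id in enumerate(asset_ids, start=1):
            scores[asset_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by asset id for a stable order.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical per-index results for the query "Q4 revenue".
results = {
    "schema": ["finance.revenue_daily", "sales.orders", "tmp.rev_scratch"],
    "dashboard": ["dash.q4_revenue", "finance.revenue_daily"],
    "lineage": ["finance.revenue_daily", "sales.orders"],
}
fused = reciprocal_rank_fusion(results)
fused_ids = [asset_id for asset_id, _score in fused]
```

RRF only needs ranks, so it sidesteps the problem that embedding, sparse, and graph scores are not on a comparable scale. An asset that several indices agree on (here the hypothetical finance.revenue_daily) rises above one that a single index ranks first.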

Metadata signals overlay

On top of similarity ranking, the pattern layers rich metadata signals to break ties + filter:

  • Recency — modification time, query timestamp.
  • Ownership — table / dashboard owner; team affiliation.
  • Tier / governance label — production vs experimental.
  • Popularity — query count, dashboard view count.
  • Freshness — data freshness (ETL staleness).
  • Authority markers — "single source of truth" designation.

These are filters and rerankers, not similarity scores — they bound the candidate set to plausibly-authoritative subsets before the agent's source-of-truth disambiguation reasoning runs.
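A minimal sketch of the overlay, assuming each candidate already carries a similarity score plus catalog metadata. The field names, the tier and age thresholds, and the tie-break order are all hypothetical; the source names the signals but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    asset_id: str
    similarity: float              # score from a similarity index, 0..1
    tier: str                      # governance label, e.g. "production"
    queries_last_week: int         # popularity signal
    days_since_update: int         # recency signal
    source_of_truth: bool = False  # authority marker


def filter_and_rerank(candidates, allowed_tiers=("production",), max_age_days=90):
    """Apply hard metadata filters, then let metadata break near-ties."""
    kept = [
        c for c in candidates
        if c.tier in allowed_tiers and c.days_since_update <= max_age_days
    ]
    # Similarity is rounded to one decimal so metadata reorders only near-ties:
    # authority marker first, then popularity, then recency.
    return sorted(
        kept,
        key=lambda c: (
            -round(c.similarity, 1),
            not c.source_of_truth,
            -c.queries_last_week,
            c.days_since_update,
        ),
    )


candidates = [
    Candidate("tmp.rev_scratch", 0.93, "experimental", 12, 3),
    Candidate("finance.revenue_v1", 0.95, "production", 40, 400),
    Candidate("sales.revenue_rollup", 0.92, "production", 900, 2),
    Candidate("finance.revenue_daily", 0.88, "production", 50000, 1, True),
]
ranked = filter_and_rerank(candidates)
ranked_ids = [c.asset_id for c in ranked]
```

In this toy data the two highest-similarity assets are removed by the filters (one experimental, one stale), and the authority marker lifts the designated table above a slightly more similar rollup.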

Disclosed result (Genie)

Figure 4 of the source post ("Comparison of Specialized Knowledge Search for Table Search performance") reports "up to 40%" improvement on Genie's internal table-discovery benchmarks. Details of the benchmark composition (query distribution, dataset size, comparison method) are not disclosed.

Composition with the agent's discovery phase

The semantic-context-grounded search index is what phase 1 (discovery) of the data-agent trajectory uses. Each search sub-agent invokes (some subset of) the indices; the parallel multi-agent discovery phase fans out across indices and aggregates the candidate set.

Phase 1 (Discovery):
  ├─→ Search sub-agent A → Schema index → top-K tables
  ├─→ Search sub-agent B → Dashboard index → top-K dashboards
  ├─→ Search sub-agent C → Document index → top-K docs
  └─→ Search sub-agent D → Lineage graph → upstream/downstream of candidates
                              ▼
                    Combined candidate set
                              ▼
            Source-of-truth disambiguation
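The fan-out above can be sketched with a thread pool. The in-memory dictionaries stand in for real index clients, and every name here (SCHEMA_INDEX, keyword_search, discover) is hypothetical. The lineage step runs after the others because it expands candidates that the first three searches return.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory "indices": keyword -> asset ids, best first.
SCHEMA_INDEX = {"revenue": ["finance.revenue_daily", "sales.orders"]}
DASHBOARD_INDEX = {"revenue": ["dash.q4_revenue"]}
DOCUMENT_INDEX = {"revenue": ["wiki/revenue-definition"]}
# Lineage edges: dashboard -> tables it reads from.
LINEAGE = {"dash.q4_revenue": ["finance.revenue_daily"]}


def keyword_search(index, query, k):
    """Stand-in for a search sub-agent querying one index."""
    hits = []
    for term in query.lower().split():
        hits.extend(index.get(term, []))
    return hits[:k]


def discover(query, k=3):
    searches = {
        "schema": SCHEMA_INDEX,
        "dashboard": DASHBOARD_INDEX,
        "document": DOCUMENT_INDEX,
    }
    # Sub-agents A-C run in parallel, one per index.
    with ThreadPoolExecutor(max_workers=len(searches)) as pool:
        futures = {
            name: pool.submit(keyword_search, index, query, k)
            for name, index in searches.items()
        }
        candidates = {name: future.result() for name, future in futures.items()}
    # Sub-agent D: follow lineage edges from the dashboards found above.
    upstream = []
    for dashboard in candidates["dashboard"]:
        upstream.extend(LINEAGE.get(dashboard, []))
    candidates["lineage"] = upstream
    return candidates


found = discover("Q4 revenue")
```

The combined dictionary is what a fusion and disambiguation step would consume next.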

Building and maintaining the index

Databricks does not disclose this in detail; a plausible operational shape:

Stage                 | Trigger
----------------------+---------------------------------------------------------------------------------------
Initial build         | First connect to a workspace
Incremental update    | Catalog change events (new table, new dashboard, schema evolution, ownership change)
Periodic full rebuild | Drift / consistency reasons (weekly, monthly)
Usage signal refresh  | Query telemetry → popularity / recency signals updated continuously

For a fast-changing workspace, the index needs to keep up with catalog changes — stale indices return stale candidates.
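One way to keep up is to apply catalog change events to the index as they arrive. The sketch below assumes a hypothetical event shape (CatalogEvent with kind, asset_id, metadata) and a toy in-memory index; it is not how Databricks describes its pipeline. It re-embeds only when the indexed text changes, so an ownership change stays a cheap metadata update.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEvent:
    kind: str        # "created", "schema_changed", "owner_changed", "deleted"
    asset_id: str
    metadata: dict = field(default_factory=dict)


class IncrementalIndex:
    def __init__(self):
        self.entries = {}             # asset_id -> structured metadata
        self.needs_embedding = set()  # assets whose text must be re-embedded

    def apply(self, event):
        if event.kind == "deleted":
            self.entries.pop(event.asset_id, None)
            self.needs_embedding.discard(event.asset_id)
            return
        entry = self.entries.setdefault(event.asset_id, {})
        entry.update(event.metadata)
        # Only events that change indexed text trigger re-embedding.
        if event.kind in ("created", "schema_changed"):
            self.needs_embedding.add(event.asset_id)


index = IncrementalIndex()
table = "finance.revenue_daily"
index.apply(CatalogEvent("created", table, {"owner": "finance", "columns": ["day", "revenue"]}))
index.needs_embedding.clear()  # pretend the embedding job has run
index.apply(CatalogEvent("owner_changed", table, {"owner": "fpa"}))
pending_after_owner_change = set(index.needs_embedding)
index.apply(CatalogEvent("schema_changed", table, {"columns": ["day", "revenue", "currency"]}))
pending_after_schema_change = set(index.needs_embedding)
```

A periodic full rebuild would still be needed to catch events that were dropped or applied out of order.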

When this fits / doesn't

Fits:

  • Workspace has a governed catalogue (Unity Catalog, DataHub, Amundsen) — relationships are queryable.
  • Heterogeneous asset types (tables + dashboards + docs).
  • Asset count high enough that a single similarity index produces too much noise.
  • Operational discipline for keeping metadata current.

Doesn't fit:

  • Ungoverned data swamps — the "rich semantic context" the pattern exploits doesn't exist; falls back to text similarity at best.
  • Small workspaces — multi-index complexity isn't justified.
  • Workspaces dominated by one asset type (just tables, just docs) — the multi-index advantage shrinks.
  • Real-time-everything workspaces where the catalog can't keep up.

Anti-patterns

  • Single big embedding index over all assets — loses schema awareness, lineage, structured metadata.
  • Treating all relationships equally — a doc referencing a table is not the same signal as a dashboard reading from a table; the pattern needs typed edges.
  • No metadata filters — top-K from similarity alone surfaces deprecated / experimental / abandoned assets.
  • Stale index — workspace evolves but index doesn't; agent retrieves outdated assets.
  • Index without telemetry — can't tune which signals matter; blind ranking.
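The typed-edges point can be made concrete with a small scoring sketch. The edge types, weights, and asset names below are hypothetical and would need tuning against telemetry.

```python
# Hypothetical edge types and weights: a dashboard reading from a table
# is stronger evidence of authority than a document mentioning it.
EDGE_WEIGHTS = {"reads_from": 1.0, "derived_from": 0.8, "mentions": 0.3}

# (source asset, edge type, target asset)
EDGES = [
    ("dash.q4_revenue", "reads_from", "finance.revenue_daily"),
    ("wiki/revenue-definition", "mentions", "finance.revenue_daily"),
    ("wiki/old-notes", "mentions", "tmp.rev_scratch"),
    ("wiki/onboarding", "mentions", "tmp.rev_scratch"),
]


def authority_scores(edges, weights):
    """Sum incoming edge weights per target asset."""
    scores = {}
    for _source, edge_type, target in edges:
        scores[target] = scores.get(target, 0.0) + weights[edge_type]
    return scores


typed = authority_scores(EDGES, EDGE_WEIGHTS)
# Treating all relationships equally: every edge counts as 1.
untyped = authority_scores(EDGES, {edge_type: 1.0 for edge_type in EDGE_WEIGHTS})
```

With equal weights both tables tie on two inbound edges; with typed weights the table that a dashboard actually reads from scores higher than the one that is only mentioned in documents.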

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of semantic-context-grounded search index as a named pattern. Genie's "Specialized Knowledge Search" — index built from existing workspace assets' rich semantic context, multiple indices in parallel, rich metadata signals, "up to 40%" table-discovery benchmark improvement (Figure 4). Positioned as the architectural response to the scale-of-data-discovery challenge unique to data agents.