SYSTEM Cited by 3 sources

Dash Search Index

Definition

Dash Search Index (referred to in the post as the "Dash universal search index" or just "the Dash index") is the pre-built, cross-source retrieval substrate that backs Dropbox Dash. Documents, messages, and content from Dropbox itself plus integrated third-party apps (Confluence, Google Docs, Jira, Slack, …) are ingested into a single unified index, and a knowledge graph (concepts/knowledge-graph) is layered on top, connecting people + activity + content across sources (Source: sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai).

Why it exists

Dash started as a conventional RAG setup — semantic + keyword search over indexed docs, one retrieval tool per source system. The agentic evolution ("please interpret / summarize / act on what you found") broke that architecture:

  • Per-app retrieval tools each carried their own schema + description, inflating the concepts/agent-context-window.
  • Cross-app questions (e.g. "status of the identity project" needs Confluence + Google Docs + Jira) forced the agent to choose among tools and merge results; in practice it "often had to call all of them, but also didn't do so reliably."
  • Tool-count growth degraded tool-selection accuracy.

Dash's answer: consolidate all retrieval surfaces behind a single index and a single retrieval tool (patterns/unified-retrieval-tool).
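
The consolidation can be sketched as a toy single tool. Everything here is illustrative, not Dropbox's actual API: the `dash_search` name echoes the post's "dash_search-class tool," but the `SOURCES` registry and per-source fetchers are invented stand-ins for the real connectors.

```python
# Toy sketch of a unified retrieval tool: per-source fetchers live behind
# ONE tool, so the agent's tool list carries a single schema instead of N.
# All names and data are hypothetical.
SOURCES = {
    "confluence": lambda q: [("conf-1", f"Confluence page about {q}")],
    "jira":       lambda q: [("jira-7", f"Jira ticket about {q}")],
    "gdocs":      lambda q: [("doc-3", f"Google Doc about {q}")],
}

def dash_search(query: str, top_k: int = 5) -> list[dict]:
    """Single entry point: the index fans out across sources server-side,
    merges, and returns one ranked list -- the agent never picks a source."""
    hits = []
    for source, fetch in SOURCES.items():
        for doc_id, title in fetch(query):
            hits.append({"source": source, "id": doc_id, "title": title})
    return hits[:top_k]  # a real index would rank before truncating
```

The point of the shape, per the post, is that source selection moves out of the agent's tool-choice step and into the index's ranking, which sidesteps the tool-count accuracy problem.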

Architecture (as described in the post)

  1. Ingest layer. Content from Dropbox + integrated third-party apps is continuously pulled into the unified index via per-app connectors (each with its own rate limits, API quirks, and ACLs). Per-content-type paths handle documents / images / PDFs / audio / video (patterns/multimodal-content-understanding): text → normalized to markdown, images → CLIP-class or full multimodal, PDFs → text + figures, audio → transcription, video → per-scene multimodal. All paths normalize into text + metadata embeddings so the downstream index is content-type-agnostic.
  2. Unified hybrid index. BM25 (lexical) + dense vectors, both populated from the same content stream. Dash explicitly treats BM25 as the primary retrieval surface, with dense vectors as additive: "we found BM25 was very effective on its own with some relevant signals. It's an amazing workhorse for building out an index." (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
  3. Knowledge graph overlay. Nodes = people / documents / events / projects, resolved to canonical IDs across apps (patterns/canonical-entity-id). Edges = authorship, collaboration, activity, references. The graph is not stored in a graph database: Dash experimented with graph DBs and rejected them (latency, query-pattern mismatch, hybrid-retrieval integration challenges). Instead, the graph is built asynchronously and flattened into "knowledge bundles" (per-entity / per-query-class pre-ranked digests), which are re-ingested through the same hybrid-index pipeline. Graph signals therefore ride on the same retrieval surface as documents, not via a separate query path.
  4. Query-time ranker. Multiple ranking passes; per-query + per-user relevance combining lexical match, vector similarity, and graph-derived signals. "Personalized and ACL'd to you." Quality is measured via NDCG; graph-driven people-based ranking produced "really nice wins" on NDCG (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash).
  5. Single retrieval interface. One dash_search-class tool (the "super tool") exposed to the Dash agent (and, via systems/dash-mcp-server, to external agents).
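
The hybrid-index step above can be sketched as a scoring function: a lexical score as the primary surface, with dense-vector and graph-derived scores as additive signals. This is a minimal illustration of the shape, not Dash's ranker — the term-overlap function is a crude stand-in for BM25, and the weights are invented.

```python
# Hedged sketch of hybrid scoring: lexical primary, dense + graph additive.
# lexical_score is a toy stand-in for BM25; weights are illustrative only.
import math
from collections import Counter

def lexical_score(query: str, doc: str) -> float:
    """Term-frequency overlap with length normalization (BM25 stand-in)."""
    q_terms = query.lower().split()
    d_terms = Counter(doc.lower().split())
    overlap = sum(d_terms[t] for t in q_terms)
    return overlap / math.sqrt(len(doc.split()) or 1)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, q_vec, d_vec, graph_boost=0.0,
                 w_lex=1.0, w_dense=0.3, w_graph=0.2):
    # Lexical dominates (the BM25 "workhorse"); vectors and graph signals
    # (e.g. authorship proximity to the querying user) add on top.
    return (w_lex * lexical_score(query, doc)
            + w_dense * cosine(q_vec, d_vec)
            + w_graph * graph_boost)
```

The design choice this mirrors: because both signal families populate the same index, one ranking pass can blend them per-query and per-user instead of merging results from separate lexical and vector systems.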

Why build the graph in advance

"Building the index and graph in advance means Dash can focus on retrieval at runtime instead of rebuilding context, which makes the whole process faster and more efficient."

This is the load-bearing trade for context-engineered agent surfaces: pre-filter offline so runtime retrieval returns relevance-ranked slices directly into the context window, minimizing both latency and tokens.
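
The offline step can be illustrated with a toy bundle builder: graph edges are flattened into plain-text digests that re-enter the normal index pipeline, so no graph traversal happens at query time. The edge tuples, entity IDs, and bundle format here are all hypothetical.

```python
# Hedged sketch of "knowledge bundles": per-entity graph facts flattened
# offline into text digests, which are then indexed like any other document.
# Entity IDs, relations, and the bundle format are invented for illustration.
GRAPH_EDGES = [
    ("person:ava", "authored", "doc:identity-design"),
    ("person:ava", "collaborates-with", "person:ben"),
    ("doc:identity-design", "references", "ticket:IDN-42"),
]

def build_bundle(entity: str) -> str:
    """Flatten one entity's edges into a pre-ranked text digest (offline)."""
    facts = [f"{s} {rel} {t}" for s, rel, t in GRAPH_EDGES
             if s == entity or t == entity]
    return f"[bundle:{entity}] " + "; ".join(facts)

# Bundles become ordinary index documents: graph signals ride the same
# retrieval surface, with no graph DB on the query path.
bundles = {e: build_bundle(e)
           for e in ("person:ava", "doc:identity-design")}
```

At query time, a search for "who wrote the identity design doc" can hit the `person:ava` bundle through plain hybrid retrieval, which is the latency/query-pattern rationale the post gives for rejecting a graph database.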

Formalized as patterns/precomputed-relevance-graph.

Relationship to other Dash systems

Caveats

  • Details sparse. The post is principles-level; no numbers on index size, query latency, graph scale, freshness budgets, or ACL-enforcement architecture.
  • ACL story not described. A unified cross-source index must enforce each source system's access control; the post doesn't say how.
  • Relevance evaluation not described. The ranker's quality is product-critical but not discussed beyond the NDCG mentions; the prior semantic-search post covers earlier-generation methodology.

Seen in

Last updated · 200 distilled / 1,178 read