Knowledge graph¶
A knowledge graph is a data structure that captures relationships between entities (people, documents, events, projects, activities) rather than just their individual contents. In a retrieval or agentic context it serves as the relevance substrate: the graph's edges encode how entities relate, and candidates are ranked by graph distance, relationship type, and user-centric edges, not just by lexical or semantic match against content.
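As a minimal sketch of "ranked by graph distance" (all entity IDs, edge types, and the toy graph below are hypothetical, not from the Dash post):

```python
from collections import deque

# Hypothetical toy graph: typed, directed edges between entity IDs.
EDGES = {
    "user:ana":    [("manages", "user:bo"), ("edited", "doc:roadmap")],
    "user:bo":     [("edited", "doc:budget")],
    "doc:roadmap": [("cites", "doc:budget")],
    "doc:budget":  [],
}

def graph_distance(start: str, target: str) -> float:
    """BFS hop count from start to target; inf if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for _, nbr in EDGES.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("inf")

def rank(user: str, candidates: list[str]) -> list[str]:
    # Fewer hops from the querying user == more relevant.
    return sorted(candidates, key=lambda doc: graph_distance(user, doc))

rank("user:ana", ["doc:budget", "doc:roadmap"])
# -> ["doc:roadmap", "doc:budget"]: ana edited roadmap directly (1 hop),
#    while budget is only reachable through bo or roadmap (2 hops).
```

A production system would weight edges by relationship type and recency rather than treating all hops equally; this only illustrates the shape of the idea.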
Named instance in this wiki: Dash¶
Dropbox's Dash builds one universal index across Dropbox + integrated third-party sources and layers a knowledge graph on top (Source: sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai):
"A knowledge graph maps relationships between these sources so the system can understand how different pieces of information are connected. These relationships help rank results based on what matters most for each query and each user."
The graph is built in advance, not computed per-query — an explicit precomputation-vs-runtime trade made to minimise concepts/agent-context-window consumption at retrieval time (patterns/precomputed-relevance-graph).
Why agents benefit specifically¶
A text-similarity-only retriever returns things that look like the query. A knowledge-graph-ranked retriever returns things that matter to this user right now:
- People edges — the user's team, direct reports, the person who last edited this doc.
- Activity edges — what the user viewed / edited / shared recently; which docs their collaborators touched.
- Content edges — which docs reference each other, which projects share a DRI.
For an agent, this means the retrieved slice is already pre-filtered for relevance, so the agent spends its context budget reasoning about a few good candidates instead of sifting through a long semantic-match list.
Design implications¶
- Offline graph builder. Ingestion must continuously populate and maintain edges (people, activity, and content change constantly).
- User-centric ranking. Many queries are meaningful only under a specific user identity; the graph has to be query-time personalized, not globally static.
- Access control on edges. Edges must respect source-system ACLs — "this doc cites that one" is only a legitimate edge if the user can see both.
- Freshness vs cost. Graph updates are expensive; staleness budgets are domain-specific (minutes for activity, hours for content, longer for org structure).
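The access-control implication above can be sketched as an edge filter: an edge is visible only if the user can see both endpoints. Everything here (the `can_view` callback, the per-entity allow lists, the entity IDs) is an assumed stand-in for real source-system ACLs:

```python
# Hypothetical ACL check: an edge is a legitimate relevance signal for a
# user only if the user can see BOTH endpoints in the systems of record.
def visible_edges(user, edges, can_view):
    """edges: iterable of (src, relation, dst); can_view(user, entity) -> bool."""
    return [
        (src, rel, dst)
        for src, rel, dst in edges
        if can_view(user, src) and can_view(user, dst)
    ]

# Toy policy: per-entity allow lists standing in for source-system ACLs.
ACL = {"doc:a": {"ana", "bo"}, "doc:b": {"ana"}}
can_view = lambda user, ent: user in ACL.get(ent, set())

edges = [("doc:a", "cites", "doc:b")]
visible_edges("ana", edges, can_view)  # edge survives: ana sees both docs
visible_edges("bo", edges, can_view)   # empty: bo cannot see doc:b
```

Filtering at edge level matters because leaking the edge itself ("this doc cites that one") can reveal the existence of a document the user is not allowed to see.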
Distinction from vector-index-only retrieval¶
A vector index answers "what looks like this?". A knowledge graph answers "what is connected to this, and how?". Production retrieval systems typically combine both — vector search finds semantic matches; graph edges + query-time user context rank them. Dash explicitly combines the two: "combine data from multiple sources into one unified index, then layered a knowledge graph on top."
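A minimal sketch of that combination, with assumed boost weights and a fabricated score scale (none of this reflects Dash's actual ranking function):

```python
# Hypothetical fusion: vector search proposes candidates with semantic
# scores; edges from the querying user's graph neighborhood add boosts.
EDGE_BOOST = {"edited": 0.3, "shared_with_me": 0.2, "cites": 0.1}  # assumed weights

def fuse(semantic_hits, user_edges):
    """semantic_hits: {doc_id: similarity}; user_edges: {doc_id: [relation, ...]}."""
    def score(doc):
        boost = sum(EDGE_BOOST.get(rel, 0.0) for rel in user_edges.get(doc, []))
        return semantic_hits[doc] + boost
    return sorted(semantic_hits, key=score, reverse=True)

hits = {"doc:faq": 0.82, "doc:plan": 0.78}
edges = {"doc:plan": ["edited", "cites"]}  # user recently edited doc:plan
fuse(hits, edges)
# -> ["doc:plan", "doc:faq"]: the weaker semantic match wins because
#    the user's graph edges boost it (0.78 + 0.4 vs 0.82 + 0.0).
```

The point of the example is the division of labor: the vector index decides what enters the candidate set; the graph decides the order within it.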
Seen in¶
- sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — Dash's knowledge graph over the unified search index as the relevance substrate for agent-driven retrieval.
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — companion Dropbox post adding three implementation insights absent from the context-engineering post:
  - Canonical entity IDs across apps (patterns/canonical-entity-id) are the load-bearing graph primitive; per-source identities get resolved to one canonical ID per person / doc / project. "Every app… has its own concept or definition of people, and so coming up with a canonical ID for who someone is is very, very impactful for us overall." Dash reports measured NDCG wins from "just the people-based result."
  - Not stored in a graph database. Dash experimented with graph DBs and found "the latency and query pattern were a challenge. Trying to figure out that hybrid retrieval was a challenge." Instead, they build the relationships asynchronously and flatten them into "knowledge bundles" — summary embeddings / contextual digests — which are then fed through the same index pipeline as the rest of the content (hybrid BM25 + vector chunks + embeddings). "It's not necessarily a graph, but think of it almost like an embedding—like a summary of that graph."
  - Token savings at runtime. In the MCP fixes section, knowledge graphs are named explicitly as a token-efficiency lever: "Modeling data within knowledge graphs can significantly cut our token usage as well, because you're really just getting the most relevant information for the query."
⚠️ Architecture detail: why not a graph database?¶
Dropbox experimented with graph DBs for the knowledge graph and rejected them (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash). Three stated reasons:
- Latency. Graph traversal at Dash's query rates didn't meet the product's latency budget.
- Query pattern mismatch. The agentic query load doesn't look like classic graph-DB workloads; traversal-heavy queries performed poorly.
- Hybrid retrieval integration. Joining graph-DB results with the existing BM25 + vector index was "a challenge."
Their alternative: build the graph in an asynchronous pipeline, produce per-entity / per-query-class knowledge bundles (think: pre-computed contextual digests), and feed those bundles into the same index pipeline as documents. At query time, the agent sees top-K bundle-and-doc hits — no separate graph query path. This is a key caveat for anyone considering graph-DB-for-RAG: Dash's production answer is "graph at ingest, flatten before query."
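The "flatten before query" move can be sketched as serializing an entity's graph neighborhood into a flat text digest that goes through the ordinary embed-and-index path. The graph contents, digest format, and entity IDs below are all invented for illustration; the source does not specify the bundle format:

```python
# Hypothetical "graph at ingest, flatten before query": turn an entity's
# neighborhood into one flat digest, indexed like any other document.
GRAPH = {
    "doc:roadmap": [
        ("edited_by", "user:ana"),
        ("cites", "doc:budget"),
        ("belongs_to", "project:q3"),
    ],
}

def knowledge_bundle(entity: str) -> str:
    """One digest string per entity; downstream it is embedded and stored
    alongside ordinary document chunks -- no graph traversal at query time."""
    facts = [f"{entity} --{rel}--> {dst}" for rel, dst in GRAPH.get(entity, [])]
    return f"Entity {entity}. Relationships: " + "; ".join(facts)

bundle = knowledge_bundle("doc:roadmap")
# The async builder would re-emit this bundle whenever the entity's
# edges change, keeping the index fresh without a separate graph store.
```

This matches the source's framing that the artifact "is not necessarily a graph" at serving time: what the retriever sees is a summary of the graph, in the same shape as everything else in the index.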
Related¶
- patterns/precomputed-relevance-graph — the pattern framing: build graph + ranking offline, not at query time.
- patterns/canonical-entity-id — entity-resolution primitive that makes the graph's nodes coherent across sources.
- systems/dash-search-index — Dash's unified index + graph.
- concepts/hybrid-retrieval-bm25-vectors — the retrieval substrate the graph's "knowledge bundles" get fused into.
- concepts/ndcg — ranking metric Dash reports graph-derived wins on.
- concepts/context-engineering — parent discipline; knowledge graph is one tactic for filtering context to relevance.