CONCEPT Cited by 4 sources

Knowledge graph

A knowledge graph is a data structure that captures relationships between entities (people, documents, events, projects, activities) rather than just their individual contents. In a retrieval or agentic context it serves as the relevance substrate: the graph's edges encode how entities relate, and results are ranked by graph distance, relationship type, and user-centric edges, not just by lexical or semantic match against content.

Named instance in this wiki: Dash

Dropbox's Dash builds one universal index across Dropbox + integrated third-party sources and layers a knowledge graph on top (Source: sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai):

"A knowledge graph maps relationships between these sources so the system can understand how different pieces of information are connected. These relationships help rank results based on what matters most for each query and each user."

The graph is built in advance, not computed per-query — an explicit precomputation-vs-runtime trade made to minimise concepts/agent-context-window consumption at retrieval time (patterns/precomputed-relevance-graph).

Why agents benefit specifically

A text-similarity-only retriever returns things that look like the query. A knowledge-graph-ranked retriever returns things that matter for the querier right now:

  • People edges — the user's team, direct reports, the person who last edited this doc.
  • Activity edges — what the user viewed / edited / shared recently; which docs their collaborators touched.
  • Content edges — which docs reference each other, which projects share a DRI.

For an agent, this means the retrieved slice is already pre-filtered for relevance, so the agent spends its context budget reasoning about a few good candidates instead of sifting through a long semantic-match list.
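A minimal sketch of ranking by edge type, assuming a toy in-memory edge list. The `EDGE_WEIGHTS` values, the `Edge` class, and the entity IDs are all hypothetical; the source does not describe how Dash actually scores edges.

```python
from dataclasses import dataclass

# Hypothetical edge-type weights; a real system would tune or learn these.
EDGE_WEIGHTS = {"people": 3.0, "activity": 2.0, "content": 1.0}

@dataclass(frozen=True)
class Edge:
    src: str   # entity ID, e.g. a user or a document
    dst: str
    kind: str  # why the two entities are connected: "people", "activity", "content"

def graph_boost(user, doc, edges):
    """Sum the weights of edges that directly connect the user to the doc."""
    return sum(
        EDGE_WEIGHTS.get(e.kind, 0.0)
        for e in edges
        if {e.src, e.dst} == {user, doc}
    )

def rerank(user, candidates, edges):
    """Order candidate doc IDs by graph boost, highest first (ties keep input order)."""
    return sorted(candidates, key=lambda d: graph_boost(user, d, edges), reverse=True)

edges = [
    Edge("alice", "doc-roadmap", "activity"),  # alice edited it recently
    Edge("alice", "doc-roadmap", "people"),    # last edited by a teammate
    Edge("alice", "doc-notes", "content"),     # referenced by one of her docs
]
ranked = rerank("alice", ["doc-random", "doc-notes", "doc-roadmap"], edges)
print(ranked)
```

The document with both an activity and a people edge moves to the front, and the unconnected one falls to the back, even though nothing about the text of the documents was consulted.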

Design implications

  • Offline graph builder. Ingests must populate + maintain edges continuously (people / activity / content update constantly).
  • User-centric ranking. Many queries are meaningful only under a specific user identity; ranking over the graph has to be personalized at query time, not globally static.
  • Access control on edges. Edges must respect source-system ACLs — "this doc cites that one" is only a legitimate edge if the user can see both.
  • Freshness vs cost. Graph updates are expensive; staleness budgets are domain-specific (minutes for activity, hours for content, longer for org structure).
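Two of the implications above, ACL-aware edges and per-kind staleness budgets, can be sketched as simple filters. This is an illustration only: the dictionary-shaped edges, the `readable_ids` set, and the concrete budget values are assumptions chosen to match the rough orders of magnitude in the list.

```python
from datetime import datetime, timedelta, timezone

# Illustrative staleness budgets per edge kind; the exact values are assumptions.
STALENESS_BUDGET = {
    "activity": timedelta(minutes=5),
    "content": timedelta(hours=6),
    "org": timedelta(days=7),
}

def visible_edges(edges, readable_ids):
    """Keep only edges whose two endpoints the user is allowed to read."""
    return [e for e in edges if e["src"] in readable_ids and e["dst"] in readable_ids]

def is_stale(edge, now):
    """True when the edge is older than the budget for its kind."""
    return now - edge["updated_at"] > STALENESS_BUDGET[edge["kind"]]

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
edges = [
    {"src": "doc-a", "dst": "doc-b", "kind": "content", "updated_at": now - timedelta(hours=1)},
    {"src": "doc-a", "dst": "doc-secret", "kind": "content", "updated_at": now - timedelta(hours=1)},
    {"src": "alice", "dst": "doc-a", "kind": "activity", "updated_at": now - timedelta(minutes=30)},
]

# The edge touching doc-secret is dropped because the user cannot read that doc.
allowed = visible_edges(edges, readable_ids={"alice", "doc-a", "doc-b"})
# The 30-minute-old activity edge exceeds its 5-minute budget and needs a refresh.
stale = [e for e in allowed if is_stale(e, now)]
```

Filtering on both endpoints means an edge never leaks the existence of a document the user cannot open, which is the failure mode the access-control bullet warns about.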

Distinction from vector-index-only retrieval

A vector index answers "what looks like this?". A knowledge graph answers "what is connected to this, and how?". Production retrieval systems typically combine both — vector search finds semantic matches; graph edges + query-time user context rank them. Dash explicitly combines the two: "combine data from multiple sources into one unified index, then layered a knowledge graph on top."
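The two-stage combination can be sketched as follows, assuming tiny hand-written embeddings and a precomputed per-document graph boost. The function name `hybrid_search`, the blending weight `alpha`, and the data are hypothetical; the source does not specify how Dash blends the two signals.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query_vec, doc_vecs, graph_boost, k=2, alpha=0.5):
    """Stage 1: shortlist the top-k docs by cosine similarity.
    Stage 2: reorder the shortlist by a blend of similarity and graph boost."""
    shortlist = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:k]

    def score(d):
        return (1 - alpha) * cosine(query_vec, doc_vecs[d]) + alpha * graph_boost.get(d, 0.0)

    return sorted(shortlist, key=score, reverse=True)

doc_vecs = {"doc-a": [1.0, 0.0], "doc-b": [0.9, 0.1], "doc-c": [0.0, 1.0]}
graph_boost = {"doc-b": 1.0}  # e.g. doc-b was recently edited by the user's team
results = hybrid_search([1.0, 0.0], doc_vecs, graph_boost)
print(results)
```

Here `doc-a` is the closest semantic match, but `doc-b` is nearly as close and carries a graph boost, so it ranks first; `doc-c` never makes the shortlist.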

Seen in

  • sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — Dash's knowledge graph over the unified search index as the relevance substrate for agent-driven retrieval.
  • sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — companion Dropbox post adding three implementation insights absent from the context-engineering post:
      1. Canonical entity IDs across apps (patterns/canonical-entity-id) are the load-bearing graph primitive; per-source identities get resolved to one canonical ID per person / doc / project. "Every app… has its own concept or definition of people, and so coming up with a canonical ID for who someone is is very, very impactful for us overall." Dash reports measured NDCG wins from "just the people-based result."
      2. Not stored in a graph database. Dash experimented with graph DBs and found "the latency and query pattern were a challenge. Trying to figure out that hybrid retrieval was a challenge." Instead, they build the relationships asynchronously and flatten them into "knowledge bundles" (summary embeddings / contextual digests), which are then fed through the same index pipeline as the rest of the content (hybrid BM25 + vector chunks + embeddings). "It's not necessarily a graph, but think of it almost like an embedding—like a summary of that graph."
      3. Token savings at runtime. In the MCP fixes section, knowledge graphs are named explicitly as a token-efficiency lever: "Modeling data within knowledge graphs can significantly cut our token usage as well, because you're really just getting the most relevant information for the query."
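The canonical-ID idea can be illustrated with a small resolver. This is a sketch under a strong assumption: identities are matched on a normalized email address. The source does not say how Dash resolves identities, and the app names, local IDs, and `person:N` format are made up.

```python
def build_canonical_ids(accounts):
    """Map each (app, local_id) pair to one canonical person ID.

    Assumption: two accounts belong to the same person when their
    emails match after trimming whitespace and lowercasing.
    """
    canonical = {}
    by_email = {}
    for app, local_id, email in accounts:
        key = email.strip().lower()
        if key not in by_email:
            by_email[key] = f"person:{len(by_email) + 1}"
        canonical[(app, local_id)] = by_email[key]
    return canonical

accounts = [
    ("slack", "U123", "Ana@Example.com"),
    ("gdrive", "ana.g", "ana@example.com"),
    ("jira", "bob-7", "bob@example.com"),
]
ids = build_canonical_ids(accounts)
print(ids[("slack", "U123")], ids[("gdrive", "ana.g")], ids[("jira", "bob-7")])
```

Once every per-app identity points at one canonical ID, people edges from different apps attach to the same node instead of fragmenting across several.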

Architecture detail: why not a graph database?

Dropbox experimented with graph DBs for the knowledge graph and rejected them (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash). Three stated reasons:

  1. Latency. Graph traversal at Dash's query rates didn't meet the product's latency budget.
  2. Query pattern mismatch. The agentic query load doesn't look like classic graph-DB workloads; traversal-heavy queries performed poorly.
  3. Hybrid retrieval integration. Joining graph-DB results with the existing BM25 + vector index was "a challenge."

Their alternative: build the graph in memory / async pipeline, produce per-entity / per-query-class knowledge bundles (think: pre-computed contextual digests), feed those bundles into the same index pipeline as documents. At query time, the agent sees top-K bundle-and-doc hits — no separate graph query path. This is a key caveat for anyone considering graph-DB-for-RAG: Dash's production answer is "graph at ingest, flatten before query."
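A rough sketch of "graph at ingest, flatten before query", assuming the bundle is a plain-text digest of an entity's outgoing edges. The `build_bundle` helper, the triple format, and the labels are hypothetical; Dash's actual bundles are described only as summary embeddings or contextual digests.

```python
def build_bundle(entity, edges, labels, max_facts=3):
    """Flatten an entity's outgoing edges into a short text digest
    that can be indexed like any other document."""
    facts = [
        f"{labels[src]} {relation} {labels[dst]}"
        for src, relation, dst in edges
        if src == entity
    ][:max_facts]
    return {"id": f"bundle:{entity}", "text": ". ".join(facts) + "."}

labels = {"p1": "Ana", "d1": "Q3 roadmap", "d2": "Launch notes", "proj": "Project Atlas"}
edges = [
    ("p1", "edited", "d1"),
    ("p1", "is DRI of", "proj"),
    ("d1", "references", "d2"),
]
bundle = build_bundle("p1", edges, labels)
print(bundle["text"])
```

The resulting record has no graph structure left in it; it is just text that the existing BM25 and vector pipeline can chunk, embed, and retrieve, so no separate graph query path is needed at query time.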

Second wiki framing: enterprise data-integration substrate (Netflix UDA)

Where Dash uses the knowledge graph as a retrieval relevance substrate for agents, Netflix's UDA — Unified Data Architecture uses a knowledge graph as an enterprise data-integration substrate for schemas and pipelines (Source: sources/2025-06-14-netflix-model-once-represent-everywhere-uda).

"We needed a data catalog unified with a schema registry, but with a hard requirement for semantic integration. Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a knowledge graph approach." — Netflix, UDA post

In UDA the graph's nodes are:

  • Business concepts (actor, movie, asset), authored as domain models in the Upper metamodel.
  • System domains — GraphQL, Avro, Data Mesh, Mappings — modelled in the same language as business concepts.
  • Data containers — the federated-GraphQL entities, Data Mesh sources, Iceberg tables, Java APIs where instance data lives.
  • Mappings — edges connecting domain concepts to data containers.

The graph is structurally named-graph-first: every named graph conforms to a governing named graph, all the way up to the self-hosting Upper metamodel (patterns/self-referencing-metamodel-bootstrap). The info model itself is in the graph.

What the UDA deployment buys:

  • Both schema registry + data catalog at once. One substrate holds the schemas and the mappings to live data — no separate systems to keep in sync.
  • Schema generation. Domain models transpile into GraphQL, Avro, SQL, RDF, and Java via a transpiler family (patterns/schema-transpilation-from-domain-model).
  • Pipeline auto-provisioning. Data-movement pipelines (GraphQL → Data Mesh, CDC → Iceberg) are generated from the mapping graph + system domains.
  • Self-service analytics. Sphere walks the graph from business-concept nodes to warehouse containers and generates SQL (patterns/graph-walk-sql-generation).
  • Control-plane promotion. "The conceptual model must become part of the control plane" — the knowledge graph is load-bearing, not documentation.
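The schema-generation bullet can be illustrated with a toy transpiler that emits a GraphQL type and an Avro record from one domain model. This is a heavily simplified sketch: the `MOVIE` dictionary stands in for a domain model and is not Upper syntax, and `to_graphql` / `to_avro` are hypothetical helpers, not UDA's transpilers.

```python
# Hypothetical, heavily simplified domain model; UDA's Upper metamodel is far richer.
MOVIE = {"name": "Movie", "fields": {"title": "string", "releaseYear": "int"}}

# Mapping from the toy model's primitive types to GraphQL scalar names.
GRAPHQL_TYPES = {"string": "String", "int": "Int"}

def to_graphql(model):
    """Render the model as a GraphQL SDL object type."""
    lines = [f"type {model['name']} {{"]
    lines += [f"  {name}: {GRAPHQL_TYPES[t]}" for name, t in model["fields"].items()]
    lines.append("}")
    return "\n".join(lines)

def to_avro(model):
    """Render the model as an Avro record schema (as a Python dict)."""
    return {
        "type": "record",
        "name": model["name"],
        "fields": [{"name": n, "type": t} for n, t in model["fields"].items()],
    }

sdl = to_graphql(MOVIE)
avro = to_avro(MOVIE)
print(sdl)
```

Because both outputs derive from the same model, renaming a field in one place changes every generated schema together, which is the "model once, represent everywhere" property the post describes.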

Key contrast with the Dash framing:

| Dimension | Dash (retrieval) | UDA (integration) |
| --- | --- | --- |
| Load | Per-query agent retrieval | Offline + continuous schema / pipeline generation |
| Stored as | Flattened "knowledge bundles" via the same BM25+vector index | RDF triples in named graphs with governance |
| Query | Vector + bundle lookup (the graph is compiled out before query) | SPARQL / Java API / federated GraphQL / generated SQL via graph walk |
| Scale concern | Latency at query time | Correctness + modularity of mappings |
| Pain point addressed | Relevance + personalization of retrieval | Duplicated + inconsistent models across many data systems |
| Substrate | Not a graph DB (flattened bundles) | RDF + SHACL + Upper's restricted subset |

Both framings are legitimate; they address non-overlapping engineering problems with the same data structure.

Third wiki framing: MDM data-model-definition substrate (Zalando)

A third use of the same data structure appears in Zalando's 2021-07-28 MDM post: the knowledge graph as the authoring substrate for the logical data model of a golden record in MDM.

"By using knowledge graphs for a live-data representation of all systems' logical data models and how they map to a semantic layer of business concepts, we are able to automatically generate the logical data model of the golden record inside the knowledge graph with additional information on how it connects to systems' data model." — Zalando MDM post

The graph's nodes are System, Table, Column, Concept, Attribute, and Relationship. Domain experts author the column → concept / attribute / relationship mappings; a Python generator walks the graph and emits both the logical data model of the golden record and the per-source-system transformation data model. The graph is stored + visualised in Neo4j; the visualisation is the primary business-engineering communication artifact (patterns/visual-graph-for-business-engineering-alignment).
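A minimal sketch of such a generator, assuming the authored mappings are available as a plain dictionary rather than read from Neo4j. The system, table, column, and concept names are invented, and `golden_record_model` is a hypothetical helper, not Zalando's code.

```python
from collections import defaultdict

# (system, table, column) -> (concept, attribute); all names are made up.
MAPPINGS = {
    ("crm", "customers", "email_addr"): ("Customer", "email"),
    ("shop", "users", "mail"): ("Customer", "email"),
    ("shop", "users", "full_name"): ("Customer", "name"),
}

def golden_record_model(mappings):
    """Group mapped source columns by concept and attribute.

    The keys form the golden record's logical data model; the values
    record which source columns feed each attribute.
    """
    model = defaultdict(lambda: defaultdict(list))
    for (system, table, column), (concept, attribute) in sorted(mappings.items()):
        model[concept][attribute].append(f"{system}.{table}.{column}")
    return {concept: dict(attrs) for concept, attrs in model.items()}

model = golden_record_model(MAPPINGS)
print(model)
```

The same walk yields both artifacts the post mentions: the set of concepts and attributes is the golden-record model, and the column lists are the starting point for per-source transformation models.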

Key contrast with the two earlier framings:

| Dimension | Dash (retrieval) | UDA (integration) | Zalando MDM (data modeling) |
| --- | --- | --- | --- |
| Load | Per-query agent retrieval | Offline + continuous schema / pipeline generation | Design-time schema + mapping generation |
| Stored as | Flattened "knowledge bundles" via the same BM25+vector index | RDF triples in named graphs with governance | Neo4j property graph |
| Scale | Many docs per user; query-rate sensitive | Enterprise; many domains, many consumers | Tens of tables, hundreds of columns |
| Tooling | Hybrid BM25+vector index | RDF + SHACL + Upper metamodel | Neo4j property graph |
| Output | Ranked retrieval results for agents | Generated schemas + auto-provisioned pipelines | Golden-record logical data model + transformation data model |
| Pain point addressed | Relevance + personalization of retrieval | Duplicated + inconsistent models across many data systems | Business-engineering communication gap; manual diagram maintenance |
| Ceremony level | Low (flattened bundles, no graph DB) | High (semantic-web stack) | Medium (property graph, no RDF/SHACL) |

All three framings treat the graph as the modeling substrate, but pick different storage + query strategies depending on whether the load is at query time (Dash), schema generation time (UDA), or design time (Zalando).

Fourth wiki framing: Trust & Safety identity-resolution substrate (Airbnb)

Airbnb's knowledge graph infrastructure uses the graph structure as a Trust & Safety substrate — specifically for identity resolution and relationship understanding at massive scale (7B nodes, 11B edges). Unlike Dash (retrieval), UDA (integration), or Zalando (MDM), the Airbnb framing is an OLTP graph workload with strict latency requirements on 4–8 hop traversals for fraud detection and linked-account identification (Source: sources/2026-05-19-airbnb-scaling-identity-graph-unified-knowledge-graph-infrastructure).

Key contrast with other framings:

| Dimension | Dash (retrieval) | UDA (integration) | Zalando MDM | Airbnb (identity resolution) |
| --- | --- | --- | --- | --- |
| Load | Per-query agent retrieval | Offline schema/pipeline gen | Design-time modeling | Real-time OLTP (4–8 hop traversals) |
| Scale | Many docs per user | Enterprise domains | Tens of tables | 7B nodes, 11B edges, +5M edges/day |
| Stored as | Flattened bundles | RDF named graphs | Neo4j property graph | JanusGraph + DynamoDB |
| Key challenge | Relevance + personalization | Model consistency | Business-engineering gap | Long-tail latency on dense subgraphs |

The Airbnb instance is the wiki's first canonical disclosure of a knowledge graph as a graph-database-scale OLTP workload with explicit high-fanout latency challenges.
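The multi-hop traversals in this framing can be illustrated with a hop-bounded breadth-first search over an in-memory adjacency map. This is a sketch of the traversal shape only: it does not use JanusGraph or DynamoDB, and the account, device, and card IDs are invented.

```python
from collections import deque

def within_hops(adjacency, start, max_hops):
    """Return {node: hop_distance} for every node reachable within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # hop budget reached: do not expand this node's neighbors
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen

# Toy identity graph: accounts linked through a shared device and a shared card.
adjacency = {
    "acct-1": ["device-x"],
    "device-x": ["acct-1", "acct-2"],
    "acct-2": ["device-x", "card-9"],
    "card-9": ["acct-2", "acct-3"],
    "acct-3": ["card-9"],
}
linked = within_hops(adjacency, "acct-1", max_hops=2)
print(linked)
```

With a budget of two hops only `acct-2` is linked to `acct-1`; raising it to four also reaches `acct-3` through the shared card. On a dense subgraph the frontier grows with every hop, which is where the long-tail latency concern comes from.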
