CONCEPT Cited by 4 sources

Knowledge graph¶

A knowledge graph is a data structure that captures relationships between entities (people, documents, events, projects, activities) rather than just their individual contents. In a retrieval or agentic context it serves as the relevance substrate: the graph's edges encode how entities relate, and results are ranked by graph distance, relationship type, and user-centric edges, not just by lexical or semantic match against content.

Named instance in this wiki: Dash¶

Dropbox's Dash builds one universal index across Dropbox + integrated third-party sources and layers a knowledge graph on top (Source: sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai):

"A knowledge graph maps relationships between these sources so the system can understand how different pieces of information are connected. These relationships help rank results based on what matters most for each query and each user."

The graph is built in advance, not computed per-query — an explicit precomputation-vs-runtime trade made to minimise concepts/agent-context-window consumption at retrieval time (patterns/precomputed-relevance-graph).

Why agents benefit specifically¶

A text-similarity-only retriever returns things that look like the query. A knowledge-graph-ranked retriever returns things that matter for the querier right now:

People edges — the user's team, direct reports, the person who last edited this doc.
Activity edges — what the user viewed / edited / shared recently; which docs their collaborators touched.
Content edges — which docs reference each other, which projects share a DRI.

For an agent, this means the retrieved slice is already pre-filtered for relevance, so the agent spends its context budget reasoning about a few good candidates instead of sifting through a long semantic-match list.
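The difference between similarity-only and graph-ranked retrieval can be sketched as a re-ranking step. This is a minimal illustration, not Dash's ranking function: the `EDGE_WEIGHTS` values, the `Candidate` type, and the additive scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical edge weights; a real system would tune or learn these.
EDGE_WEIGHTS = {"people": 0.3, "activity": 0.5, "content": 0.2}


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    semantic_score: float  # similarity from a vector or lexical retriever


def rerank(candidates, user_edges, top_k=3):
    """Rank by semantic score plus a boost per edge type linking user and doc.

    user_edges maps doc_id -> set of edge types ("people", "activity", "content").
    """
    def score(c):
        boost = sum(EDGE_WEIGHTS[e] for e in user_edges.get(c.doc_id, ()))
        return c.semantic_score + boost

    return sorted(candidates, key=score, reverse=True)[:top_k]


candidates = [
    Candidate("q3-plan-template", 0.90),
    Candidate("q3-plan-my-team", 0.70),
    Candidate("q3-plan-archive", 0.80),
]
# The querying user's team owns one doc and the user edited it recently.
user_edges = {"q3-plan-my-team": {"people", "activity"}}

ranked = rerank(candidates, user_edges, top_k=2)
print([c.doc_id for c in ranked])
```

The team's own plan has the lowest semantic score but ranks first once the people and activity edges are counted, which is the behaviour the paragraph above describes.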

Design implications¶

Offline graph builder. Ingestion must populate and maintain edges continuously, because people, activity, and content change constantly.
User-centric ranking. Many queries are meaningful only under a specific user identity; the graph is built offline, but ranking over it has to be personalized at query time, not globally static.
Access control on edges. Edges must respect source-system ACLs — "this doc cites that one" is only a legitimate edge if the user can see both.
Freshness vs cost. Graph updates are expensive; staleness budgets are domain-specific (minutes for activity, hours for content, longer for org structure).
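The access-control implication above can be illustrated with a small filter. This is a sketch under assumptions: the `Edge` type, the `visible_edges` helper, and the idea that readable documents arrive as a precomputed set are all hypothetical, and a production system would resolve permissions against each source system's ACLs.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    kind: str


def visible_edges(edges, readable_docs):
    """Keep only edges whose two endpoints the user is allowed to read."""
    return [e for e in edges if e.src in readable_docs and e.dst in readable_docs]


edges = [
    Edge("design-doc", "launch-plan", "cites"),
    Edge("design-doc", "salary-review", "cites"),
]
alice_can_read = {"design-doc", "launch-plan"}

alice_edges = visible_edges(edges, alice_can_read)
print(alice_edges)
```

Dropping the second edge matters even though Alice can read `design-doc`: surfacing the edge would reveal that a document she cannot see exists and is cited.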

Distinction from vector-index-only retrieval¶

A vector index answers "what looks like this?". A knowledge graph answers "what is connected to this, and how?". Production retrieval systems typically combine both — vector search finds semantic matches; graph edges + query-time user context rank them. Dash explicitly combines the two: "combine data from multiple sources into one unified index, then layered a knowledge graph on top."

Seen in¶

sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — Dash's knowledge graph over the unified search index as the relevance substrate for agent-driven retrieval.
sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — companion Dropbox post adding three implementation insights absent from the context-engineering post:

(1) Canonical entity IDs across apps (patterns/canonical-entity-id) are the load-bearing graph primitive; per-source identities get resolved to one canonical ID per person / doc / project. "Every app… has its own concept or definition of people, and so coming up with a canonical ID for who someone is is very, very impactful for us overall." Dash reports measured NDCG wins from "just the people-based result."
(2) Not stored in a graph database. Dash experimented with graph DBs and found "the latency and query pattern were a challenge. Trying to figure out that hybrid retrieval was a challenge." Instead, they build the relationships asynchronously and flatten them into "knowledge bundles" — summary embeddings / contextual digests — which are then fed through the same index pipeline as the rest of the content (hybrid BM25 + vector chunks + embeddings). "It's not necessarily a graph, but think of it almost like an embedding—like a summary of that graph."
(3) Token savings at runtime. In the MCP fixes section, knowledge graphs are named explicitly as a token-efficiency lever: "Modeling data within knowledge graphs can significantly cut our token usage as well, because you're really just getting the most relevant information for the query."
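The canonical-ID idea in insight (1) can be sketched as a small resolver. The source does not say how Dash matches identities, so matching on a normalized email address is purely an assumption here, as are the `IdentityResolver` class and the example app names.

```python
import uuid


class IdentityResolver:
    """Maps (source_app, local_id) pairs to one canonical person ID.

    Matching on a normalized email is an assumption made for this sketch.
    """

    def __init__(self):
        self._by_email = {}
        self._by_source = {}

    def resolve(self, source, local_id, email):
        key = email.strip().lower()
        # uuid5 is deterministic, so the same email always yields the same ID.
        canonical = self._by_email.setdefault(
            key, f"person:{uuid.uuid5(uuid.NAMESPACE_DNS, key)}"
        )
        self._by_source[(source, local_id)] = canonical
        return canonical

    def lookup(self, source, local_id):
        return self._by_source.get((source, local_id))


resolver = IdentityResolver()
# Hypothetical per-app identities for the same person.
chat_id = resolver.resolve("chat-app", "U123", "Ada@Example.com")
tracker_id = resolver.resolve("issue-tracker", "ada.l", " ada@example.com ")
print(chat_id == tracker_id)
```

Once both per-app identities resolve to the same node, people edges from either app attach to one person instead of two unrelated ones.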

Architecture detail: why not a graph database?¶

Dropbox experimented with graph DBs for the knowledge graph and rejected them (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash). Three stated reasons:

Latency. Graph traversal at Dash's query rates didn't meet the product's latency budget.
Query pattern mismatch. The agentic query load doesn't look like classic graph-DB workloads; traversal-heavy queries performed poorly.
Hybrid retrieval integration. Joining graph-DB results with the existing BM25 + vector index was "a challenge."

Their alternative: build the graph in memory / async pipeline, produce per-entity / per-query-class knowledge bundles (think: pre-computed contextual digests), feed those bundles into the same index pipeline as documents. At query time, the agent sees top-K bundle-and-doc hits — no separate graph query path. This is a key caveat for anyone considering graph-DB-for-RAG: Dash's production answer is "graph at ingest, flatten before query."
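The "graph at ingest, flatten before query" idea can be sketched as follows. The source describes bundles only as summary embeddings or contextual digests, so the text format, the `build_bundles` helper, and the entity IDs below are illustrative assumptions rather than Dash's actual representation.

```python
from collections import defaultdict


def build_bundles(edges, titles):
    """Flatten each entity's outgoing edges into one text digest.

    edges: iterable of (src, relation, dst) triples.
    titles: entity_id -> display name.
    """
    grouped = defaultdict(list)
    for src, relation, dst in edges:
        grouped[src].append(f"{relation} {titles.get(dst, dst)}")
    return {
        entity: f"{titles.get(entity, entity)}: " + "; ".join(sorted(facts))
        for entity, facts in grouped.items()
    }


edges = [
    ("person:ada", "owns", "doc:launch-plan"),
    ("person:ada", "member_of", "team:search"),
    ("doc:launch-plan", "references", "doc:design"),
]
titles = {
    "person:ada": "Ada",
    "doc:launch-plan": "Launch plan",
    "team:search": "Search team",
    "doc:design": "Design doc",
}

bundles = build_bundles(edges, titles)
print(bundles["person:ada"])
```

Because each bundle is plain text, it can be chunked, embedded, and indexed alongside ordinary documents, so no graph traversal is needed on the query path.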

Second wiki framing: enterprise data-integration substrate (Netflix UDA)¶

Where Dash uses the knowledge graph as a retrieval relevance substrate for agents, Netflix's UDA — Unified Data Architecture uses a knowledge graph as an enterprise data-integration substrate for schemas and pipelines (Source: sources/2025-06-14-netflix-model-once-represent-everywhere-uda).

"We needed a data catalog unified with a schema registry, but with a hard requirement for semantic integration. Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a knowledge graph approach." — Netflix, UDA post

In UDA the graph's nodes are:

Business concepts — actor, movie, asset — authored as domain models in the Upper metamodel.
System domains — GraphQL, Avro, Data Mesh, Mappings — modelled in the same language as business concepts.
Data containers — the federated-GraphQL entities, Data Mesh sources, Iceberg tables, Java APIs where instance data lives.
Mappings — edges connecting domain concepts to data containers.

The graph is structurally named-graph-first: every named graph conforms to a governing named graph, all the way up to the self-hosting Upper metamodel (patterns/self-referencing-metamodel-bootstrap). The info model itself is in the graph.

What the UDA deployment buys:

Both schema registry + data catalog at once. One substrate holds the schemas and the mappings to live data — no separate systems to keep in sync.
Schema generation. Domain models transpile into GraphQL, Avro, SQL, RDF, and Java via a transpiler family (patterns/schema-transpilation-from-domain-model).
Pipeline auto-provisioning. Data-movement pipelines (GraphQL → Data Mesh, CDC → Iceberg) are generated from the mapping graph + system domains.
Self-service analytics. Sphere walks the graph from business-concept nodes to warehouse containers and generates SQL (patterns/graph-walk-sql-generation).
Control-plane promotion. "The conceptual model must become part of the control plane" — the knowledge graph is load-bearing, not documentation.
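The self-service analytics item above, walking from business-concept nodes to warehouse containers and generating SQL, can be illustrated with a toy mapping graph. This is not Sphere's algorithm: the `MAPPINGS` table, the table and column names, and the single-table restriction are all assumptions for the sketch.

```python
# Hypothetical mapping edges: (concept, attribute) -> (table, column).
MAPPINGS = {
    ("movie", "title"): ("warehouse.movies", "title"),
    ("movie", "release_year"): ("warehouse.movies", "release_year"),
    ("actor", "name"): ("warehouse.actors", "full_name"),
}


def generate_select(concept, attributes):
    """Follow mapping edges from concept attributes to columns and emit a SELECT.

    Only handles attributes that all map to one table; joins are out of scope.
    An unmapped attribute raises KeyError.
    """
    targets = [MAPPINGS[(concept, attr)] for attr in attributes]
    tables = {table for table, _ in targets}
    if len(tables) != 1:
        raise ValueError("attributes map to more than one table")
    columns = ", ".join(
        f"{column} AS {attr}" for (_, column), attr in zip(targets, attributes)
    )
    return f"SELECT {columns} FROM {tables.pop()}"


sql = generate_select("actor", ["name"])
print(sql)
```

The caller asks for `actor.name` in business terms and never needs to know that the warehouse column is called `full_name`; that knowledge lives in the mapping edges.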

Key contrast with the Dash framing:

Dimension	Dash (retrieval)	UDA (integration)
Load	Per-query agent retrieval	Offline + continuous schema / pipeline generation
Stored as	Flattened "knowledge bundles" via the same BM25+vector index	RDF triples in named graphs with governance
Query	Vector + bundle lookup (the graph is compiled out before query)	SPARQL / Java API / federated GraphQL / generated SQL via graph walk
Scale concern	Latency at query time	Correctness + modularity of mappings
Pain point addressed	Relevance + personalization of retrieval	Duplicated + inconsistent models across many data systems
Substrate	Not a graph DB (flattened bundles)	RDF + SHACL + Upper's restricted subset

Both framings are legitimate; they address non-overlapping engineering problems with the same data structure.

Third wiki framing: MDM data-model-definition substrate (Zalando)¶

A third use of the same data structure appears in Zalando's 2021-07-28 MDM post: the knowledge graph as the authoring substrate for the logical data model of a golden record in MDM.

"By using knowledge graphs for a live-data representation of all systems' logical data models and how they map to a semantic layer of business concepts, we are able to automatically generate the logical data model of the golden record inside the knowledge graph with additional information on how it connects to systems' data model." — Zalando MDM post

The graph's nodes are System, Table, Column, Concept, Attribute, and Relationship. Domain experts author the column → concept / attribute / relationship mappings; a Python generator walks the graph and emits both the logical data model of the golden record and the per-source-system transformation data model. The graph is stored + visualised in Neo4j; the visualisation is the primary business-engineering communication artifact (patterns/visual-graph-for-business-engineering-alignment).
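The generator step can be sketched in a few lines. The source says only that a Python generator walks the graph and emits the two models, so the dictionary-based mapping format, the `generate_models` function, and the example systems and columns below are hypothetical.

```python
from collections import defaultdict

# Hypothetical mappings authored by domain experts:
# (system, table, column) -> (concept, attribute)
COLUMN_MAPPINGS = {
    ("crm", "customers", "cust_name"): ("Customer", "name"),
    ("billing", "accounts", "holder"): ("Customer", "name"),
    ("billing", "accounts", "vat_no"): ("Customer", "vat_number"),
}


def generate_models(mappings):
    """Derive the golden-record model and per-system transformation models."""
    golden = defaultdict(set)            # concept -> attributes
    transformations = defaultdict(dict)  # system -> {"table.column": "Concept.attr"}
    for (system, table, column), (concept, attribute) in mappings.items():
        golden[concept].add(attribute)
        transformations[system][f"{table}.{column}"] = f"{concept}.{attribute}"
    logical = {concept: sorted(attrs) for concept, attrs in golden.items()}
    return logical, dict(transformations)


logical_model, transformation_model = generate_models(COLUMN_MAPPINGS)
print(logical_model)
print(transformation_model["billing"])
```

Two source columns mapping to `Customer.name` collapse into one golden-record attribute, while each system keeps its own column-to-attribute transformation entry.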

Key contrast with the two earlier framings:

Dimension	Dash (retrieval)	UDA (integration)	Zalando MDM (data modeling)
Load	Per-query agent retrieval	Offline + continuous schema / pipeline generation	Design-time schema + mapping generation
Stored as	Flattened "knowledge bundles" via the same BM25+vector index	RDF triples in named graphs with governance	Neo4j property graph
Scale	Many docs per user; query-rate sensitive	Enterprise; many domains, many consumers	Tens of tables, hundreds of columns
Tooling	Hybrid BM25+vector index	RDF + SHACL + Upper metamodel	Neo4j property graph
Output	Ranked retrieval results for agents	Generated schemas + auto-provisioned pipelines	Golden-record logical data model + transformation data model
Pain point addressed	Relevance + personalization of retrieval	Duplicated + inconsistent models across many data systems	Business-engineering communication gap; manual diagram maintenance
Ceremony level	Low (flattened bundles, no graph DB)	High (semantic-web stack)	Medium (property graph, no RDF/SHACL)

All three framings treat the graph as the modeling substrate, but pick different storage + query strategies depending on whether the load is at query time (Dash), schema generation time (UDA), or design time (Zalando).

Fourth wiki framing: Trust & Safety identity-resolution substrate (Airbnb)¶

Airbnb's knowledge graph infrastructure uses the graph structure as a Trust & Safety substrate — specifically for identity resolution and relationship understanding at massive scale (7B nodes, 11B edges). Unlike Dash (retrieval), UDA (integration), or Zalando (MDM), the Airbnb framing is an OLTP graph workload with strict latency requirements on 4–8 hop traversals for fraud detection and linked-account identification (Source: sources/2026-05-19-airbnb-scaling-identity-graph-unified-knowledge-graph-infrastructure).
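A bounded multi-hop traversal for linked-account identification can be sketched with a breadth-first search. This is a generic illustration on an in-memory adjacency map, not Airbnb's query implementation; the node naming scheme and the `linked_accounts` helper are assumptions.

```python
from collections import deque


def linked_accounts(graph, start, max_hops):
    """Return accounts reachable from start within max_hops, with hop distance.

    graph: node -> iterable of neighbouring nodes (accounts and shared attributes).
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {
        node: hops
        for node, hops in distance.items()
        if node.startswith("account:") and node != start
    }


# Hypothetical graph: accounts linked through a shared device and a shared card.
graph = {
    "account:a": ["device:1"],
    "device:1": ["account:a", "account:b"],
    "account:b": ["device:1", "card:9"],
    "card:9": ["account:b", "account:c"],
    "account:c": ["card:9"],
}

print(linked_accounts(graph, "account:a", max_hops=4))
```

The frontier grows with the fan-out of every node visited, so a single high-degree node (for example, an attribute shared by many accounts) can dominate the cost of a 4–8 hop query.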

Key contrast with other framings:

Dimension	Dash (retrieval)	UDA (integration)	Zalando MDM	Airbnb (identity resolution)
Load	Per-query agent retrieval	Offline schema/pipeline gen	Design-time modeling	Real-time OLTP (4–8 hop traversals)
Scale	Many docs per user	Enterprise domains	Tens of tables	7B nodes, 11B edges, +5M edges/day
Stored as	Flattened bundles	RDF named graphs	Neo4j property graph	JanusGraph + DynamoDB
Key challenge	Relevance + personalization	Model consistency	Business-engineering gap	Long-tail latency on dense subgraphs

The Airbnb instance is the wiki's first canonical disclosure of a knowledge graph as a graph-database-scale OLTP workload with explicit high-fanout latency challenges.

Seen in¶

sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — Dash retrieval framing; knowledge graph over unified search index as relevance substrate.
sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — follow-up with canonical-ID + flattened-bundle details.
sources/2025-06-14-netflix-model-once-represent-everywhere-uda — Netflix UDA enterprise-data-integration framing.
— Zalando MDM data-model-definition framing.
sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph — Netflix MDS as a knowledge graph for ML asset lineage + impact analysis rather than retrieval ranking. Unifies six ML source systems (Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, Datasets, Identity Platform) into a single graph stored in systems/datomic as reified edges. Relationships are derived in async enrichment jobs via multi-hop walks that materialize derived edges back to the graph. Distinct from the Dropbox Dash retrieval-substrate framing by being purpose-built for ML asset discovery + lineage rather than ranked retrieval.
sources/2026-05-19-airbnb-scaling-identity-graph-unified-knowledge-graph-infrastructure — Airbnb Trust & Safety identity-resolution framing; 7B-node identity graph on JanusGraph + DynamoDB with 4–8 hop queries for fraud detection and linked-account identification.

Related¶

patterns/precomputed-relevance-graph — the pattern framing: build graph + ranking offline, not at query time.
patterns/canonical-entity-id — entity-resolution primitive that makes the graph's nodes coherent across sources.
systems/dash-search-index — Dash's unified index + graph.
concepts/hybrid-retrieval-bm25-vectors — the retrieval substrate the graph's "knowledge bundles" get fused into.
concepts/ndcg — ranking metric Dash reports graph-derived wins on.
concepts/context-engineering — parent discipline; knowledge graph is one tactic for filtering context to relevance.
systems/netflix-uda · systems/netflix-upper · systems/netflix-sphere — canonical enterprise-data-integration instance.
concepts/named-graph · concepts/rdf · concepts/domain-model · concepts/metamodel · concepts/semantic-interoperability · concepts/data-container — UDA-introduced primitives.
patterns/model-once-represent-everywhere · patterns/graph-walk-sql-generation — UDA-canonicalised patterns.
systems/zalando-mdm-system · systems/neo4j — Zalando MDM instance + chosen graph tool.
concepts/master-data-management · concepts/logical-data-model · concepts/transformation-data-model · concepts/semantic-layer-of-business-concepts — Zalando-MDM-introduced vocabulary.
patterns/knowledge-graph-for-mdm-modeling · patterns/mapping-driven-schema-generation · patterns/visual-graph-for-business-engineering-alignment — Zalando-MDM-canonicalised patterns.
