PATTERN Cited by 1 source
Canonical entity ID¶
Canonical entity ID is the pattern of resolving every representation of the same real-world entity (person, document, project, meeting) across integrated source systems into one stable ID — and keying the knowledge graph / retrieval ranker on that ID rather than on per-source identifiers.
Intent¶
Every SaaS app has its own notion of identity. Josh Clemm from Dropbox:
"Every app that we connect with has its own concept or definition of people, and so coming up with a canonical ID for who someone is is very, very impactful for us overall."
(Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)
Without canonical IDs:
- A query like "Jason's past context engineering talks" has to fan out per-connector with per-app identity lookups — is "Jason" the Google Workspace user, the Slack user, the Jira reporter, the meeting attendee?
- The concepts/knowledge-graph has duplicate nodes per person per source; edges can't cross sources cleanly.
- Ranking signals (activity, collaboration, authorship) fragment across the duplicates.
- User-facing profile views can't aggregate because there's no join key.
With canonical IDs, each person (or doc, or project) is one node in the graph, and all source-system edges terminate on that node.
Mechanism¶
- Connector layer tags each ingested record with source IDs. (Google Docs doc-ID, Confluence page-ID, Slack user-ID, Jira account-ID, …)
- Entity resolution layer. For each entity type (person, doc, project, meeting), apply matching rules:
  - Deterministic: shared email address, SSO identifier, directory-sync link.
  - Fuzzy: name + domain + title match.
  - Behavioral: same activity patterns.
- Canonical ID minted. Often a UUID scoped to the tenant, or a stable hash of the strongest deterministic identifier.
- All edges key on canonical IDs. Source IDs are kept as secondary attributes (for display + back-link) but not used in graph traversal.
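The steps above can be sketched in a few lines. This is a minimal illustration of the deterministic path only, assuming a shared email address is the strongest identifier; the record fields, the `resolve_people` helper, and the tenant namespace are hypothetical and are not taken from Dash.

```python
import uuid
from collections import defaultdict

# Hypothetical tenant namespace (assumption); any stable per-tenant UUID works.
TENANT_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "tenant.example.com")

def resolve_people(records):
    """Group source records by normalized email and mint one canonical ID
    per group. Returns {(source, source_id): canonical_id}."""
    by_email = defaultdict(list)
    for rec in records:
        by_email[rec["email"].strip().lower()].append(rec)

    mapping = {}
    for email, group in by_email.items():
        # Stable hash of the deterministic identifier, scoped to the tenant.
        canonical_id = str(uuid.uuid5(TENANT_NS, email))
        for rec in group:
            mapping[(rec["source"], rec["source_id"])] = canonical_id
    return mapping

# Illustrative records; the IDs and addresses are made up.
records = [
    {"source": "slack", "source_id": "U123", "email": "Jason@example.com"},
    {"source": "jira", "source_id": "acct-9", "email": "jason@example.com"},
    {"source": "gws", "source_id": "g-77", "email": "maria@example.com"},
]
ids = resolve_people(records)

# Edges key on the canonical ID; the source ID stays as a secondary attribute.
edge = {
    "person": ids[("slack", "U123")],
    "rel": "posted_in",
    "target": "channel-eng",
    "source": "slack",
    "source_id": "U123",
}
```

Because the ID is derived from the identifier rather than generated randomly, re-running resolution yields the same canonical ID. Fuzzy and behavioral matches would need a separate merge step, which this sketch leaves out.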
Load-bearing for graph-driven retrieval¶
Dash specifically credits canonical-ID people resolution for measurable wins on retrieval quality:
"Say that I want to find all the past context engineering talks from Jason. But who is Jason? How do you know that? Well, if you have this graph—this people model—you can then go ahead and fetch that and add that to the context, and it's not having to do a ton of different retrieval overall… And we use normalized discounted cumulative gain (NDCG) a lot to score the results to retrieve. But just by doing this people-based result we saw some really nice wins."
The wins come from two places:
- Retrieval-side. "Jason" resolves to a single node once, rather than once per connected source.
- Ranking-side. Per-user relevance signals (my team, my manager, my collaborators) can only be computed meaningfully when each person has one canonical ID.
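For reference, NDCG rewards rankings that place the most relevant results first. The sketch below uses the common linear-gain form with a log2 position discount. The source does not say which variant or cutoff Dash uses, and the relevance grades are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ranked = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded relevance of the top results for one query.
fragmented = [1, 0, 3, 2]   # best result buried at rank 3
resolved = [3, 2, 1, 0]     # best result first

assert ndcg(resolved) == 1.0
assert ndcg(fragmented) < ndcg(resolved)
```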
Tradeoffs¶
- Matching is never perfect. False positives merge two people into one node, which can leak content across users' ACLs. False negatives leave duplicate nodes, which degrades ranking and produces duplicate profile cards. Both require monitoring and reconciliation tooling.
- ACL complexity. A canonical node aggregates edges from multiple sources, each with its own access control. Every edge must respect its contributing source's ACLs; an edge that skips the check can expose restricted content through any traversal that crosses it.
- Cold start. New users / new docs / new projects take ingestion cycles to resolve — the graph is eventually consistent on identity.
- Schema drift per source. When a source system changes its identity model (e.g. moves from email-as-key to SCIM-ID-as-key), the entity-resolution layer must evolve without breaking historical edges.
- User visibility / correction. Users occasionally need to merge / split canonical entities; requires admin tooling.
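The ACL tradeoff above is easier to see with a concrete shape. One possible approach, sketched here under assumptions, is to keep the contributing source and source ID on every edge and filter edges per viewer before traversal. The `Edge` type, the `visible_edges` helper, and the `can_read` callback are hypothetical and do not describe how Dash enforces ACLs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str    # canonical ID of the person or project node
    target: str     # canonical ID of the linked artifact
    source: str     # connector that contributed this edge
    source_id: str  # per-source ID, kept for ACL checks and back-links

def visible_edges(edges, viewer_can_read):
    """Keep only edges whose contributing source grants the viewer access."""
    return [e for e in edges if viewer_can_read(e.source, e.source_id)]

# Illustrative data: the viewer can read the Google Doc but not the Jira issue.
edges = [
    Edge("person-001", "doc-abc", "gdocs", "doc-1"),
    Edge("person-001", "issue-xyz", "jira", "PROJ-7"),
]
allowed = {("gdocs", "doc-1")}

def can_read(source, source_id):
    return (source, source_id) in allowed

visible = visible_edges(edges, can_read)
```

Filtering at the edge level means a merged node never exposes more than the viewer could already see in each individual source.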
Applies beyond people¶
Dash names people resolution explicitly, but the same pattern is used for:
- Documents: the "same deck" might live as a Google Slides file, a PDF attachment to an email, and a Confluence-embedded copy.
- Meetings — calendar invite + transcript + follow-up doc.
- Projects — Jira epic + Confluence space + Slack channel + shared Drive folder.
Each benefits from the same canonical-ID discipline — one project node linking to its artifacts across sources.
Why it enables the "knowledge bundle" flattening¶
Dash's graph isn't stored in a graph DB; it's built asynchronously and flattened into "knowledge bundles" that are re-ingested through the same index pipeline (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash). For a bundle to be a meaningful summary around (say) a person or a project, the graph must have one node representing that entity — not per-source duplicates. Canonical entity IDs are the prerequisite for the bundle shape.
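As a rough illustration of the flattening step, the sketch below renders the edges of one canonical node as a plain-text document that an index pipeline could ingest. The source does not describe the bundle format, so the layout, the `flatten_bundle` helper, and the sample edges are all assumptions.

```python
def flatten_bundle(canonical_id, display_name, edges):
    """Render one entity's edges as a text document for indexing.

    `edges` is an iterable of (relation, target, source) tuples.
    """
    lines = [f"entity: {display_name} ({canonical_id})"]
    for relation, target, source in sorted(edges):
        lines.append(f"{relation}: {target} [via {source}]")
    return "\n".join(lines)

# Illustrative edges, all terminating on a single canonical person node.
bundle = flatten_bundle(
    "person-001",
    "Jason",
    [
        ("presented", "Context engineering talk", "calendar"),
        ("authored", "Context engineering notes", "gdocs"),
    ],
)
```

If the same person existed as one node per source, each bundle would cover only that source's slice of activity, which is why the canonical ID has to exist first.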
Seen in¶
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dash's "who is Jason?" example; NDCG-measured wins from people-based ranking; explicit call-out of "canonical ID for who someone is" as load-bearing.
Related¶
- concepts/knowledge-graph — the data model this pattern makes coherent.
- systems/dash-search-index — consumer of canonical-ID-based graph signals in ranking.
- patterns/precomputed-relevance-graph — the production architecture the canonical-ID layer feeds.
- concepts/ndcg — the scoring metric Dash reports wins on after applying this pattern.
- systems/dropbox-dash — production instance.