Scaling Airbnb's identity graph with a unified knowledge graph infrastructure
Summary
Airbnb built an internally managed, multi-tenant knowledge graph infrastructure to replace a fragmented landscape of graph workloads (SQL-modeled "graphs," offline-only warehouse graphs, self-managed open-source DBs, third-party PaaS). The platform is built on JanusGraph (Apache TinkerPop / Gremlin) with DynamoDB as the storage backend and OpenSearch for indexing. The identity graph — 7 billion nodes, 11 billion edges, growing at 5 million edges/day — was the first tenant to migrate from a third-party SaaS graph database to this internal platform. The migration delivered significant latency reductions (especially at P95/P99), eliminated periodic manual reboots, and achieved 10× write QPS scalability via auto-scaling.
Key takeaways
- Scale of the identity graph: 7B nodes, 11B edges, ~5M new edges/day. Queries span 4–8 hops for Trust & Safety use cases (fraud detection, linked-account identification).
- Four anti-patterns before the platform: Teams modeled graphs in SQL (expensive joins), built offline-only warehouse graphs (daily-stale), self-managed open-source graph DBs (high ops toil), or used third-party managed PaaS (vendor lock-in, performance ceilings, limited tunability).
- Technology choice (JanusGraph + DynamoDB): JanusGraph was chosen for its labeled property graph model, expressive Gremlin queries, pluggable storage backend (enabling DynamoDB for persistence), and extensible open-source codebase. Separating storage from graph logic is the key architectural advantage: the team can iterate on graph logic without reinventing distributed storage.
- Multi-tenant platform with namespace isolation: Each tenant (identity graph, fraud, inventory knowledge graph, data lineage) operates in an isolated namespace. A management service handles schema enforcement, index management, and schematized Thrift APIs.
- Optimized transactions via DynamoDB conditional writes: Replaced JanusGraph's default heavy locking with a custom transaction strategy built on DynamoDB's conditional writes and transaction APIs, which lowers overhead while preserving data integrity.
- Parallel multi-slice fetching: Improved JanusGraph's `getMultiSlices` interface to fetch data in parallel, significantly reducing latency for high-fanout graph queries (the dominant latency driver at 4–8 hops).
- Client-side query rewriting: (a) Removed Gremlin Path/SimplePath steps that fell back to non-batched backend queries and replaced them with conditional queries that ensure acyclic results. (b) Restructured side-effect aggregation steps to avoid non-batched substeps in JanusGraph's query planner.
- Observability gap closed: Integrated Airbnb's distributed tracing into the internal JanusGraph fork, enabling end-to-end request tracing that the open-source version lacked.
- Migration gains: Performance improved across all query patterns; P99 latency dropped significantly, taming the long tail; periodic manual instance reboots are no longer needed; write QPS scales 10× via auto-scaling; incident investigation is faster and more transparent.
- Shadow-traffic benchmarking: Both old (vendor) and new (internal) engines supported Gremlin, enabling side-by-side query benchmarking on shadow traffic before cutover.
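The conditional-write takeaway above can be illustrated with optimistic concurrency: read a row with its version, then write only if the version is unchanged, retrying on conflict instead of holding a lock. The source does not describe Airbnb's actual strategy, so this is a minimal in-memory sketch under stated assumptions. `VersionedStore`, `put_if_version`, and `add_edge` are hypothetical names, and the store's mutex merely stands in for the atomicity a real backend provides server-side. In DynamoDB itself, the precondition would be expressed as a `ConditionExpression` on the write, and a failed check surfaces as a `ConditionalCheckFailedException`.

```python
import threading
from dataclasses import dataclass, field


class ConditionFailed(Exception):
    """Raised when a conditional write's precondition does not hold."""


@dataclass
class VersionedStore:
    """Hypothetical in-memory stand-in for a table that supports conditional puts."""
    _rows: dict = field(default_factory=dict)
    # The mutex only simulates the atomic check-and-write a real backend performs.
    _mutex: threading.Lock = field(default_factory=threading.Lock)

    def get(self, key):
        # Returns (value, version); a missing row reads as version 0.
        with self._mutex:
            return self._rows.get(key, (None, 0))

    def put_if_version(self, key, value, expected_version):
        # Compare-and-set: succeed only if nobody has written since our read.
        with self._mutex:
            _, current = self._rows.get(key, (None, 0))
            if current != expected_version:
                raise ConditionFailed(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._rows[key] = (value, current + 1)


def add_edge(store, vertex, neighbor, max_retries=5):
    """Add neighbor to vertex's adjacency list without taking a lock up front."""
    for _ in range(max_retries):
        edges, version = store.get(vertex)
        updated = sorted(set(edges or []) | {neighbor})
        try:
            store.put_if_version(vertex, updated, version)
            return updated
        except ConditionFailed:
            continue  # another writer won the race; re-read and retry
    raise RuntimeError(f"gave up on {vertex} after {max_retries} attempts")


store = VersionedStore()
add_edge(store, "user:1", "device:9")
add_edge(store, "user:1", "email:3")
print(store.get("user:1"))
```

The design trade-off is that writers pay nothing when there is no contention and pay a retry only when two writers collide on the same row, whereas a lock-first scheme pays the locking cost on every write.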
Operational numbers
| Metric | Value |
|---|---|
| Graph nodes | 7 billion |
| Graph edges | 11 billion |
| Daily edge growth | ~5 million/day |
| Typical query hops | 4–8 |
| Write QPS improvement | 10× over vendor solution |
| Vendor reboots eliminated | Periodic → zero |
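Some back-of-the-envelope figures follow directly from the table. The sketch below derives graph density and relative growth from the published numbers, then gives a loose fanout bound for the 4–8 hop range. The fanout bound assumes every vertex has the average degree, which a real identity graph with skewed degrees will not satisfy, so treat it as an order-of-magnitude illustration only. All variable and function names are hypothetical.

```python
NODES = 7_000_000_000
EDGES = 11_000_000_000
NEW_EDGES_PER_DAY = 5_000_000

# Density: edges per node, and average degree (each edge touches two endpoints).
edges_per_node = EDGES / NODES
avg_degree = 2 * EDGES / NODES

# Growth relative to the current edge count, assuming the daily rate stays flat.
yearly_new_edges = NEW_EDGES_PER_DAY * 365
yearly_growth = yearly_new_edges / EDGES


def frontier_upper_bound(branching, hops):
    """Loose bound on vertices reached at exactly `hops` hops, ignoring revisits
    and assuming a uniform branching factor (an assumption, not a measurement)."""
    return branching ** hops


print(f"edges per node:  {edges_per_node:.2f}")
print(f"average degree:  {avg_degree:.2f}")
print(f"yearly growth:   {yearly_growth:.1%}")
for hops in (4, 8):
    bound = frontier_upper_bound(avg_degree, hops)
    print(f"frontier bound at {hops} hops: {bound:,.0f}")
```

Even under this mild uniform-degree assumption the frontier grows by roughly two orders of magnitude between 4 and 8 hops, which is consistent with fanout being named as the dominant latency driver.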
Architecture
┌─────────────────────────────────────────────┐
│ Airbnb Knowledge Graph Platform │
├─────────────────────────────────────────────┤
│ Management Service │
│ (schema enforcement, index mgmt, Thrift) │
├─────────────────────────────────────────────┤
│ JanusGraph Engine (internal fork) │
│ ├─ Custom transaction strategy │
│ ├─ Parallel getMultiSlices │
│ └─ Distributed tracing integration │
├─────────────────────────────────────────────┤
│ Storage: DynamoDB │ Indexing: OpenSearch│
└─────────────────────────────────────────────┘
Data ingestion: near real-time via asynchronous events (write path). Serving: low-latency real-time service calls (read path). Read/write traffic isolation at the graph computation engine layer.
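The "Parallel getMultiSlices" item in the engine layer comes down to issuing the per-key storage reads of one traversal step concurrently rather than one after another. The sketch below simulates that with a thread pool. It is not JanusGraph code: `fetch_slice`, `get_multi_slices_sequential`, `get_multi_slices_parallel`, the `ADJACENCY` data, and the 20 ms simulated round trip are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical adjacency rows standing in for slices in the storage backend.
ADJACENCY = {
    f"user:{i}": [f"device:{i}", f"email:{i}"] for i in range(8)
}


def fetch_slice(key, delay=0.02):
    """Simulate one storage round trip returning the edge slice for one key."""
    time.sleep(delay)
    return ADJACENCY.get(key, [])


def get_multi_slices_sequential(keys):
    # One round trip after another: latency grows linearly with fanout.
    return {key: fetch_slice(key) for key in keys}


def get_multi_slices_parallel(keys, max_workers=8):
    # Round trips overlap; pool.map preserves the order of `keys`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(keys, pool.map(fetch_slice, keys)))


keys = list(ADJACENCY)

start = time.perf_counter()
sequential = get_multi_slices_sequential(keys)
sequential_s = time.perf_counter() - start

start = time.perf_counter()
parallel = get_multi_slices_parallel(keys)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s * 1000:.0f} ms, parallel: {parallel_s * 1000:.0f} ms")
```

With I/O-bound reads, the parallel version's latency approaches that of the slowest single read instead of the sum of all reads, which matters most when a hop fans out to many keys.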
Caveats
- No quantitative latency numbers disclosed (only "significant" P99 reduction, "huge improvement").
- Graph query patterns and hop-depth distribution beyond "4–8" not detailed.
- DynamoDB partition key strategy for the graph storage not disclosed.
- OpenSearch indexing strategy (which properties indexed, refresh intervals) not disclosed.
- Auto-scaling configuration (triggers, warmup, cooldown) not disclosed.
- Multi-region / disaster recovery posture not discussed.
- Cost comparison (vendor vs. internal) not disclosed.
- The "client-side query optimization" implies the Gremlin → JanusGraph query planner has known inefficiencies that require application-level workarounds.
- Future plans not detailed beyond "unified knowledge graph infrastructure."
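The client-side optimization caveat can be made concrete with a simulation of the idea behind the rewrite: expand the traversal one hop at a time with a single batched backend call per hop, and keep results acyclic with a visited-set condition instead of a per-path cycle check. This is an analogy in plain Python, not the Gremlin that Airbnb wrote, which the source does not show. It also returns the set of linked vertices rather than enumerating paths, on the assumption that linked-account lookups need the former. `CountingBackend`, `neighbors_batch`, and `linked_vertices` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical undirected identity links, including a cycle a-b-c-a.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]


def build_adjacency(edges):
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


class CountingBackend:
    """Stand-in for the storage layer that counts batched round trips."""

    def __init__(self, adjacency):
        self.adjacency = adjacency
        self.calls = 0

    def neighbors_batch(self, vertices):
        # One call serves the whole frontier, mimicking a batched backend query.
        self.calls += 1
        return {v: set(self.adjacency.get(v, ())) for v in vertices}


def linked_vertices(backend, start, max_hops):
    """Vertices reachable from `start` within `max_hops`, one batch per hop."""
    visited = {start}
    frontier = {start}
    for _ in range(max_hops):
        if not frontier:
            break
        batch = backend.neighbors_batch(sorted(frontier))
        next_frontier = set()
        for neighbors in batch.values():
            # The visited-set condition keeps the expansion acyclic.
            next_frontier |= neighbors - visited
        visited |= next_frontier
        frontier = next_frontier
    return visited - {start}


backend = CountingBackend(build_adjacency(EDGES))
print(sorted(linked_vertices(backend, "a", max_hops=2)), backend.calls)
```

The number of backend calls is bounded by the hop count rather than by the number of paths, which is the property the non-batched fallback loses.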
Source
- Original: https://medium.com/airbnb-engineering/scaling-airbnbs-identity-graph-with-a-unified-knowledge-graph-infrastructure-ebac467b7836?source=rss----53c7c27702d5---4
- Raw markdown: raw/airbnb/2026-05-19-scaling-airbnbs-identity-graph-with-a-unified-knowledge-grap-2187acf0.md
Related
- concepts/knowledge-graph — this post adds a fourth wiki framing: knowledge graph as Trust & Safety identity-resolution substrate at graph-database scale (7B nodes), contrasting with Dash (retrieval), UDA (integration), and Zalando (MDM modeling)
- concepts/graph-traversal-fanout — 4–8 hop traversals on a dense identity graph are the dominant latency challenge
- concepts/tail-latency-at-scale — high-fanout nodes cause disproportionate P95/P99 growth
- systems/dynamodb — used as JanusGraph's pluggable storage backend
- systems/janusgraph — open-source distributed graph database; Airbnb runs an internal fork with custom optimizations
- concepts/pluggable-storage-backend — JanusGraph's architecture separating graph logic from persistence
- patterns/vendor-to-internal-graph-migration — structured migration using shadow-traffic benchmarking on shared query language (Gremlin)
- patterns/client-side-query-rewriting — application-level Gremlin rewrites to avoid JanusGraph query planner inefficiencies
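The shadow-traffic benchmarking pattern listed above relies on both engines accepting the same queries, so each shadowed query can be replayed against both and compared for correctness and latency. The harness below is a minimal sketch of that idea, assuming each engine is exposed as a callable that takes a query and returns a list of results. `shadow_compare`, `timed`, and `percentile` are hypothetical helpers, and the source does not describe how Airbnb's harness was built.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed(engine, query):
    """Run one query and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = engine(query)
    return result, (time.perf_counter() - start) * 1000.0


def shadow_compare(queries, primary, candidate):
    """Replay queries against both engines; report mismatches and tail latency."""
    if not queries:
        raise ValueError("need at least one shadowed query")
    mismatches, primary_ms, candidate_ms = [], [], []
    for query in queries:
        expected, p_ms = timed(primary, query)
        actual, c_ms = timed(candidate, query)
        primary_ms.append(p_ms)
        candidate_ms.append(c_ms)
        # Compare order-insensitively: traversal result order may differ by engine.
        if sorted(expected) != sorted(actual):
            mismatches.append(query)
    return {
        "mismatches": mismatches,
        "primary_p95_ms": percentile(primary_ms, 95),
        "primary_p99_ms": percentile(primary_ms, 99),
        "candidate_p95_ms": percentile(candidate_ms, 95),
        "candidate_p99_ms": percentile(candidate_ms, 99),
    }
```

In use, `primary` would wrap the vendor engine and `candidate` the internal one, and only the primary's result would be returned to callers until the mismatch list is empty and the candidate's tail latency is acceptable.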