AIRBNB

Scaling Airbnb's identity graph with a unified knowledge graph infrastructure

Summary

Airbnb built an internally managed, multi-tenant knowledge graph infrastructure to replace a fragmented landscape of graph workloads (SQL-modeled "graphs," offline-only warehouse graphs, self-managed open-source DBs, third-party PaaS). The platform is built on JanusGraph (Apache TinkerPop / Gremlin) with DynamoDB as the storage backend and OpenSearch for indexing. The identity graph — 7 billion nodes, 11 billion edges, growing at 5 million edges/day — was the first tenant to migrate from a third-party SaaS graph database to this internal platform. The migration delivered significant latency reductions (especially at P95/P99), eliminated periodic manual reboots, and achieved 10× write QPS scalability via auto-scaling.

Key takeaways

Scale of the identity graph: 7B nodes, 11B edges, ~5M new edges/day. Queries span 4–8 hops for Trust & Safety use cases (fraud detection, linked-account identification).
Four anti-patterns before the platform: Teams modeled graphs in SQL (expensive joins), built offline-only warehouse graphs (daily-stale), self-managed open-source graph DBs (high ops toil), or used third-party managed PaaS (vendor lock-in, performance ceilings, limited tunability).
Technology choice — JanusGraph + DynamoDB: JanusGraph was chosen for its labeled property graph model, expressive Gremlin queries, pluggable storage backend (enabling DynamoDB for persistence), and open-source codebase extensibility. Storage separation is the key architectural advantage: iterate on graph logic without reinventing distributed storage.
Multi-tenant platform with namespace isolation: Each tenant (identity graph, fraud, inventory knowledge graph, data lineage) operates in an isolated namespace. A management service handles schema enforcement, index management, and schematized Thrift APIs.
Optimized transactions via DynamoDB conditional writes: Replaced JanusGraph's default heavy locking with a custom transaction strategy using DynamoDB's conditional writes and transaction APIs — lower overhead while preserving data integrity.
Parallel multi-slice fetching: Improved JanusGraph's getMultiSlices interface to fetch data in parallel, significantly reducing latency for high-fanout graph queries (the dominant latency driver at 4–8 hops).
Client-side query rewriting: (a) Removed Gremlin Path/SimplePath steps that fell back to non-batched backend queries — replaced with conditional queries ensuring acyclic results. (b) Restructured side-effect aggregation steps to avoid non-batched substeps in JanusGraph's query planner.
Observability gap closed: Integrated Airbnb's distributed tracing into the internal JanusGraph fork, enabling end-to-end request tracing that the open-source version lacked.
Migration gains: Performance improved across all query patterns; P99 latency dropped significantly, showing that the long tail was brought under control; periodic manual instance reboots were no longer needed; write QPS scaled 10× via auto-scaling; incident investigation became faster and more transparent.
Shadow-traffic benchmarking: Both old (vendor) and new (internal) engines supported Gremlin, enabling side-by-side query benchmarking on shadow traffic before cutover.
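
The shadow-traffic comparison in the last takeaway can be pictured as a small replay harness. The following is a minimal sketch, not Airbnb's tooling: `shadow_compare`, `percentile`, and the two engine callables are hypothetical names, and the engines are plain Python functions standing in for Gremlin clients of the vendor and internal systems.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def shadow_compare(queries, old_engine, new_engine):
    """Replay each query against both engines, recording latency and result mismatches."""
    old_latencies, new_latencies, mismatches = [], [], []
    for query in queries:
        t0 = time.perf_counter()
        old_result = old_engine(query)
        t1 = time.perf_counter()
        new_result = new_engine(query)
        t2 = time.perf_counter()
        old_latencies.append(t1 - t0)
        new_latencies.append(t2 - t1)
        # Compare as sets: this sketch assumes result order is not significant.
        if set(old_result) != set(new_result):
            mismatches.append(query)
    return {
        "old_p99": percentile(old_latencies, 99),
        "new_p99": percentile(new_latencies, 99),
        "mismatches": mismatches,
    }
```

Because both engines accept the same query language, the same query objects can be passed to each callable unchanged; only the client behind the callable differs.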

Operational numbers

Metric	Value
Graph nodes	7 billion
Graph edges	11 billion
Daily edge growth	~5 million/day
Typical query hops	4–8
Write QPS improvement	10× over vendor solution
Vendor reboots eliminated	Periodic → zero
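
A few derived figures follow directly from the table. The arithmetic below is a back-of-the-envelope sketch: it assumes the daily edge growth is constant and spread evenly over the day, which the source does not state, so the per-second figure is an average and not a peak write rate.

```python
NODES = 7_000_000_000
EDGES = 11_000_000_000
NEW_EDGES_PER_DAY = 5_000_000
SECONDS_PER_DAY = 86_400

# Average number of edges per node across the whole graph.
edges_per_node = EDGES / NODES

# Average edge-creation rate if growth were spread evenly over 24 hours (assumption).
avg_new_edges_per_sec = NEW_EDGES_PER_DAY / SECONDS_PER_DAY

# Daily growth as a percentage of the current edge count.
daily_growth_pct = 100 * NEW_EDGES_PER_DAY / EDGES

# Days needed to add another billion edges at the current rate (assumption: rate stays flat).
days_per_billion_edges = 1_000_000_000 / NEW_EDGES_PER_DAY

print(f"{edges_per_node:.2f} edges per node")
print(f"{avg_new_edges_per_sec:.1f} new edges per second on average")
print(f"{daily_growth_pct:.3f}% daily edge growth")
print(f"{days_per_billion_edges:.0f} days per additional billion edges")
```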

Architecture

┌─────────────────────────────────────────────┐
│       Airbnb Knowledge Graph Platform       │
├─────────────────────────────────────────────┤
│  Management Service                         │
│  (schema enforcement, index mgmt, Thrift)   │
├─────────────────────────────────────────────┤
│  JanusGraph Engine (internal fork)          │
│  ├─ Custom transaction strategy             │
│  ├─ Parallel getMultiSlices                 │
│  └─ Distributed tracing integration         │
├─────────────────────────────────────────────┤
│  Storage: DynamoDB   │  Indexing: OpenSearch│
└─────────────────────────────────────────────┘
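
The "Parallel getMultiSlices" item in the engine layer refers to issuing backend slice reads concurrently instead of one after another. The sketch below illustrates the idea only: it is not JanusGraph code, `fetch_slice` is a hypothetical stand-in that sleeps to mimic a storage round trip, and the thread-pool size is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(store, key, delay=0.02):
    """Hypothetical single slice read; the sleep mimics one backend round trip."""
    time.sleep(delay)
    return store.get(key, [])


def fetch_sequential(store, keys):
    """Baseline: one round trip after another, so latency grows with the number of keys."""
    return {key: fetch_slice(store, key) for key in keys}


def fetch_parallel(store, keys, max_workers=8):
    """Issue the slice reads concurrently; latency approaches that of the slowest read."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input keys.
        results = pool.map(lambda key: fetch_slice(store, key), keys)
        return dict(zip(keys, results))
```

For a high-fanout vertex, the sequential version pays one round trip per key, while the parallel version pays roughly one round trip per batch of `max_workers` keys.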

Data ingestion: near real-time via asynchronous events (write path). Serving: low-latency real-time service calls (read path). Read and write traffic are isolated at the graph computation engine layer.
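
On the write path, the custom transaction strategy listed under Key takeaways relies on conditional writes in place of heavier locking. DynamoDB exposes this through condition expressions on write requests; the sketch below does not call DynamoDB and instead uses an in-memory stand-in to show the optimistic pattern. The per-item version counter, `VersionedStore`, and `add_edge` are illustrative assumptions, since the source does not say which condition Airbnb checks.

```python
import threading


class ConditionFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""


class VersionedStore:
    """In-memory stand-in for a key-value store that supports conditional writes."""

    def __init__(self):
        self._items = {}  # key -> (version, value)
        self._mutex = threading.Lock()  # models the store's own per-write atomicity

    def get(self, key):
        return self._items.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        with self._mutex:
            current_version, _ = self._items.get(key, (0, None))
            if current_version != expected_version:
                raise ConditionFailed(key)
            self._items[key] = (current_version + 1, value)


def add_edge(store, vertex, neighbor, max_retries=5):
    """Read-modify-write with optimistic retry; no lock is held across the round trip."""
    for _ in range(max_retries):
        version, edges = store.get(vertex)
        updated = sorted(set(edges or []) | {neighbor})
        try:
            store.put_if_version(vertex, version, updated)
            return updated
        except ConditionFailed:
            continue  # another writer won the race; re-read and try again
    raise RuntimeError(f"gave up updating {vertex!r} after {max_retries} attempts")
```

A writer holding a stale version is rejected instead of silently overwriting newer data, which is how integrity is preserved without a separate lock acquisition step.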

Caveats

No quantitative latency numbers disclosed (only "significant" P99 reduction, "huge improvement").
Graph query patterns and hop-depth distribution beyond "4–8" not detailed.
DynamoDB partition key strategy for the graph storage not disclosed.
OpenSearch indexing strategy (which properties indexed, refresh intervals) not disclosed.
Auto-scaling configuration (triggers, warmup, cooldown) not disclosed.
Multi-region / disaster recovery posture not discussed.
Cost comparison (vendor vs. internal) not disclosed.
The "client-side query optimization" implies the Gremlin → JanusGraph query planner has known inefficiencies that require application-level workarounds.
Future plans not detailed beyond "unified knowledge graph infrastructure."
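
To make the client-side rewrite caveat concrete, the sketch below shows one way a traversal can guarantee acyclic results with an explicit condition instead of per-path tracking, while keeping one batched backend call per hop. It is a plain-Python illustration under stated assumptions, not Gremlin and not Airbnb's rewrite: `batched_neighbors` and `linked_within` are hypothetical names, and a global visited set is a stricter filter than Gremlin's per-path `simplePath()`, which suits reachability questions such as linked-account lookup.

```python
def batched_neighbors(graph, vertices):
    """Hypothetical stand-in for one batched backend call covering many vertices."""
    return {vertex: graph.get(vertex, []) for vertex in vertices}


def linked_within(graph, start, max_hops):
    """Return vertices reachable from start within max_hops, plus the backend call count."""
    visited = {start}
    frontier = {start}
    backend_calls = 0
    for _ in range(max_hops):
        if not frontier:
            break
        adjacency = batched_neighbors(graph, frontier)  # one call per hop
        backend_calls += 1
        next_frontier = set()
        for neighbors in adjacency.values():
            for neighbor in neighbors:
                # Explicit condition in place of path tracking: never revisit a vertex.
                if neighbor not in visited:
                    next_frontier.add(neighbor)
        visited |= next_frontier
        frontier = next_frontier
    return visited - {start}, backend_calls
```

With this shape, the number of backend calls is bounded by the hop count (4 to 8 in the source) instead of by the number of paths explored.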

Source¶

concepts/knowledge-graph — this post adds a fourth wiki framing: knowledge graph as Trust & Safety identity-resolution substrate at graph-database scale (7B nodes), contrasting with Dash (retrieval), UDA (integration), and Zalando (MDM modeling)
concepts/graph-traversal-fanout — 4–8 hop traversals on a dense identity graph are the dominant latency challenge
concepts/tail-latency-at-scale — high-fanout nodes cause disproportionate P95/P99 growth
systems/dynamodb — used as JanusGraph's pluggable storage backend
systems/janusgraph — open-source distributed graph database; Airbnb runs an internal fork with custom optimizations
concepts/pluggable-storage-backend — JanusGraph's architecture separating graph logic from persistence
patterns/vendor-to-internal-graph-migration — structured migration using shadow-traffic benchmarking on shared query language (Gremlin)
patterns/client-side-query-rewriting — application-level Gremlin rewrites to avoid JanusGraph query planner inefficiencies