AIRBNB

Scaling Airbnb's identity graph with a unified knowledge graph infrastructure

Summary

Airbnb built an internally managed, multi-tenant knowledge graph infrastructure to replace a fragmented landscape of graph workloads (SQL-modeled "graphs," offline-only warehouse graphs, self-managed open-source DBs, third-party PaaS). The platform is built on JanusGraph (Apache TinkerPop / Gremlin) with DynamoDB as the storage backend and OpenSearch for indexing. The identity graph — 7 billion nodes, 11 billion edges, growing at 5 million edges/day — was the first tenant to migrate from a third-party SaaS graph database to this internal platform. The migration delivered significant latency reductions (especially at P95/P99), eliminated periodic manual reboots, and achieved 10× write QPS scalability via auto-scaling.

Key takeaways

  1. Scale of the identity graph: 7B nodes, 11B edges, ~5M new edges/day. Queries span 4–8 hops for Trust & Safety use cases (fraud detection, linked-account identification).

  2. Four anti-patterns before the platform: Teams modeled graphs in SQL (expensive joins), built offline-only warehouse graphs (daily-stale), self-managed open-source graph DBs (high ops toil), or used third-party managed PaaS (vendor lock-in, performance ceilings, limited tunability).

  3. Technology choice — JanusGraph + DynamoDB: JanusGraph was chosen for its labeled property graph model, expressive Gremlin queries, pluggable storage backend (enabling DynamoDB for persistence), and open-source codebase extensibility. Storage separation is the key architectural advantage: iterate on graph logic without reinventing distributed storage.

  4. Multi-tenant platform with namespace isolation: Each tenant (identity graph, fraud, inventory knowledge graph, data lineage) operates in an isolated namespace. A management service handles schema enforcement, index management, and schematized Thrift APIs.

  5. Optimized transactions via DynamoDB conditional writes: Replaced JanusGraph's default heavy locking with a custom transaction strategy using DynamoDB's conditional writes and transaction APIs — lower overhead while preserving data integrity.

  6. Parallel multi-slice fetching: Improved JanusGraph's getMultiSlices interface to fetch data in parallel, significantly reducing latency for high-fanout graph queries (the dominant latency driver at 4–8 hops).

  7. Client-side query rewriting: (a) Removed Gremlin Path/SimplePath steps that fell back to non-batched backend queries — replaced with conditional queries ensuring acyclic results. (b) Restructured side-effect aggregation steps to avoid non-batched substeps in JanusGraph's query planner.

  8. Observability gap closed: Integrated Airbnb's distributed tracing into the internal JanusGraph fork, enabling end-to-end request tracing that the open-source version lacked.

  9. Migration gains: Performance improvement across all query patterns; significant P99 latency reduction demonstrating long-tail taming; no more periodic manual instance reboots; 10× write QPS via auto-scaling; faster and more transparent incident investigation.

  10. Shadow-traffic benchmarking: Both old (vendor) and new (internal) engines supported Gremlin, enabling side-by-side query benchmarking on shadow traffic before cutover.
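
The shadow-traffic comparison in item 10 can be sketched as a small replay harness. This is only an illustration under stated assumptions: `Engine`, `shadow_compare`, and `percentile` are hypothetical names, and the two engines are modeled as plain callables that take a Gremlin query string and return a list of results. The source does not describe Airbnb's actual harness.

```python
import time
from typing import Callable, Iterable, List

# Hypothetical engine interface: a Gremlin query string in, a list of results out.
Engine = Callable[[str], list]


def shadow_compare(queries: Iterable[str], old: Engine, new: Engine) -> dict:
    """Replay each query against both engines; record latencies and mismatches."""
    old_ms: List[float] = []
    new_ms: List[float] = []
    mismatches: List[str] = []
    for query in queries:
        t0 = time.perf_counter()
        expected = old(query)
        t1 = time.perf_counter()
        actual = new(query)
        t2 = time.perf_counter()
        old_ms.append((t1 - t0) * 1000)
        new_ms.append((t2 - t1) * 1000)
        # Compare order-insensitively: traversal result order is often unspecified.
        if sorted(map(repr, expected)) != sorted(map(repr, actual)):
            mismatches.append(query)
    return {"old_ms": old_ms, "new_ms": new_ms, "mismatches": mismatches}


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for p in 1..100 (e.g. p=95 or p=99)."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]
```

Comparing `percentile(report["old_ms"], 99)` with `percentile(report["new_ms"], 99)` is one way to look at the long-tail behavior the takeaways emphasize. Running the two engines back to back, as here, is a simplification; a production harness would more likely mirror traffic asynchronously.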

Operational numbers

Metric                      Value
--------------------------  ------------------------
Graph nodes                 7 billion
Graph edges                 11 billion
Daily edge growth           ~5 million/day
Typical query hops          4–8
Write QPS improvement       10× over vendor solution
Vendor reboots eliminated   Periodic → zero
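
For a sense of scale, a few derived figures follow directly from the numbers above. These are averages computed from the published totals only; peak write rates and the degree distribution are not disclosed, so the variable names and the uniform-rate assumption are illustrative.

```python
NODES = 7_000_000_000
EDGES = 11_000_000_000
NEW_EDGES_PER_DAY = 5_000_000
SECONDS_PER_DAY = 86_400

# Average edge-creation rate if writes were spread evenly over the day.
avg_edge_writes_per_sec = NEW_EDGES_PER_DAY / SECONDS_PER_DAY  # about 57.9

# Ratio of edges to nodes (each edge touches two nodes, so mean degree is twice this).
edges_per_node = EDGES / NODES  # about 1.57

# Daily growth relative to the current edge count, in percent.
daily_growth_pct = NEW_EDGES_PER_DAY / EDGES * 100  # about 0.045
```

The low average write rate suggests that the 10× write QPS headroom matters mostly for bursts and backfills rather than steady-state ingestion, though the source does not say so explicitly.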

Architecture

┌─────────────────────────────────────────────┐
│       Airbnb Knowledge Graph Platform       │
├─────────────────────────────────────────────┤
│  Management Service                         │
│  (schema enforcement, index mgmt, Thrift)   │
├─────────────────────────────────────────────┤
│  JanusGraph Engine (internal fork)          │
│  ├─ Custom transaction strategy             │
│  ├─ Parallel getMultiSlices                 │
│  └─ Distributed tracing integration         │
├─────────────────────────────────────────────┤
│  Storage: DynamoDB   │  Indexing: OpenSearch│
└─────────────────────────────────────────────┘
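
The "custom transaction strategy" in the engine layer relies on conditional writes instead of heavyweight locks. The sketch below shows the general optimistic-concurrency pattern such writes enable, using an in-memory stand-in rather than a real DynamoDB client. `FakeTable`, `put_if_version`, `update_vertex_property`, and the `version` attribute are all hypothetical; in DynamoDB itself the condition would be expressed with a `ConditionExpression` on a write or through the transaction APIs. Airbnb's actual strategy is not disclosed.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for the error a store raises when a write condition is not met."""


class FakeTable:
    """In-memory stand-in for a key-value table that supports conditional puts."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        return dict(item) if item is not None else None

    def put_if_version(self, key, item, expected_version):
        """Write only if the stored version equals expected_version (None = absent)."""
        current = self._items.get(key)
        current_version = current["version"] if current is not None else None
        if current_version != expected_version:
            raise ConditionalCheckFailed(key)
        self._items[key] = dict(item)


def update_vertex_property(table, key, prop, value, max_retries=3):
    """Read-modify-write without locks: retry if another writer got there first."""
    for _ in range(max_retries):
        current = table.get(key)
        version = current["version"] if current is not None else None
        updated = dict(current or {})
        updated[prop] = value
        updated["version"] = (version or 0) + 1
        try:
            table.put_if_version(key, updated, expected_version=version)
            return updated
        except ConditionalCheckFailed:
            continue  # stale read; re-read and try again
    raise RuntimeError(f"gave up updating {key!r} after {max_retries} attempts")
```

The design trade-off is that no lock is held between the read and the write: conflicts are detected at write time and retried, which tends to be cheaper than locking when contention on a single vertex is rare.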

Data ingestion: near real-time via asynchronous events (write path). Serving: low-latency real-time service calls (read path). Read/write traffic isolation at the graph computation engine layer.
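
On the read path, the parallel multi-slice fetching mentioned in the takeaways amounts to issuing the per-key storage reads concurrently instead of one after another. A minimal sketch of that idea, assuming a hypothetical `fetch_slice(key)` callable that returns the stored slice for one key (this is not JanusGraph's `getMultiSlices` signature, which is a Java interface):

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_slices_parallel(keys, fetch_slice, max_workers=8):
    """Fetch the slice for every key concurrently and return {key: slice}."""
    keys = list(keys)
    if not keys:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input keys.
        return dict(zip(keys, pool.map(fetch_slice, keys)))
```

With a high-fanout frontier of N keys, total latency moves from roughly the sum of N backend round trips toward the slowest round trip in each batch of `max_workers`, which is why this matters most for 4–8 hop traversals.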

Caveats

  • No quantitative latency numbers disclosed (only "significant" P99 reduction, "huge improvement").
  • Graph query patterns and hop-depth distribution beyond "4–8" not detailed.
  • DynamoDB partition key strategy for the graph storage not disclosed.
  • OpenSearch indexing strategy (which properties indexed, refresh intervals) not disclosed.
  • Auto-scaling configuration (triggers, warmup, cooldown) not disclosed.
  • Multi-region / disaster recovery posture not discussed.
  • Cost comparison (vendor vs. internal) not disclosed.
  • The "client-side query optimization" implies that JanusGraph's planner for Gremlin queries has known inefficiencies that require application-level workarounds.
  • Future plans not detailed beyond "unified knowledge graph infrastructure."
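
To make the client-side query rewriting concrete, the following sketch shows one way to get acyclic multi-hop results without path tracking: expand hop by hop, issue one batched neighbor lookup per hop, and filter with a "not already visited" condition. This is an analogy in Python, not Airbnb's Gremlin rewrite, which the source does not show. `linked_within` and `batch_neighbors` are hypothetical names, and a visited-set filter returns each vertex once rather than enumerating every simple path.

```python
def linked_within(start, max_hops, batch_neighbors):
    """Return all vertices reachable from start in at most max_hops hops.

    batch_neighbors(vertices) is a hypothetical callable that returns
    {vertex: [neighbor, ...]} for the whole frontier in a single call.
    """
    visited = {start}
    frontier = {start}
    for _ in range(max_hops):
        if not frontier:
            break
        adjacency = batch_neighbors(frontier)  # one batched lookup per hop
        next_frontier = set()
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                # The condition that keeps results acyclic: skip seen vertices.
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    visited.discard(start)
    return visited
```

For example, on a graph with edges a→b, b→c, c→a, and c→d, `linked_within("a", 2, ...)` yields `{"b", "c"}` using exactly two batched lookups, and the cycle back to `a` is never re-expanded.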
