Skip to content

CONCEPT Cited by 1 source

Adjacency list storage

Adjacency list storage represents a graph by associating each source node with the list of its out-edges (or in-edges) — the direct opposite of an adjacency matrix that would represent edges as a sparse |V|×|V| boolean grid. For graph databases running over key-value substrates, adjacency-list storage is the canonical layout: the source node id becomes the partition key, and the connected target node ids are sorted clustering items within the partition.

On a KV substrate

If the underlying KV layer offers a HashMap<String, SortedMap<Bytes, Bytes>> shape (Netflix KV Abstraction does, and Cassandra's (partition key, clustering column) model does), the adjacency list maps onto the substrate without translation:

KV namespace = edge-link namespace (per direction, per edge type)
  partition key  : source-node-id
  clustering key : target-node-id (lex-sorted by default)
  value          : edge metadata (last_write_time, etc.)

A traversal "give me all edges from source S" becomes a single range read on partition SO(neighbours(S)) rather than the O(|E|) full scan the matrix layout would force.

Direction matters

Adjacency-list storage is intrinsically directional — partitioning by source node gives fast forward traversal but no reverse traversal. Two solutions in practice:

  1. Bidirectional edges — write the link twice, once per direction, so the link namespace contains both forward and reverse pairs (the forward + reverse edge index pattern).
  2. Reverse-only index — for unidirectional edge types, an auxiliary reverse-direction adjacency list provides reverse traversal at the cost of a second write per edge.

Netflix Graph Abstraction does both: every edge link is written to both forward and reverse adjacency lists at write time (patterns/forward-and-reverse-adjacency-index), so any traversal is a single range read on the appropriate partition.

Wide-row hazard

The adjacency list of a high-fanout node ("hub") can grow to millions of entries. On databases like Cassandra this triggers wide-row problems:

  • Partition size limits — Cassandra recommends partitions stay below ~100 MB for predictable performance; a hub with millions of edges blows past this.
  • Read amplification — a range read on a wide row touches many SSTables; latency degrades with partition size.

Netflix mitigates this by separating edge links from edge properties (patterns/separate-edge-links-from-properties) — the link namespace stays narrow because each link record is small (just the link identity + last-write timestamp), while property bags live in a separate property namespace. Verbatim: "Decoupling edge links from their properties prevents large partitions in databases like Cassandra, enabling efficient storage and low-latency reads — even for edges with millions of connections." (Source: sources/2026-05-29-netflix-high-throughput-graph-abstraction-at-netflix-part-i)

Even with the split, the post names a per-source-node edge-count bound — "to ensure optimal performance without exerting too much memory pressure, we aim to limit the number of edges per source node within the system" — as an operational invariant for the latest-sort fetch path (which sorts in memory). The specific cap is not disclosed.

Sort orders inside the adjacency list

Two canonical sort orders for the clustering items:

  • Lex-sorted by target node id (default in Netflix Graph Abstraction) — supports range queries and pagination without any post-processing.
  • Time-sorted by last-write timestamp — supports "latest connections" queries; requires either a separate index, an explicit secondary clustering column, or in-memory sort after fetch.

Netflix uses default lex order on disk + in-memory sort by last-write time for latest-connections traversals, with the edge-count bound above to keep the in-memory sort affordable.

Seen in

Last updated · 542 distilled / 1,571 read