Skip to content

PATTERN Cited by 1 source

Hybrid tree-graph ANN index

Problem

Pure graph ANN indices (e.g. HNSW) have state-of-the-art recall / latency when the corpus fits in RAM — their random-access edge traversal is memory-bandwidth-limited. Pure partitioning-tree indices scale well but suffer quality loss at recall targets near 100%. Neither shape is ideal when the corpus is larger than RAM and must live on SSD.

Solution

Combine a partitioning tree (organises vectors into posting lists on SSD; supports query-time pruning) with a graph over centroids (supports fast navigation between posting lists without exhaustive tree traversal). Posting lists live on SSD; only the centroid graph + a small working set need to be in RAM.

The tree gives you SSD-scale and locality-based pruning; the graph gives you high-recall navigation between partitions. The composition supports larger-than-RAM corpora without giving up query quality.

Canonical algorithm family: SPANN + SPFresh

  • SPANN (Microsoft Research, NeurIPS 2021) — the hybrid tree + graph design. Partitioning tree
  • centroid graph + SSD-resident posting lists. See concepts/spann-index.

  • SPFresh (Microsoft Research, SOSP 2023) — extension of SPANN adding concurrent background maintenance ops so the index absorbs continuous mutations without rebuilds or recall loss. See concepts/spfresh-index.

PlanetScale's 2024-10-22 vector-beta announcement names SPANN as "a hybrid vector indexing and search algorithm that uses both graph and tree structures, and was specifically designed to work well for larger-than-RAM indexes that require SSD usage." (Source: sources/2024-10-22-planetscale-planetscale-vectors-public-beta.)

Contrast with alternatives

  • HNSW (pure graph, RAM-resident) — best recall / latency if the corpus fits in RAM; rebuild required for sustained mutations. Disqualified for larger-than-RAM + incremental-update workloads.
  • DiskANN (pure graph, SSD-resident) — scales beyond RAM via Vamana graph + in-memory product-quantized vectors; tight SSD-read-budget traversal. Better than HNSW for billion-scale corpora but query latency is worse, and its incremental updates are "not particularly efficient and are hard to map to transactional SQL semantics" (PlanetScale). Appropriate when a read-mostly sidecar vector store is sufficient.

The hybrid tree-graph design (SPANN-family) is the distinctive point of this pattern relative to both pure-graph alternatives.

Why this pattern composes with database storage engines

The tree + posting-list + SSD-residency shape maps cleanly onto a page-oriented storage engine like InnoDB: posting lists become engine-managed pages; the buffer pool caches hot posting lists; page-granularity crash recovery covers vector mutations; sharding layers like Vitess shard the tree. This is the structural reason the SPANN/SPFresh family is the algorithm of choice for patterns/vector-index-inside-storage-engine — pure-graph HNSW and DiskANN don't compose with page-oriented storage engines the same way.

Trade-offs

  • Implementation complexity — two data structures (tree + graph) with cross-maintenance invariants.
  • Write amplification on reorganisations — posting-list splits / merges touch many pages; SPFresh's background ops smooth but don't eliminate this.
  • Parameter tuning — posting-list size, centroid count, tree fanout all affect recall / latency; tuning is non-trivial.

Seen in

Last updated · 319 distilled / 1,201 read