PATTERN Cited by 1 source

Hybrid tree-graph ANN index¶

Problem¶

Pure graph ANN indices (e.g. HNSW) have state-of-the-art recall / latency when the corpus fits in RAM — their random-access edge traversal is memory-bandwidth-limited. Pure partitioning-tree indices scale well but suffer quality loss at recall targets near 100%. Neither shape is ideal when the corpus is larger than RAM and must live on SSD.

Solution¶

Combine a partitioning tree (organises vectors into posting lists on SSD; supports query-time pruning) with a graph over centroids (supports fast navigation between posting lists without exhaustive tree traversal). Posting lists live on SSD; only the centroid graph + a small working set need to be in RAM.

The tree gives you SSD-scale and locality-based pruning; the graph gives you high-recall navigation between partitions. The composition supports larger-than-RAM corpora without giving up query quality.

Canonical algorithm family: SPANN + SPFresh¶

SPANN (Microsoft Research, NeurIPS 2021) — the hybrid tree + graph design. Partitioning tree
centroid graph + SSD-resident posting lists. See concepts/spann-index.
SPFresh (Microsoft Research, SOSP 2023) — extension of SPANN adding concurrent background maintenance ops so the index absorbs continuous mutations without rebuilds or recall loss. See concepts/spfresh-index.

PlanetScale's 2024-10-22 vector-beta announcement names SPANN as "a hybrid vector indexing and search algorithm that uses both graph and tree structures, and was specifically designed to work well for larger-than-RAM indexes that require SSD usage." (Source: .)

Contrast with alternatives¶

HNSW (pure graph, RAM-resident) — best recall / latency if the corpus fits in RAM; rebuild required for sustained mutations. Disqualified for larger-than-RAM + incremental-update workloads.
DiskANN (pure graph, SSD-resident) — scales beyond RAM via Vamana graph + in-memory product-quantized vectors; tight SSD-read-budget traversal. Better than HNSW for billion-scale corpora but query latency is worse, and its incremental updates are "not particularly efficient and are hard to map to transactional SQL semantics" (PlanetScale). Appropriate when a read-mostly sidecar vector store is sufficient.

The hybrid tree-graph design (SPANN-family) is the distinctive point of this pattern relative to both pure-graph alternatives.

Why this pattern composes with database storage engines¶

The tree + posting-list + SSD-residency shape maps cleanly onto a page-oriented storage engine like InnoDB: posting lists become engine-managed pages; the buffer pool caches hot posting lists; page-granularity crash recovery covers vector mutations; sharding layers like Vitess shard the tree. This is the structural reason the SPANN/SPFresh family is the algorithm of choice for patterns/vector-index-inside-storage-engine — pure-graph HNSW and DiskANN don't compose with page-oriented storage engines the same way.

Trade-offs¶

Implementation complexity — two data structures (tree + graph) with cross-maintenance invariants.
Write amplification on reorganisations — posting-list splits / merges touch many pages; SPFresh's background ops smooth but don't eliminate this.
Parameter tuning — posting-list size, centroid count, tree fanout all affect recall / latency; tuning is non-trivial.

Seen in¶

— PlanetScale's transactional vector index is built on the SPANN + SPFresh family, chosen for its larger-than-RAM
SSD-resident + incrementally-updatable composition.
— GA announcement. Makes a first-RDBMS claim for SPANN and links the original SPANN paper (arXiv 2111.08566). Concrete new architectural datum: "One vector from each partition, around 20% of the index, is stored in a tree structure in memory" — the ~80/20 SSD/RAM split of the hybrid index. The GA prose uses "tree structure" without explicitly re-invoking the centroid graph from the SPANN paper; the algorithmic description here retains the tree+graph framing per Microsoft Research.
sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases — engineering deep-dive resolves the tree-vs-graph ambiguity. The head index in PlanetScale's implementation is explicitly an HNSW graph over a random 20% sample of centroids, not a pure tree: "our hybrid indexes use HNSW for the head index." Random sampling replaces the SPANN paper's BK-Tree-based clustering — "random sampling provides good results in practice". The pattern's architectural composition with InnoDB requires two mechanisms not in the original SPANN paper: the composite-index LSM emulation for append-cheap posting-list inserts, and the WAL-tied in-memory head-index mutation for crash-safe maintenance ops.