Skip to content

SYSTEM Cited by 1 source

SPANN

SPANN (Space-Partitioned Approximate Nearest Neighbors) is a hybrid tree + graph ANN index from Microsoft Research, specifically designed for larger-than-RAM corpora that require SSD residency. Published at NeurIPS 2021 (arXiv 2111.08566).

Architectural shape

  • Partitioning tree — vectors are partitioned into many posting lists (clusters) organised by a tree index. At query time, the tree is used to identify the small set of posting lists most relevant to the query vector.
  • Graph on centroids — the centroids themselves form a graph that supports fast nearest-list navigation without exhaustive tree traversal.
  • SSD residency — posting lists sit on SSD; only the centroid graph + a small working set are kept in RAM. This is the defining architectural property: the corpus does not need to fit in RAM, which decisively breaks with pure graph ANN indices like HNSW.

Why it matters for databases

The design target — larger-than-RAM-with-SSD-residency — matches exactly the problem shape a general-purpose relational database encounters as soon as a vector column grows beyond buffer-pool capacity. PlanetScale calls this out verbatim: "SPANN is a hybrid vector indexing and search algorithm that uses both graph and tree structures, and was specifically designed to work well for larger-than-RAM indexes that require SSD usage." (Source: .)

The InnoDB storage engine already has a page-level SSD-resident substrate (B+tree pages, buffer pool caching, write-ahead log, crash recovery). SPANN's posting-list-on-SSD model composes naturally with that substrate — the posting lists become InnoDB-managed pages.

Limitations of stock SPANN

Stock SPANN is not incrementally updatable as published — posting-list reorganisations (split / merge / rebalance) are done offline. Continuous updates are the job of SPFresh, a follow-up paper that extends SPANN with concurrent background maintenance.

Seen in

  • — PlanetScale's public-beta announcement names SPANN as the algorithmic foundation for its transactional vector index inside InnoDB. The actual implementation is PlanetScale's extension of SPFresh (the SPANN-plus- maintenance follow-up), integrated into InnoDB with transactional semantics added to all SPFresh operations.

  • GA announcement (2026-03-25), six months after beta. Makes an unambiguous first-RDBMS claim for SPANN: "We are the first to incorporate an index based on SPANN (Space-Partitioned Approximate Nearest Neighbors) into an RDBMS." Links the original SPANN paper (arXiv 2111.08566) directly. New architectural detail: "vectors are assigned to small partitions, which are stored in hidden InnoDB tables. One vector from each partition, around 20% of the index, is stored in a tree structure in memory." First wiki datum on SPANN's ~80/20 RAM/SSD split in PlanetScale's implementation. GA also discloses the operational ceiling — the index "performs well even when … 6× larger than available memory" (see concepts/larger-than-ram-vector-index) — and that SPANN partition I/O is the specific workload PlanetScale Metal's direct-attached NVMe substrate accelerates. Minor ambiguity: GA prose says "stored in a tree structure in memory" without re-invoking the centroid graph from the SPANN paper — unclear whether PlanetScale's implementation uses pure tree or tree+graph over centroids. The algorithmic description here retains the SPANN-paper tree+graph framing per Microsoft Research.

  • sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databasesengineering deep-dive resolves the GA ambiguity. Vicent Martí confirms the head index is HNSW over centroids (explicit departure from the paper's BK / K-D trees): "an optimized implementation of HNSW beats these other data structures in every possible metric." The ~80/20 split in memory use is quantified via the 30% operational memory ceiling (concepts/larger-than-ram-vector-index). Additional critiques of the paper that PlanetScale resolved: (a) SPANN's complex clustering algorithm replaced with random sampling"the law of large numbers ensures that our random sampling is representative"; (b) SPANN's insert path assumed LSM-backed storage; PlanetScale's InnoDB integration requires the composite-index LSM emulation to preserve append-only insert semantics on a B-tree.

Last updated · 542 distilled / 1,571 read