PlanetScale — Larger than RAM Vector Indexes for Relational Databases¶
Summary¶
Vicent Martí (PlanetScale) presents the engineering-level
design of PlanetScale's production vector index inside
MySQL / InnoDB, written
after two years of development. The article articulates why
none of the widely-published ANN index designs (HNSW, SPANN, DiskANN) fits a relational database out of the box as published, and walks through the novel
solutions PlanetScale needed to invent to make an ANN index
behave like any other
transactional index: a
20% RAM head index (HNSW on centroids) + 80% SSD posting
lists inside InnoDB's B+tree, a
composite-index LSM emulation on top of B-trees so
appending to posting lists avoids full blob rewrites,
versioned vectors so reassignment + deletion are
append-only, and a WAL-tied
head-index mutation path so the in-memory HNSW stays in
sync with on-disk postings across crashes.
SPFresh's LIRE protocol (Split / Reassign / Merge, extended with PlanetScale's Defragment) runs as bounded background jobs that keep posting-list sizes uniform. The core
engineering thesis: ANN indexes are approximate only in
recall, not in transactional behaviour — committing a
thousand vectors means the next SELECT considers all of
them; aborting means none of them appear; and the
invariants hold across crashes.
Key takeaways¶
- Most ANN research assumes RAM residency and static datasets — a non-starter for relational databases. Martí opens with the gap verbatim:
"The majority of existing publications discuss data structures that must fit in RAM — a non starter for any relational database! Many of them expect the indexes to be static, and indexes in a relational database very much are not." (Source: sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases)
He notes that the few papers on continuous updates "pay no attention to the transactional requirements of such operations, nor how to store them in a reliable and crash resilient way."
- PlanetScale ships two vector-index variants. The first is a fully in-memory HNSW with transactional guarantees — 99.9% recall, if you accept allocating RAM for the full corpus and that "incrementally updating these indexes is very expensive." The second is the larger-than-RAM hybrid index that is the subject of the article:
"we've made a significant effort to design a new kind of hybrid vector index that really behaves like you'd expect an index in a relational database to behave."
- The architectural thesis: approximate recall ≠ approximate transactionality. Martí calls out the temptation to cut corners precisely because an ANN result is already approximate, and explicitly rejects it:
"We don't think this is a reasonable approach when implementing a vector index for a relational database. … if you've just committed a thousand vectors, the next
SELECTwill consider all of them for a similarity query, even if not all of them are returned because the recall is not perfect. This means that if you abort the transaction with the thousand vectors, they will not appear in the index at all. And this means that this all applies always, whether your database is in the process of failing over or recovering from a crash."
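A minimal sketch of what that guarantee means in practice — sqlite3 standing in for InnoDB as a transactional B-tree engine, with an invented posting-table layout (PlanetScale's actual schema appears further down):

```python
import sqlite3

# sqlite3 stands in for InnoDB: both are transactional B-tree engines.
# The table layout is invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE postings (head_id INTEGER, seq INTEGER, vec BLOB, "
           "PRIMARY KEY (head_id, seq))")

# Commit a thousand vectors: the next SELECT considers all of them.
with db:  # commits on success
    db.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                   [(1, s, b"\x00" * 12) for s in range(1000)])
assert db.execute("SELECT COUNT(*) FROM postings").fetchone()[0] == 1000

# Abort a thousand vectors: none of them ever appear in the index.
try:
    with db:  # rolls back on exception
        db.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                       [(1, s, b"\x00" * 12) for s in range(1000, 2000)])
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass
assert db.execute("SELECT COUNT(*) FROM postings").fetchone()[0] == 1000
```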
- 20% RAM head index + 80% SSD posting lists, the SPANN-family shape. PlanetScale's index has two layers:
  - Head index — an in-memory HNSW graph over a random sample of vectors (configurable; default 20%). Each head serves as a centroid.
  - Posting lists — the remaining 80% of vectors stored as raw binary blobs, one posting list per head, held in InnoDB as values keyed by the head vector's ID.
The target posting-list size "depends on the dimensionality of the vectors; the goal is ensuring that all posting lists are of similar size in bytes and that they're never large enough to slow down InnoDB lookups for their blobs." Randomly sampling rather than clustering centroids is deliberate: "when the dataset is large enough, the law of large numbers ensures that our random sampling is representative." The outcome: the operational index uses ≤30% of the memory the dataset occupies, and only 20% is allocated steady-state.
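A toy sketch of that two-layer shape under the stated defaults — random sampling for heads, nearest-head assignment for the rest. A brute-force scan stands in for the HNSW graph, edge replication is omitted, and all names are illustrative:

```python
import math
import random

def build_hybrid_index(vectors, head_fraction=0.20):
    # Randomly sample ~20% of vectors as heads (centroids); on a large
    # dataset the law of large numbers makes the sample representative.
    head_ids = random.sample(range(len(vectors)),
                             max(1, int(len(vectors) * head_fraction)))
    heads = {h: vectors[h] for h in head_ids}   # in RAM (HNSW in reality)
    postings = {h: [] for h in head_ids}        # on SSD inside InnoDB
    for vid, vec in enumerate(vectors):
        # Brute-force nearest head; production uses an HNSW search here.
        nearest = min(heads, key=lambda h: math.dist(vec, heads[h]))
        postings[nearest].append(vid)
    return heads, postings

vectors = [[random.random() for _ in range(3)] for _ in range(1000)]
heads, postings = build_hybrid_index(vectors)
assert len(heads) == 200
```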
- Explicit improvement over SPANN in centroid-structure choice and clustering. The original SPANN paper proposed BK Trees and K-D Trees for the head index. Martí:
"It is not clear to us why HNSW was not evaluated in the paper, but in our experiments, an optimized implementation of HNSW beats these other data structures in every possible metric (performance, recall, construction time), so our hybrid indexes use HNSW for the head index."
Likewise on clustering: "the SPANN paper discusses a complex clustering algorithm based on BK Trees … after thorough testing with real-life datasets, we concluded that random sampling provides good results in practice."
- B-tree blob appends are catastrophic — the LSM-emulation-on-B-tree is the core invention of this article. The naive UPDATE … SET postings = CONCAT(postings, :new) has "atrocious" performance because blobs are either stored inline or out-of-tree as LOBs, and in both cases "adding data to the end of a blob is just not possible, because after each blob there's always … huh, something else." A blob update means copying the old blob, reserving tail space, and taking a row-level lock for the read-then-write. PlanetScale's solution:
"our composite index uses key a tuple of
(head_vector_id, sequence), wheresequenceis a table-local strictly increasing sequence number. With this, we can approach posting list updates in a very efficient way: we insert a new row into the table with thehead_vector_idwe're appending to and the next sequence number for the table."
Lookups become a B-tree range scan on the prefix
head_vector_id, and "B-trees are very good at range
scans, because adjacent keys are stored next to each
other in the pages." This is the canonical
composite-index trick
applied as an LSM-style append log on top of a B-tree —
avoiding the MyRocks storage-engine switch that InnoDB-default PlanetScale users would refuse.
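A runnable sketch of the composite-key layout, again with sqlite3 standing in for InnoDB (both keep rows ordered by primary key in a B-tree, so a key prefix is a contiguous range). Allocating the sequence with MAX is a simplification of the table-local counter:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE postings (
    head_vector_id INTEGER,
    seq            INTEGER,   -- table-local, strictly increasing
    chunk          BLOB,      -- a batch of appended vector data
    PRIMARY KEY (head_vector_id, seq))""")

def append_to_posting_list(head_vector_id, chunk):
    # An append never reads or rewrites the existing blob: it is a plain
    # INSERT of a new row under the same key prefix.
    seq = db.execute("SELECT COALESCE(MAX(seq), 0) + 1 FROM postings") \
            .fetchone()[0]
    db.execute("INSERT INTO postings VALUES (?, ?, ?)",
               (head_vector_id, seq, chunk))

def read_posting_list(head_vector_id):
    # A lookup is a range scan on the key prefix; the B-tree keeps these
    # rows adjacent in its pages, so the scan is cheap.
    rows = db.execute("SELECT chunk FROM postings "
                      "WHERE head_vector_id = ? ORDER BY seq",
                      (head_vector_id,))
    return b"".join(chunk for (chunk,) in rows)

append_to_posting_list(7, b"\x01\x02")
append_to_posting_list(7, b"\x03\x04")
assert read_posting_list(7) == b"\x01\x02\x03\x04"
```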
- The LSM trick that motivated all of this — merge-flagged key/value pairs. Martí explains why LSM trees (MyRocks / RocksDB) are "uniquely good" for SPANN-style vector inserts:
"you can tag the pair with a special flag (most implementations call it a 'merge' flag) when inserting it into the memtable. … the compaction algorithm can see the 'merge' flag and instead of considering the value for the key as newer, and replacing any existing on-disk values, it just… merges it! … the append no longer needs to remove the old value, and it doesn't need to insert a brand new one either; you just insert the new vector data into the memtable and eventually they'll be merged on disk. Most critically, you don't even need to read the old value at insert time so you can append the new data at the end; the concatenation will be performed atomically by the compaction algorithm. So there's no locking involved during inserts either."
PlanetScale did not switch engines to get this — "the average CRUD app running on PlanetScale just performs better on InnoDB than on MyRocks, and forcing users to switch storage engines … seemed like an unacceptable barrier to adoption."
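A toy model of the merge-flag mechanics the quote describes — not RocksDB's actual merge-operator API, just its shape:

```python
disk = {}        # compacted on-disk state: key -> bytes
memtable = []    # (key, flag, payload) in arrival order

def lsm_append(key, data):
    # Write-only: no read of the old value, no lock, no blob rewrite.
    memtable.append((key, "merge", data))

def compact():
    # Compaction sees the merge flag and concatenates instead of
    # replacing; the append becomes durable here, atomically.
    global memtable
    for key, flag, payload in memtable:
        if flag == "merge":
            disk[key] = disk.get(key, b"") + payload
        else:
            disk[key] = payload
    memtable = []

lsm_append("posting:7", b"\x01\x02")
lsm_append("posting:7", b"\x03\x04")
compact()
assert disk["posting:7"] == b"\x01\x02\x03\x04"
```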
- Edge-adjacent vectors are replicated across neighbour posting lists to preserve recall. "A single vector can land in one or more posting lists, because vectors that are too close to the edge of a cluster will be replicated in neighbouring clusters to increase recall." Insert = ANN-search on the head index for all target posting lists, then append to each.
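A sketch of that insert path. The 10% edge threshold is invented for illustration — the article doesn't quote the replication criterion:

```python
import math

def insert_vector(vec, heads, postings, edge_slack=0.10):
    # ANN-search the head index (brute force here) for every posting list
    # the vector belongs in; near-edge vectors land in several of them.
    ranked = sorted(heads, key=lambda h: math.dist(vec, heads[h]))
    nearest_d = math.dist(vec, heads[ranked[0]])
    targets = [h for h in ranked
               if math.dist(vec, heads[h]) <= nearest_d * (1 + edge_slack)]
    for h in targets:
        postings[h].append(vec)   # each append is the composite-key INSERT
    return targets
```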
- SPFresh LIRE protocol runs as background jobs — Split, Reassign, Merge, plus PlanetScale's Defragment. Four operations hold the index steady under continuous writes:
  - Split. Triggered "whenever we detect that a posting list has grown too large after appending new vectors to it." Martí's extension: K-means splits into K clusters where K ≥ 2, sized by the expected posting-list size, not always 2 as in the LIRE paper. "When the index is under high insertion load, a single posting list can grow very large before its corresponding background job gets to pick it up for splitting, so being able to immediately split it into multiple smaller postings is an important optimization."
  - Reassign. Splits break the Nearest Partition Assignment (NPA) invariant — some vectors end up in posting lists whose centroid is no longer nearest. SPFresh's mathematical filter: "if the split results in a shorter distance, we know that it is optimal and does not need to be reassigned." Only candidates whose distance to their new head grew are considered — which "greatly cuts down on the amount of vectors we need to consider." (See the sketch after this list.)
  - Merge. Triggered when "a posting list which is large as stored in InnoDB but contains very few actually useful vectors" (due to stale reassigned versions) is observed during a query. Merges nearby small posting lists back into one, recomputing the centroid.
  - Defragment. "Run when the index is under heavy load: it compacts the underlying rows in InnoDB without removing stale vectors or merging nearby postings, and allows the index to remain performant throughout periods of continuous insertions."
- Vector versioning makes reassignment and deletion lock-free appends. "There is no free lunch when it comes to removing data from a posting list" — removal requires reading and rewriting the whole list on either a B-tree or an LSM. SPFresh's elegant workaround:
"the data inside the posting lists is actually a tuple of (vector_id, version, vector_data), where version is a 1-byte version counter that keeps increasing until it wraps around. The fresh version for each vector in the index is kept in an in-memory table (a table which in practice is tiny because each entry only occupies 1 byte). When we perform a reassignment, we increase the version count of each moved vector by one, and the posting data appended on its new location has the new version count. Any subsequent queries that load the posting list for the old location will see vectors whose version numbers are stale, and these vectors won't be considered when performing distance calculations for the query."
Deletes are flag flips on the versions table: "Deletes are performed by marking a delete flag in the versions table — the same table that is used to mark a vector stale because it's been moved to another posting list." No posting-list rewrite is needed; stale vectors are eventually compacted by Merge / Split / Defragment.
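A minimal model of the versioning scheme, following the quoted (vector_id, version, vector_data) layout; the dict and set are illustrative stand-ins for the in-memory versions table and its delete flag:

```python
fresh_version = {}   # vector_id -> current 1-byte version
deleted = set()      # vector_ids with the delete flag set

def reassign(vector_id, vector_data, new_posting):
    # A move is an append at the new location with a bumped version; the
    # stale copy in the old posting list is never touched.
    fresh_version[vector_id] = (fresh_version.get(vector_id, 0) + 1) % 256
    new_posting.append((vector_id, fresh_version[vector_id], vector_data))

def delete(vector_id):
    deleted.add(vector_id)   # a flag flip; no posting-list rewrite

def live_entries(posting):
    # Queries skip stale or deleted versions; Merge / Split / Defragment
    # physically remove them later.
    for vector_id, version, data in posting:
        if vector_id not in deleted and \
           version == fresh_version.get(vector_id, 0):
            yield vector_id, data
```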
- Updates lift to the SQL-row level, not the index level. "Updates of vector data are, in our experience, a much more rare occurrence, so we've opted to implement them higher in the stack, at the InnoDB level, by transparently updating the vector and the vector ID in the user-facing table that contains the vectors. We then translate this into a delete of the old vector ID and an insert of the new one into the vector index."
- The head index + on-disk postings stay in sync via an InnoDB-tied WAL. The HNSW head index lives in memory for performance but must survive crashes. The mechanism is a write-ahead log tied to the InnoDB transaction for each maintenance job:
"To keep the memory index in an always-consistent state, we use a Write Ahead Log tied to the InnoDB transaction for each job. Changes to the HNSW index are stored in a WAL that will eventually be committed together with the changes to the posting lists."
Recovery: "we load the last serialized form of the HNSW index (stored in an on-disk blob) and re-apply all the changes from the InnoDB WAL."
- WAL compaction is a background process that runs by pausing all maintenance jobs — user traffic is not affected. The trick is the read-only nature of the head index under normal user traffic:
"selects, inserts, updates and deletes to the vector index do not modify the Head Index — all these operations only require a read-only consistent view of it. Only the background jobs actually perform modifications on the Head Index in order to increase recall and performance. Hence, WAL compaction can be performed by pausing all background jobs in the system while the compaction is running."
User-facing SELECT / INSERT / UPDATE / DELETE continue without contention; any background jobs triggered during compaction queue up.
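A sketch of the pause discipline; a single mutex is a crude stand-in for a shared/exclusive latch (real jobs would run concurrently with one another), and user-facing queries never touch it:

```python
import threading

# Only background jobs mutate the head index, so compaction just pauses
# them; user traffic only ever takes a read-only view and never blocks.
jobs_lock = threading.Lock()

def run_background_job(job):
    with jobs_lock:   # queues behind compaction if one is running
        job()         # Split / Reassign / Merge / Defragment

def compact_wal(serialize_head_index, truncate_wal):
    with jobs_lock:              # pause all maintenance jobs...
        serialize_head_index()   # ...snapshot the HNSW index to disk
        truncate_wal()           # ...and drop the now-redundant WAL
```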
Operational numbers¶
- Head index size at steady state: ~20% of the total vector dataset (configurable; smaller = less memory, worse recall).
- Total memory used for an operational index: ≤30% of the dataset size.
- HNSW (in-memory variant) recall: 99.9%.
- Version counter per vector: 1 byte (wraps around).
- OpenAI embedding dimensions cited for context: 1,536 or 3,072 dimensions per vector.
- Example dimensions used throughout the article: 3 (for visualisation only).
Architectural diagrams / visual aids¶
The article uses interactive 3D graph animations (not
reproducible here) to illustrate HNSW layer descent, posting-
list splits breaking the NPA invariant, and reassignments
fixing it. There's an LSM-tree flush diagram (PlanetScale
image asset) and code blocks showing the naive UPDATE …
SET postings = CONCAT(…) SQL.
Caveats / gaps¶
- No production performance numbers. The article is architectural; it doesn't report query latency, recall, or write throughput on production workloads. The beta announcement (2024-10-22) and GA announcement (2026-03-25) carry the operational claims — 99.9% recall for in-memory HNSW, 2× query perf + 8× memory efficiency between beta and GA, 6× larger-than-RAM working ceiling, 1-bit quantization, 16,383-dim limit. This deep-dive is their engineering counterpart, not a performance paper.
- No disclosure of the InnoDB-WAL coupling mechanism for the HNSW write-ahead log. The text says the HNSW changes "will eventually be committed together with the changes to the posting lists" — the mechanism by which the HNSW WAL is durable before commit (is it written through InnoDB's redo log? a separate log?) is not detailed.
- Head-index compaction pause is "trivial in practice" — but pause latency is not quoted. Martí frames it as a non-issue because user traffic doesn't hit the head index for mutations, but the actual pause duration (and whether background work backs up dangerously during it) isn't stated.
- K-means clustering cost during high-insert splits is acknowledged but not quantified. The K-way split optimisation is motivated by "when the index is under high insertion load, a single posting list can grow very large before its corresponding background job gets to pick it up" — the concrete threshold and K values aren't cited.
- Reassignment heuristic is conservative — candidates may not actually need moving. The "distance got longer → maybe reassign" check is a necessary-but-not-sufficient filter; the article says "maybe (but not necessarily!) the vector is in a wrong posting list". The full heuristic's recall vs reassignment-cost trade-off isn't numerically reported.
- 1-byte version counter wraparound risk not explored. 256-version wraparound on a heavily-reassigned vector could in principle alias a stale version as fresh. The article doesn't describe the wraparound-handling protocol.
- No mention of Vitess sharding interaction. The beta announcement says the vector index composes with Vitess; this engineering article focuses on single-shard internals and doesn't revisit the sharding layer.
- Borrowed LIRE protocol — not invented at PlanetScale. Split + Reassign + Merge are from the SPFresh paper (Microsoft Research, SOSP 2023). PlanetScale's novel contributions in the maintenance layer are (a) K-way split instead of 2-way, (b) the Defragment op ("compacts the underlying rows in InnoDB without removing stale vectors or merging nearby postings"), and (c) the transactional + WAL-coupled integration with InnoDB. The core Split/Reassign/Merge algorithms are not novel.
Cross-source continuity¶
- Engineering companion to sources/2026-04-21-planetscale-how-planetscale-keeps-your-data-safe and the prior beta / GA vector announcements. PlanetScale's 2024-10-22 public-beta announcement introduced the SPFresh-in-InnoDB design at marketing altitude; the 2026-03-25 GA post added the 6×-larger-than-RAM operational ceiling and the "hidden InnoDB tables" disclosure. This article is the engineering deep-dive that fills in the how — the composite-index B-tree layout, the versioned-deletion trick, the WAL-tied head-index persistence, the K-way split. The three posts together form the canonical wiki coverage of PlanetScale's vector index.
- Aligned with Ben Dicken's B-trees pedagogy post — Martí explicitly links to it for B-tree background: "my friend Ben has a very good post in this same blog called 'B-trees and database indexes' which you should definitely check out." The linked post is the canonical wiki B-tree reference.
- SPANN paper cited directly — SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search, NeurIPS 2021. HNSW paper cited directly — arXiv 1603.09320. SPFresh paper referenced as "a follow-up paper from Microsoft research" but not linked inline — Martí uses a text-only [SPFresh](spfresh) hyperlink token, likely an editorial placeholder. The system page carries the full DOI.
- LSM / MyRocks / RocksDB framing — Martí treats LSM storage engines as the "dastardly cousin of the B-tree" and names MyRocks as "an open-source LSM-based storage engine for MySQL … developed by Meta." The PlanetScale article functions as wiki-level canonical framing of the LSM-vs-B-tree tradeoff for append-heavy indexes.
- Relational-database-as-long-term-substrate thesis — Martí closes: "relational databases are the past, the present and the future of data storage, and they're long overdue for a well tailored solution for vector indexing and approximate nearest neighbor search." This is PlanetScale's positioning wiki-captured across the 2024-10 / 2026-03 / 2025-10 vector corpus.
Source¶
- Original: https://planetscale.com/blog/larger-than-ram-vector-indexes-for-relational-databases
- Raw markdown:
raw/planetscale/2026-04-21-larger-than-ram-vector-indexes-for-relational-databases-c27b7c54.md
Related¶
- systems/planetscale
- systems/mysql
- systems/innodb
- systems/spann
- systems/spfresh
- systems/hnsw
- systems/rocksdb
- concepts/larger-than-ram-vector-index
- concepts/transactional-vector-index
- concepts/incremental-vector-index
- concepts/hnsw-index
- concepts/spann-index
- concepts/spfresh-index
- concepts/ann-index
- concepts/vector-similarity-search
- concepts/composite-index
- patterns/hybrid-tree-graph-ann-index
- patterns/vector-index-inside-storage-engine