Skip to content

PATTERN Cited by 1 source

Vector index inside storage engine

Problem

Vector similarity search is increasingly a first-class feature of production systems (semantic search, RAG, recommendation). The default integration shape is a specialised sidecar store — Pinecone, Weaviate, Milvus, Vectorize, S3 Vectors — that lives next to the relational database and exposes its own API.

Sidecars have three structural costs:

  1. Dual-write consistency. Every mutation has to land in both stores; partial failures leak through.
  2. Transactional semantics breakdown. The vector index is outside the host database's transaction; rollback + crash recovery don't extend to it.
  3. Query composition friction. JOIN between relational data and vector search has to be expressed at the application layer — the optimiser doesn't see both sides.

Solution

Put the vector index inside the host database's storage engine, so the index is a first-class durable structure managed by the engine alongside B+tree rows. The index inherits:

  • MVCC / transactional semantics — mutations commit and rollback with the row change.
  • Crash recovery — the engine's WAL covers index pages.
  • Buffer-pool caching — hot index pages are cached alongside row pages.
  • Sharding — the engine's sharding layer shards the vector index too.
  • SQL query composition — the optimiser can plan JOIN / WHERE / subquery across relational and vector data inside a single query.

This collapses the sidecar's structural costs into one storage substrate.

Canonical wiki instance: PlanetScale vectors

PlanetScale's 2024-10-22 vectors public-beta announcement implements this pattern by integrating a transactional extension of SPFresh — the continuously-updatable descendant of SPANN — inside InnoDB:

"For our implementation, we have extended SPFresh by adding transactional support to all its operations and fully integrating it inside InnoDB, MySQL's default storage engine. … Since the indexes are fully managed and stored on-disk by InnoDB, they are always in-sync with the vector data in your tables, they survive process crashes with strong consistency guarantees, they do not need to be periodically rebuilt, and they scale all the way into terabytes, just like any other MySQL table. Together with Vitess, PlanetScale's sharding layer, this allows the construction and efficient querying of huge vector indexes that are fully integrated with all the relational data in your database and can be used with JOINs and WHERE clauses while the underlying vector data is continuously updated."

sources/2024-10-22-planetscale-planetscale-vectors-public-beta

Structural prerequisites for a hosted ANN algorithm

Not every ANN algorithm fits inside a storage engine. The algorithm needs:

  1. SSD-residency / page-oriented layout — so it maps to the engine's page cache ( buffer pool) rather than assuming full-RAM residency (disqualifies HNSW).
  2. Continuously-updatable — so mutations can ride the commit path instead of requiring periodic offline rebuilds (disqualifies stock HNSW and SPANN).
  3. Transactional semantics — or at least a maintenance model that can be retrofitted with transactional semantics without structural redesign (disqualifies DiskANN's incremental-update model per PlanetScale).

SPANN (SSD-resident, posting-list-structured) + SPFresh (continuously-updatable via background maintenance) is the composition that satisfies all three, which is why PlanetScale chose it.

Trade-offs

  • Implementation effort is substantial. PlanetScale's transactional extension of SPFresh and InnoDB integration is a multi-year engineering programme — the beta announcement flags continued performance work up to GA.
  • The host engine's throughput / storage characteristics bound the vector index. If InnoDB's write path is the bottleneck for relational writes, adding vector mutations is additive pressure.
  • Not all deployments want coupling. When the vector index is large + mostly read-only + infrequently changed, the dual-write sidecar isn't painful and a specialised vector store may outperform.

Seen in

Last updated · 319 distilled / 1,201 read