PATTERN Cited by 1 source
Vector index inside storage engine¶
Problem¶
Vector similarity search is increasingly a first-class feature of production systems (semantic search, RAG, recommendation). The default integration shape is a specialised sidecar store — Pinecone, Weaviate, Milvus, Vectorize, S3 Vectors — that lives next to the relational database and exposes its own API.
Sidecars have three structural costs:
- Dual-write consistency. Every mutation has to land in both stores; partial failures leak through.
- Transactional semantics breakdown. The vector index is outside the host database's transaction; rollback + crash recovery don't extend to it.
- Query composition friction.
JOINbetween relational data and vector search has to be expressed at the application layer — the optimiser doesn't see both sides.
Solution¶
Put the vector index inside the host database's storage engine, so the index is a first-class durable structure managed by the engine alongside B+tree rows. The index inherits:
- MVCC / transactional semantics — mutations commit and rollback with the row change.
- Crash recovery — the engine's WAL covers index pages.
- Buffer-pool caching — hot index pages are cached alongside row pages.
- Sharding — the engine's sharding layer shards the vector index too.
- SQL query composition — the optimiser can plan
JOIN/WHERE/ subquery across relational and vector data inside a single query.
This collapses the sidecar's structural costs into one storage substrate.
Canonical wiki instance: PlanetScale vectors¶
PlanetScale's 2024-10-22 vectors public-beta announcement implements this pattern by integrating a transactional extension of SPFresh — the continuously-updatable descendant of SPANN — inside InnoDB:
"For our implementation, we have extended SPFresh by adding transactional support to all its operations and fully integrating it inside InnoDB, MySQL's default storage engine. … Since the indexes are fully managed and stored on-disk by InnoDB, they are always in-sync with the vector data in your tables, they survive process crashes with strong consistency guarantees, they do not need to be periodically rebuilt, and they scale all the way into terabytes, just like any other MySQL table. Together with Vitess, PlanetScale's sharding layer, this allows the construction and efficient querying of huge vector indexes that are fully integrated with all the relational data in your database and can be used with
JOINs andWHEREclauses while the underlying vector data is continuously updated."— sources/2024-10-22-planetscale-planetscale-vectors-public-beta
Structural prerequisites for a hosted ANN algorithm¶
Not every ANN algorithm fits inside a storage engine. The algorithm needs:
- SSD-residency / page-oriented layout — so it maps to the engine's page cache ( buffer pool) rather than assuming full-RAM residency (disqualifies HNSW).
- Continuously-updatable — so mutations can ride the commit path instead of requiring periodic offline rebuilds (disqualifies stock HNSW and SPANN).
- Transactional semantics — or at least a maintenance model that can be retrofitted with transactional semantics without structural redesign (disqualifies DiskANN's incremental-update model per PlanetScale).
SPANN (SSD-resident, posting-list-structured) + SPFresh (continuously-updatable via background maintenance) is the composition that satisfies all three, which is why PlanetScale chose it.
Trade-offs¶
- Implementation effort is substantial. PlanetScale's transactional extension of SPFresh and InnoDB integration is a multi-year engineering programme — the beta announcement flags continued performance work up to GA.
- The host engine's throughput / storage characteristics bound the vector index. If InnoDB's write path is the bottleneck for relational writes, adding vector mutations is additive pressure.
- Not all deployments want coupling. When the vector index is large + mostly read-only + infrequently changed, the dual-write sidecar isn't painful and a specialised vector store may outperform.
Sibling / related patterns¶
- patterns/hybrid-tree-graph-ann-index — the algorithmic pattern (SPANN / SPFresh family) that makes this storage-engine integration viable.
Seen in¶
- sources/2024-10-22-planetscale-planetscale-vectors-public-beta — canonical wiki instance. PlanetScale integrates transactional SPFresh inside InnoDB, composed with Vitess sharding, yielding a first-class vector column with full SQL semantics and ACID.