
PLANETSCALE 2024-10-22


PlanetScale — Announcing the PlanetScale vectors public beta

Summary

PlanetScale announces the open beta of vector search and storage in its MySQL-fork product. Vector columns and indexes live inside the MySQL schema alongside relational data and are queryable via full SQL (JOIN, WHERE, subqueries) with ACID compliance, pre- and post-filtering, and transactional mutation of the vector index as part of the SQL commit.

The implementation is anchored on two Microsoft Research papers: SPANN (Space-Partitioned Approximate Nearest Neighbors), a hybrid tree + graph ANN index designed for larger-than-RAM indexes with SSD residency, and SPFresh, a set of concurrent background maintenance operations on top of SPANN that allow the index to be continuously updated without losing recall or query performance. PlanetScale extends SPFresh with transactional semantics and integrates it inside InnoDB (MySQL's default storage engine). Indexes are stored and managed on-disk by InnoDB: they stay in sync with the table, survive crashes with strong consistency, never need periodic rebuilds, and scale into terabytes.

The post rejects both HNSW ("very good query performance, but struggles to scale because it needs to fit its whole dataset in RAM", with indexes that "cannot be updated incrementally, so they require periodically re-building the index") and DiskANN (scales well but "suffers from worse query performance", and its incremental updates "are not particularly efficient and are hard to map to transactional SQL semantics") as inadequate for a general-purpose relational database. Vector support is enabled at the branch level via beta enrolment.

Key takeaways

  1. Vector storage + relational storage collapse into one MySQL-protocol substrate. PlanetScale's thesis: "you can store your vector data alongside your application's relational MySQL data — eliminating the need for a separate specialized vector database." A canonical wiki instance of the vector-column-in-an-OLTP-database architectural direction (sibling of pgvector-on-Postgres), as opposed to the specialised-sidecar-store direction (Pinecone, Weaviate, Milvus, Cloudflare Vectorize, S3 Vectors). (Source: verbatim announcement thesis.)

  2. HNSW and DiskANN rejected on structural grounds. PlanetScale's public reasoning: "HNSW has very good query performance, but struggles to scale because it needs to fit its whole dataset in RAM. Most importantly, HNSW indexes cannot be updated incrementally, so they require periodically re-building the index with the underlying vector data. This is just not a good fit for a relational database. DiskANN scales well, but suffers from worse query performance, and while it can be modified to allow incremental updates, these are not particularly efficient and are hard to map to transactional SQL semantics." First canonical wiki statement of the incremental-update requirement + transactional-SQL-semantics requirement as the two axes that disqualify the two mainstream ANN families for use inside a relational engine.

  3. SPANN: hybrid tree + graph, SSD-resident by design. SPANN is "a hybrid vector indexing and search algorithm that uses both graph and tree structures, and was specifically designed to work well for larger-than-RAM indexes that require SSD usage." The design target — larger-than-RAM-with-SSD-residency — is exactly the problem shape HNSW struggles with and the problem shape any general-purpose relational database will encounter as soon as the vector column grows beyond buffer-pool capacity. See systems/spann.

  4. SPFresh: concurrent background maintenance on top of SPANN. SPFresh extends SPANN with "a set of concurrent background maintenance operations that allow the index to be continuously updated without losing recall or query performance." The key architectural property: the index is continuously updated rather than periodically rebuilt — consistent with OLTP write cadence. See systems/spfresh.

  5. PlanetScale's extension: transactional SPFresh inside InnoDB. "For our implementation, we have extended SPFresh by adding transactional support to all its operations and fully integrating it inside InnoDB, MySQL's default storage engine. This means that inserts, updates, and deletes of vector data are immediately reflected in the vector index as part of committing your SQL transaction, and follow the same transactional semantics, including support for batch commits and rollbacks." Canonical wiki instance of patterns/vector-index-inside-storage-engine — the vector index is a first-class durable structure owned by the storage engine, not a sidecar.

  6. Three derived correctness / operability properties. Because SPFresh is integrated inside InnoDB: (a) indexes are fully managed and stored on-disk by InnoDB, so they are "always in-sync with the vector data in your tables"; (b) they "survive process crashes with strong consistency guarantees"; (c) they "do not need to be periodically rebuilt" and "scale all the way into terabytes, just like any other MySQL table." The operational wins are direct consequences of the architectural choice in takeaway 5.

  7. Vitess + InnoDB together enable sharded vector indexes. "Together with Vitess, PlanetScale's sharding layer, this allows the construction and efficient querying of huge vector indexes that are fully integrated with all the relational data in your database and can be used with JOINs and WHERE clauses while the underlying vector data is continuously updated." First wiki datum on sharded transactional vector indexes composed from a MySQL-protocol storage substrate (InnoDB hosting SPFresh) + a horizontal-sharding layer (Vitess).

  8. Full SQL feature set is retained. Named capabilities: pre-filtering + post-filtering, full SQL syntax including JOIN, WHERE, and subqueries, ACID compliance. The post frames each as a deliberate correctness target "that our base implementation checks all of." Significant because many specialised vector stores expose only a subset (typically post-filter-only, top-K-with-metadata-filter-only, eventually-consistent writes).

  9. Beta enrolment is per-branch. PlanetScale's branching model (schema / data branches for safe deployment) means vector support is enabled on a specific branch rather than cluster-wide — consistent with the broader branch model PlanetScale uses for schema changes.
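The pre-filtering / post-filtering distinction named in takeaway 8 can be illustrated with a brute-force toy. This is a minimal sketch, not PlanetScale's query planner: the `ROWS` data, the `category` predicate, and the function names are all hypothetical, and exact sorting stands in for an ANN index.

```python
import math

# Hypothetical rows: (id, category, embedding). Not PlanetScale's schema.
ROWS = [
    (1, "shoes", (0.0, 0.0)),
    (2, "hats", (0.1, 0.0)),
    (3, "hats", (0.2, 0.0)),
    (4, "shoes", (0.9, 0.0)),
    (5, "shoes", (1.0, 0.0)),
]


def pre_filter(rows, query, category, k):
    """Apply the WHERE-style predicate first, then rank survivors by distance."""
    candidates = [r for r in rows if r[1] == category]
    ranked = sorted(candidates, key=lambda r: math.dist(r[2], query))
    return [r[0] for r in ranked[:k]]


def post_filter(rows, query, category, k):
    """Take the k nearest rows first, then drop those failing the predicate."""
    nearest = sorted(rows, key=lambda r: math.dist(r[2], query))[:k]
    return [r[0] for r in nearest if r[1] == category]


print(pre_filter(ROWS, (0.0, 0.0), "shoes", 2))   # [1, 4]
print(post_filter(ROWS, (0.0, 0.0), "shoes", 2))  # [1]
```

With the same query and k = 2, post-filtering returns only one row because a non-matching neighbour used up one of the two slots. That shortfall is why a store limited to post-filtering is a weaker surface than one that supports both.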

Architectural numbers

  • Scale target: "scale all the way into terabytes, just like any other MySQL table." No per-index or per-vector concrete numbers disclosed at beta.
  • Node size / dimension / recall percentiles: not disclosed.
  • Build / query latency: not disclosed.
  • Memory footprint: not disclosed (beyond the larger-than-RAM design claim).
  • Target hardware: SSD implicit (SPANN design).

Systems extracted

  • systems/planetscale — product surface (branch-level beta enrolment, MySQL-protocol frontend).
  • systems/vitess — sharding layer, composes with transactional SPFresh indexes for sharded vector search.
  • systems/mysql — MySQL wire protocol + SQL surface.
  • systems/innodb — the storage engine that owns the vector index on disk, provides MVCC + crash recovery + buffer pool to the SPFresh integration.
  • systems/spann — hybrid tree + graph ANN index from Microsoft Research designed for SSD-resident larger-than-RAM corpora.
  • systems/spfresh — concurrent background maintenance layer on top of SPANN.
  • systems/hnsw — rejected alternative (RAM-bound, no incremental update).
  • systems/diskann — rejected alternative (worse query latency, hard to map to transactional SQL).
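To make the systems/spann entry concrete, here is a heavily simplified sketch of the partitioned idea: a small set of centroids stays in memory, and each centroid owns a posting list that would live on SSD. Everything here is an assumption for illustration. The `PartitionedIndex` class and the `nprobe` parameter are hypothetical names, the centroids are supplied by hand, and a linear scan over centroids replaces the tree + graph navigation the paper describes.

```python
import heapq
import math


class PartitionedIndex:
    """Toy partitioned ANN index: in-memory centroids, per-centroid posting lists."""

    def __init__(self, centroids):
        self.centroids = list(centroids)  # small, kept in memory
        # Stand-in for on-disk posting lists, one per centroid.
        self.postings = {i: [] for i in range(len(self.centroids))}

    def _nearest_centroids(self, vec, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: math.dist(self.centroids[i], vec),
        )
        return order[:n]

    def insert(self, row_id, vec):
        # Assign the vector to the posting list of its closest centroid.
        cid = self._nearest_centroids(vec, 1)[0]
        self.postings[cid].append((row_id, vec))

    def search(self, query, k, nprobe=1):
        # Read only the posting lists of the nprobe closest centroids.
        candidates = []
        for cid in self._nearest_centroids(query, nprobe):
            candidates.extend(self.postings[cid])
        best = heapq.nsmallest(k, candidates, key=lambda item: math.dist(item[1], query))
        return [row_id for row_id, _ in best]


index = PartitionedIndex([(0.0, 0.0), (10.0, 10.0)])
for row_id, vec in [(1, (0.5, 0.5)), (2, (1.0, 1.0)), (3, (9.0, 9.0)), (4, (4.9, 4.9))]:
    index.insert(row_id, vec)

print(index.search((5.5, 5.5), k=1, nprobe=1))  # [3]
print(index.search((5.5, 5.5), k=1, nprobe=2))  # [4]
```

Probing one partition misses the true nearest neighbour (row 4 sits just across the partition boundary); probing two finds it. The trade between posting lists read and recall is the knob this family of indexes exposes.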

Concepts extracted

Patterns extracted

  • patterns/vector-index-inside-storage-engine — new wiki pattern: put the ANN index inside the durable engine alongside B+tree / row storage, so it inherits MVCC, crash recovery, buffer-pool caching, sharding, and SQL planning from the host engine instead of reinventing each.
  • patterns/hybrid-tree-graph-ann-index — new wiki pattern (algorithmic): combine a partitioning tree (for SSD residency / pruning) with a graph (for local search quality) to get SSD-scale vector indexes that still preserve recall.
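The behaviour behind patterns/vector-index-inside-storage-engine (index changes become visible at commit and disappear on rollback) can be sketched with a toy table. This is an illustration under loud assumptions: `VectorTable` and `Transaction` are hypothetical names, the "index" is a brute-force dictionary, and nothing here models InnoDB's MVCC, logging, or crash recovery, which the post does not detail.

```python
import math


class VectorTable:
    """Toy table whose rows and vector index change together at commit."""

    def __init__(self):
        self.rows = {}   # row_id -> payload
        self.index = {}  # row_id -> embedding (brute-force stand-in for an ANN index)

    def begin(self):
        return Transaction(self)

    def nearest(self, query, k):
        # Only committed entries are visible to searches.
        ranked = sorted(self.index, key=lambda rid: math.dist(self.index[rid], query))
        return ranked[:k]


class Transaction:
    def __init__(self, table):
        self.table = table
        self.pending = []  # staged (op, row_id, payload, embedding) tuples

    def put(self, row_id, payload, embedding):
        self.pending.append(("put", row_id, payload, embedding))

    def delete(self, row_id):
        self.pending.append(("delete", row_id, None, None))

    def commit(self):
        # Row storage and the vector index are updated in the same step.
        for op, row_id, payload, embedding in self.pending:
            if op == "put":
                self.table.rows[row_id] = payload
                self.table.index[row_id] = embedding
            else:
                self.table.rows.pop(row_id, None)
                self.table.index.pop(row_id, None)
        self.pending.clear()

    def rollback(self):
        # Staged changes never reach the rows or the index.
        self.pending.clear()


table = VectorTable()
t1 = table.begin()
t1.put(1, "a", (0.0, 0.0))
t1.put(2, "b", (1.0, 1.0))
before_commit = table.nearest((0.0, 0.0), 5)
t1.commit()
after_commit = table.nearest((0.0, 0.0), 5)

t2 = table.begin()
t2.put(3, "c", (0.1, 0.1))
t2.rollback()
after_rollback = table.nearest((0.0, 0.0), 5)

print(before_commit, after_commit, after_rollback)  # [] [1, 2] [1, 2]
```

The point of the sketch is the single `commit` path: because rows and index entries are applied together, a search can never see an embedding whose row was rolled back, which is the "always in-sync" property the post attributes to the InnoDB integration.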

Caveats / gaps

  • Announcement voice, not production retrospective — no production scale numbers (query QPS, p99 latency, corpus size, dimension, index build times, recall at K, GPU / CPU / memory budget).
  • Beta, not GA: PlanetScale explicitly flags "we will continue to improve performance leading up to GA." The current implementation may change before GA.
  • No head-to-head benchmarks disclosed — the HNSW / DiskANN rejection is structural (RAM-residency, incrementality, transactional-semantics compatibility) rather than empirical; no numbers shown comparing PlanetScale's SPFresh integration vs. HNSW or DiskANN on any workload.
  • Transactional SPFresh details not disclosed: the post names "adding transactional support to all its operations" but does not specify how concurrent SPFresh background maintenance operations are serialised against user SQL transactions, what the consistency semantics are for an in-flight rollback on a partial SPFresh update, or what happens when buffer-pool eviction collides with a SPFresh split/merge. All deferred.
  • No disclosed dimension ceiling / metric-support table — which distance metrics (cosine / L2 / inner product / Hamming) are supported, and what the dimension ceiling is, are not named in the post.
  • No pricing / SKU disclosure: beta enrolment is free-form; GA pricing not mentioned.
  • No integration story for ML toolchains — embedding generation (Transformers / CLIP / OpenAI) is the producer side; the post is silent on whether PlanetScale provides embedding-generation conveniences or leaves it to the app.
  • Vitess sharding + SPFresh interaction glossed over — how does a SPFresh index sharded across Vitess tablets handle cross-shard top-K queries? Scatter-gather with per-shard top-K and merge? Any cross-shard recall degradation? Not disclosed.
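The scatter-gather question raised in the last caveat can be made concrete with a generic sketch. The post does not say how Vitess and the vector index actually handle cross-shard queries, so this shows only the textbook approach: each shard returns its own top-K, and a coordinator merges the sorted partial results. The `SHARDS` layout and both function names are hypothetical.

```python
import heapq
import math
from itertools import islice

# Hypothetical shards, each a list of (row_id, embedding).
SHARDS = [
    [(1, (0.0, 0.0)), (2, (3.0, 0.0))],
    [(3, (1.0, 0.0)), (4, (2.0, 0.0))],
]


def shard_top_k(shard_rows, query, k):
    """Per-shard step: the k nearest (distance, row_id) pairs, sorted ascending."""
    scored = ((math.dist(vec, query), row_id) for row_id, vec in shard_rows)
    return heapq.nsmallest(k, scored)


def scatter_gather_top_k(shards, query, k):
    """Coordinator step: merge the sorted per-shard lists and keep the global top k."""
    partials = [shard_top_k(rows, query, k) for rows in shards]
    merged = heapq.merge(*partials)  # valid because each partial is already sorted
    return [row_id for _, row_id in islice(merged, k)]


print(scatter_gather_top_k(SHARDS, (0.0, 0.0), 3))  # [1, 3, 4]
```

When every shard returns an exact top-K, the merged result is the exact global top-K. With approximate per-shard search, the merged result can only be as good as each shard's own recall, which is why the undisclosed cross-shard behaviour matters.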

Source
