
PATTERN Cited by 1 source

LSM-emulation on B-tree via composite index

Problem

Some index workloads want append-to-value semantics. The canonical example: inserting a new vector into a SPFresh-style posting list that already holds thousands of vectors concatenated together as a single blob.

On a B+tree storage engine like InnoDB, appending to a blob is catastrophic:

  • The blob is either stored inline in the leaf page or stored off-page in overflow pages (InnoDB calls these LOBs — large objects).
  • Either way, the leaf page / LOB layout leaves no tail space — "after each blob there's always … huh, something else" (Source: sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases).
  • The UPDATE ... SET blob = CONCAT(blob, :new) SQL doesn't append in place — InnoDB reads the old blob, builds the concatenated copy, and writes it back as a new blob, leaving the old one to be garbage-collected.
  • The read-then-write is implicitly row-locked for the duration.
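The read-modify-write cost is easy to see in miniature. Below is a hypothetical sketch using SQLite in place of InnoDB (schema, names, and sizes are all illustrative, not from the source): every append forces the engine to materialise the old value and rewrite it whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posting (head_vector_id INTEGER PRIMARY KEY, vecs TEXT)")
conn.execute("INSERT INTO posting VALUES (1, ?)", ("v" * 2000,))

def naive_append(head_id: int, new_vec: str) -> None:
    # Read-then-write: the engine copies the whole old value, appends the
    # new bytes, and writes the result back as a fresh value.
    conn.execute(
        "UPDATE posting SET vecs = vecs || ? WHERE head_vector_id = ?",
        (new_vec, head_id),
    )

naive_append(1, "w2")
(size,) = conn.execute(
    "SELECT length(vecs) FROM posting WHERE head_vector_id = 1"
).fetchone()
print(size)  # 2002: a 2-byte append rewrote a 2000-byte value
```

SQLite doesn't share InnoDB's page layout, but the shape of the cost is the same: the write amplification is proportional to the existing blob, not to the appended data.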

LSM storage engines (RocksDB / MyRocks) solve this elegantly via a merge flag on key/value pairs: writes to the memtable tagged with merge are concatenated with the existing value (rather than replacing it) during compaction, producing append-to-value semantics with zero read cost and no locking.
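A toy model of those merge semantics (not RocksDB's actual API — class and method names here are invented for illustration): merge-tagged writes are buffered append-only and folded into the base value only at compaction time.

```python
MERGE, PUT = "merge", "put"

class ToyLSM:
    """Minimal in-memory model of LSM merge-flag semantics."""

    def __init__(self):
        self.memtable = []  # append-only log of (op, key, value)
        self.sstable = {}   # compacted key -> value

    def merge(self, key: bytes, value: bytes) -> None:
        # O(1) append; no read of the existing value, no lock.
        self.memtable.append((MERGE, key, value))

    def compact(self) -> None:
        # Merge operands are concatenated onto the base value, not replacing it.
        for op, key, value in self.memtable:
            if op == MERGE:
                self.sstable[key] = self.sstable.get(key, b"") + value
            else:
                self.sstable[key] = value
        self.memtable.clear()

db = ToyLSM()
db.sstable[b"list:1"] = b"v0"
db.merge(b"list:1", b"v1")
db.merge(b"list:1", b"v2")
db.compact()
print(db.sstable[b"list:1"])  # b'v0v1v2'
```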

But switching storage engines from InnoDB to MyRocks to unlock this property is not an option for a general-purpose database vendor — "the average CRUD app running on PlanetScale just performs better on InnoDB than on MyRocks, and forcing users to switch storage engines … seemed like an unacceptable barrier to adoption."

Solution

Emulate LSM merge-flag semantics on a B-tree using a composite index: make the B-tree key a tuple (logical_id, sequence) where sequence is a strictly-increasing table-local counter.

  • Insert = create a new row with the target logical_id + next sequence. No read of existing data, no row-level lock contention.
  • Read = range scan on the key prefix logical_id; the result is the concatenation of all row values with that prefix, implicitly ordered by sequence.
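A minimal sketch of the scheme, again using SQLite as a stand-in B-tree engine (table and column names follow the source's description; the counter implementation is an assumption):

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE posting (
           head_vector_id INTEGER,
           sequence       INTEGER,
           data           BLOB,
           PRIMARY KEY (head_vector_id, sequence)  -- the composite B-tree key
       )"""
)
next_seq = itertools.count()  # stand-in for a table-local monotonic counter

def append(head_id: int, data: bytes) -> None:
    # Insert-only: no read of existing rows, no blob rewrite, no row lock held
    # across a read-modify-write.
    conn.execute(
        "INSERT INTO posting VALUES (?, ?, ?)", (head_id, next(next_seq), data)
    )

def read(head_id: int) -> bytes:
    # Prefix range scan on the composite key; the ORDER BY is satisfied by the
    # index itself, so rows come back in append order.
    rows = conn.execute(
        "SELECT data FROM posting WHERE head_vector_id = ? ORDER BY sequence",
        (head_id,),
    )
    return b"".join(r[0] for r in rows)

append(1, b"v0")
append(1, b"v1")
append(2, b"w0")
append(1, b"v2")
print(read(1))  # b'v0v1v2'
```

In SQLite a `WITHOUT ROWID` table would make this a true clustered index on the composite key, closer to InnoDB's layout; the plain-table version keeps the sketch short.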

The composite B-tree key gives you:

  • Fast append. A single tree insert, no blob rewrite.
  • Fast read. Range scans on a B-tree prefix are among the cheapest operations — "B-trees are very good at range scans, because adjacent keys are stored next to each other in the pages."
  • Reuse of the existing transactional substrate. InnoDB MVCC + WAL + crash recovery apply as-is to the appended rows.

The cost shifts from blob-rewrite-on-write to range-scan-on-read. In practice, the range scan is bounded by the posting-list size (which separate background maintenance keeps uniform), so query cost stays roughly constant.

Canonical wiki instance: PlanetScale SPFresh posting list layout

PlanetScale's 2025-10-01 engineering deep-dive describes the exact mechanism:

"Our composite index uses key a tuple of (head_vector_id, sequence), where sequence is a table-local strictly increasing sequence number. With this, we can approach posting list updates in a very efficient way: we insert a new row into the table with the head_vector_id we're appending to and the next sequence number for the table. This is incredibly friendly to the B-tree compared to the naive approach of looking up the head_vector_id, appending the data to the existing posting list, and then writing it back to the B-tree."

— Vicent Martí, PlanetScale. (Source: sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases)

The pattern preserves the vector-index-inside-storage-engine choice without switching to an LSM. SPFresh's append-heavy shape (inserts and reassignments both append to posting lists) benefits the most from this emulation.

Trade-offs

  • Row count can grow unbounded. Continuous appends without compaction grow the underlying row count per logical_id, slowly degrading range-scan cost. A companion background defragment / merge pass is required to rewrite posting-list rows periodically into a compact form. PlanetScale calls this SPFresh's Defragment op.
  • No tombstone-for-removal. Removing a vector from the posting list still requires a rewrite — but can be deferred by using version tombstones so the removal itself is also an append. The composite-index + version-tombstone composition is what makes PlanetScale's posting lists fully append-only on the hot path.
  • Sequence counter becomes a contention point if the whole index shares one monotonic counter. Table-local sequences scoped to the vector-index table reduce but do not eliminate this.
Related

  • LSM with merge flag (MyRocks / RocksDB) — the direct inspiration; zero-read-cost appends with atomic compaction. Better write throughput. Worse point-lookup and range-scan performance, which is the reason InnoDB was preferred.
  • Single-value-per-key B-tree (stock UPDATE blob = CONCAT(...)) — the naive approach this pattern replaces. Read-then-write + row lock + full blob copy.
  • Composite index — the generic B-tree feature this pattern weaponises for a specific workload shape.
  • patterns/vector-index-inside-storage-engine — the meta-pattern this composes inside; the LSM-emulation is the missing piece that lets SPFresh-style append workloads live inside InnoDB without switching engines.
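The source doesn't publish the Defragment implementation; as a hypothetical sketch of the idea (schema and names illustrative, SQLite standing in for InnoDB), a background pass can collapse all rows of one logical_id back into a single row, bounding future range-scan cost:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posting (head_vector_id INTEGER, sequence INTEGER, data BLOB,"
    " PRIMARY KEY (head_vector_id, sequence))"
)
# Simulate a posting list fragmented across five appended rows.
conn.executemany(
    "INSERT INTO posting VALUES (?, ?, ?)",
    [(1, s, b"v%d" % s) for s in range(5)],
)

def defragment(head_id: int) -> None:
    # One transaction, so readers observe either the old layout or the new
    # one, never a mix of both.
    with conn:
        rows = conn.execute(
            "SELECT data FROM posting WHERE head_vector_id = ? ORDER BY sequence",
            (head_id,),
        ).fetchall()
        merged = b"".join(r[0] for r in rows)
        conn.execute("DELETE FROM posting WHERE head_vector_id = ?", (head_id,))
        conn.execute("INSERT INTO posting VALUES (?, 0, ?)", (head_id, merged))

defragment(1)
row_count, merged_len = conn.execute(
    "SELECT COUNT(*), MAX(length(data)) FROM posting"
).fetchone()
print(row_count, merged_len)  # 1 10: five 2-byte rows became one 10-byte row
```

Note this pass pays the blob-rewrite cost the hot path avoids — which is exactly why it runs in the background rather than per-append.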

Seen in
