PATTERN
LSM-emulation on B-tree via composite index¶
Problem¶
Some index workloads want append-to-value semantics. The canonical example: inserting a new vector into a SPFresh-style posting list that already stores thousands of vectors concatenated into a single blob.
On a B+tree storage engine like InnoDB, appending to a blob is catastrophic:
- The blob is either stored inline in the leaf page or off-page (InnoDB calls these LOBs — Large OBjects).
- Either way, the leaf page / LOB layout leaves no tail space — "after each blob there's always … huh, something else" (Source: sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases).
- `UPDATE ... SET blob = CONCAT(blob, :new)` doesn't append in place — InnoDB copies the old blob, reserves tail space, and writes back a new blob, leaving the old blob to be garbage-collected.
- The read-then-write is implicitly row-locked for the duration.
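The shape of the naive read-modify-write can be sketched with SQLite standing in for InnoDB (the table and column names here are hypothetical, and SQLite's internals differ, but the statement forces the same pattern: the whole value is read, rebuilt, and rewritten):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posting (list_id INTEGER PRIMARY KEY, data TEXT)")
con.execute("INSERT INTO posting VALUES (1, ?)", ("v0",))

# The "append": the engine must read the old value, build the full
# concatenation, and write the entire new value back under a row lock.
con.execute(
    "UPDATE posting SET data = data || ? WHERE list_id = 1",
    ("|v1",),
)
```

Every append pays for the full size of the existing value, so cost grows with each insert instead of staying constant.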
LSM storage engines (RocksDB / MyRocks) solve this elegantly via a merge flag on key/value pairs: inserts to the memtable tagged with merge are concatenated (not replaced) during compaction, producing append-to-value with zero read cost and no locking.
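A minimal in-memory toy model of the merge-flag idea (an illustration of the semantics, not RocksDB's actual API): merge writes are logged without reading the current value, and compaction folds the operands onto the base value in write order:

```python
class MergeLSM:
    """Toy model of LSM merge semantics: appends never read."""

    def __init__(self):
        self.memtable = []  # ordered log of (key, op, value)

    def put(self, key, value):
        self.memtable.append((key, "put", value))

    def merge(self, key, value):
        # Append-to-value with zero read cost: just log the operand.
        self.memtable.append((key, "merge", value))

    def compact(self):
        # Compaction concatenates merge operands onto the base value.
        merged = {}
        for key, op, value in self.memtable:
            if op == "put":
                merged[key] = value
            else:
                merged[key] = merged.get(key, b"") + value
        self.memtable = [(k, "put", v) for k, v in merged.items()]
        return merged

db = MergeLSM()
db.put("list:1", b"v0")
db.merge("list:1", b"v1")
db.merge("list:1", b"v2")
```

The write path is an O(1) log append; all concatenation cost is deferred to compaction, which runs off the hot path.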
But switching storage engines from InnoDB to MyRocks to unlock this property is not an option for a general-purpose database vendor — "the average CRUD app running on PlanetScale just performs better on InnoDB than on MyRocks, and forcing users to switch storage engines … seemed like an unacceptable barrier to adoption."
Solution¶
Emulate LSM merge-flag semantics on a B-tree using a composite index: make the B-tree key a tuple `(logical_id, sequence)`, where `sequence` is a strictly-increasing table-local counter.
- Insert = create a new row with the target `logical_id` + next `sequence`. No read of existing data, no row-level lock contention.
- Read = range scan on the key prefix `logical_id`; the result is the concatenation of all row values with that prefix, implicitly ordered by `sequence`.
The composite B-tree key gives you:
- Fast append. A single tree insert, no blob rewrite.
- Fast read. Range scans on a B-tree prefix are among the cheapest operations — "B-trees are very good at range scans, because adjacent keys are stored next to each other in the pages."
- Reuse of the existing transactional substrate. InnoDB MVCC + WAL + crash recovery apply as-is to the appended rows.
The cost shifts from blob-rewrite-on-write to range-scan-on-read. In practice, the range scan is bounded by the posting-list size (which separate background maintenance keeps uniform), so query cost stays roughly constant.
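The whole scheme can be sketched with SQLite standing in for InnoDB (table name, column names, and the in-process counter are illustrative assumptions): the composite primary key `(logical_id, seq)` makes append a single tree insert and read a prefix range scan:

```python
import itertools
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE posting (
           logical_id INTEGER,
           seq        INTEGER,   -- table-local, strictly increasing
           chunk      TEXT,
           PRIMARY KEY (logical_id, seq)
       )"""
)
seq = itertools.count(1)  # stands in for the table-local sequence counter

def append(logical_id, chunk):
    # Append = one insert keyed by (logical_id, next seq).
    # No read of the existing posting list, no blob rewrite.
    con.execute("INSERT INTO posting VALUES (?, ?, ?)",
                (logical_id, next(seq), chunk))

def read(logical_id):
    # Read = range scan on the key prefix; the composite key
    # returns fragments already ordered by seq.
    rows = con.execute(
        "SELECT chunk FROM posting WHERE logical_id = ? ORDER BY seq",
        (logical_id,))
    return "".join(r[0] for r in rows)

append(7, "v0|")
append(9, "x|")
append(7, "v1|")
```

Note that interleaved appends to different posting lists (7 and 9 above) don't disturb each other: each read reassembles only the rows under its own key prefix.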
Canonical wiki instance: PlanetScale SPFresh posting list layout¶
PlanetScale's 2025-10-01 engineering deep-dive describes the exact mechanism:
"Our composite index uses key a tuple of
(head_vector_id, sequence), wheresequenceis a table-local strictly increasing sequence number. With this, we can approach posting list updates in a very efficient way: we insert a new row into the table with thehead_vector_idwe're appending to and the next sequence number for the table. This is incredibly friendly to the B-tree compared to the naive approach of looking up thehead_vector_id, appending the data to the existing posting list, and then writing it back to the B-tree."— Vicent Martí, PlanetScale. (Source: sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases)
The key insight preserves the vector-index-inside-storage-engine choice without switching to an LSM. SPFresh's append-heavy shape (inserts + reassignments both append to posting lists) benefits the most from this emulation.
Trade-offs¶
- Row count can grow unbounded. Continuous appends without compaction grow the underlying row count per `logical_id`, slowly degrading range-scan cost. A companion background defragment / merge pass is required to periodically rewrite posting-list rows into a compact form. PlanetScale calls this SPFresh's Defragment op.
- No tombstone-for-removal. Removing a vector from the posting list still requires a rewrite — but it can be deferred by using version tombstones, so the removal itself is also an append. The composite-index + version-tombstone composition is what makes PlanetScale's posting lists fully append-only on the hot path.
- Sequence counter becomes a contention point if the whole index shares one monotonic counter. Table-local sequences scoped to the vector-index table reduce but do not eliminate this.
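One way to sketch the companion defragment pass (an illustrative SQLite model; the source does not describe PlanetScale's Defragment op at this level of detail, so table layout and names are assumptions): collapse every fragment of a posting list into a single row inside one transaction, so readers never observe a partial list:

```python
import itertools
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE posting (logical_id INTEGER, seq INTEGER, chunk TEXT,"
    " PRIMARY KEY (logical_id, seq))"
)
seq = itertools.count(1)
for lid, chunk in [(1, "a"), (1, "b"), (1, "c")]:
    con.execute("INSERT INTO posting VALUES (?, ?, ?)", (lid, next(seq), chunk))

def defragment(logical_id):
    # Background pass: read all fragments, rewrite them as one compact
    # row, bounding future range-scan cost for this posting list.
    rows = con.execute(
        "SELECT chunk FROM posting WHERE logical_id = ? ORDER BY seq",
        (logical_id,)).fetchall()
    compact = "".join(r[0] for r in rows)
    with con:  # single transaction: delete + reinsert atomically
        con.execute("DELETE FROM posting WHERE logical_id = ?", (logical_id,))
        con.execute("INSERT INTO posting VALUES (?, ?, ?)",
                    (logical_id, next(seq), compact))

defragment(1)
```

Because the compact row is reinserted under a fresh sequence number, appends that race with the defragment pass simply land after it in key order.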
Related / contrast¶
- LSM with merge flag (MyRocks / RocksDB) — the direct inspiration; zero-read-cost appends with atomic compaction. Better write throughput. Worse point-lookup and range-scan performance, which is the reason InnoDB was preferred.
- Single-value-per-key B-tree (stock `UPDATE blob = CONCAT(...)`) — the naive approach this pattern replaces. Read-then-write + row lock + full blob copy.
- Composite index — the generic B-tree feature this pattern weaponises for a specific workload shape.
- patterns/vector-index-inside-storage-engine — the meta-pattern this composes inside; the LSM-emulation is the missing piece that lets SPFresh-style append workloads live inside InnoDB without switching engines.
Seen in¶
- sources/2026-04-21-planetscale-larger-than-ram-vector-indexes-for-relational-databases — canonical wiki source. Vicent Martí's PlanetScale engineering deep-dive names the composite-index-on-B-tree trick as the core invention that avoids the MyRocks migration.