
PLANETSCALE 2024-09-09


PlanetScale — B-trees and database indexes

Summary

PlanetScale's Ben Dicken publishes a pedagogical deep-dive on B-trees and B+trees as the foundational data structure behind most database indexes — including MySQL's InnoDB, Postgres, MongoDB, and DynamoDB. The article walks from B-tree fundamentals (nodes, keys, ordering, search) through B+tree modifications (leaf-only values, linked-list leaves, MySQL-specific N children instead of N+1), then anchors the theory to InnoDB internals: every MySQL table is a B+tree keyed on its primary key with row data in the leaves (clustered index), secondary indexes are separate B+trees keyed on the indexed column with primary-key values in the leaves (secondary index), and the 16-KB node size matches the InnoDB page size — which is the unit of I/O from disk and the unit of cache in the buffer pool. The core architectural argument of the post: primary-key choice changes the on-disk layout of the whole table, and a random/UUID primary key punishes the B+tree on three axes (unpredictable insert path → more pages visited per insert, unpredictable leaf destination → write amplification from node splits across the tree, out-of-order leaf values → range scans fan out across non-adjacent pages). A sequential BIGINT AUTO_INCREMENT avoids all three: inserts follow the right-most path, leaves grow on the right side, and values sit in insertion-order within neighbouring leaves so time-range queries become sequential I/O.

Key takeaways

  1. A B-tree of order K stores up to K sorted keys per node (each with an associated value), keeps every non-root node at least half full (≥ K/2 entries), gives a node holding N keys N+1 children, and keeps all leaves on the same level. Search descends one node per level by binary-searching the keys and following the appropriate child pointer — O(log_K N) node visits. "The fewer levels it has (or the shallower it is), the faster searching can be performed." (Source: article §"B-tree properties + search".)

  2. B-trees match disk physics. HDDs and SSDs read and write in block units — typically 4 KB, 8 KB, or 16 KB ("each node uses a fixed number of bytes. The number of bytes can be tailored to play nicely with disk blocks"). A single B-tree node sized to one disk block is read in one I/O; a three-level tree with 16-KB nodes and 8-byte keys/values/pointers can store "over 300 million key/value pairs (682 × 682 × 682 = 317,214,568)" — visiting 3 nodes per lookup across 300M rows. This is the disk-block-size alignment argument for B-trees over RAM-optimal structures like red-black trees.
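The article's capacity arithmetic is easy to check directly (a quick sketch; the 8-byte key/value/pointer sizing is the article's assumption):

```python
# Fan-out math from the article's worked example: a 16 KB node holding
# 8-byte keys, 8-byte values, and 8-byte child pointers fits
# 16384 // 24 = 682 entries, so a three-level tree reaches 682**3
# key/value pairs while touching at most 3 nodes per lookup.
NODE_BYTES = 16 * 1024
ENTRY_BYTES = 8 + 8 + 8          # key + value + child pointer

fan_out = NODE_BYTES // ENTRY_BYTES
capacity_3_levels = fan_out ** 3

print(fan_out)             # 682
print(capacity_3_levels)   # 317214568
```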

  3. B+trees differ from B-trees in three ways. "Key/value pairs are stored only at the leaf nodes. Non-leaf nodes store only keys and the associated child pointers." Internal nodes fit more keys per node (no value overhead → shallower tree) and leaf nodes form a doubly-linked list (in-order traversal without revisiting internal nodes). MySQL's InnoDB further specialises: non-leaf nodes store N child pointers instead of N+1, and all levels have next/previous pointers so each level is a doubly-linked list. (Source: article §"B+tree".)
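A minimal sketch of why linked leaves help: once the first matching leaf is located, an in-order scan follows sibling pointers instead of re-descending from the root. The Leaf class and range_scan helper below are illustrative, not InnoDB's actual structures:

```python
# Toy B+tree leaf with a `next` pointer to its right sibling.
class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys stored in this leaf
        self.next = None      # right sibling in the leaf linked list

# Three leaves chained left-to-right; a real scan would first descend
# the tree to find the leaf containing the range's lower bound.
a, b, c = Leaf([1, 2, 3]), Leaf([4, 5, 6]), Leaf([7, 8, 9])
a.next, b.next = b, c

def range_scan(start_leaf, lo, hi):
    out, leaf = [], start_leaf
    while leaf is not None:
        out += [k for k in leaf.keys if lo <= k <= hi]
        if leaf.keys and leaf.keys[-1] > hi:
            break             # past the range; no need to follow `next`
        leaf = leaf.next
    return out

print(range_scan(a, 2, 8))  # [2, 3, 4, 5, 6, 7, 8]
```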

  4. InnoDB's clustered index IS the table. "[InnoDB] actually stores all table data in a B+tree, with the table's primary key used as the tree key." The CREATE TABLE primary-key declaration is the key of the B+tree; the values in the leaves are "the remaining column values for each row, and are stored only in the leaf nodes." Narrow tables (few columns) pack hundreds of rows per 16-KB leaf; wide tables may fit single-digit row counts per leaf. This is a clustered index — the primary-key index and the table are the same structure.
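The rows-per-leaf claim is simple division (a rough sketch that ignores InnoDB page headers, row headers, and fill factor, so real counts are lower):

```python
# Rough rows-per-16-KB-leaf estimate for a clustered index, assuming
# hypothetical average row widths (the 64 B and 4000 B figures are
# illustrative, not from the article).
PAGE_BYTES = 16 * 1024

def rows_per_leaf(row_bytes: int) -> int:
    return PAGE_BYTES // row_bytes

print(rows_per_leaf(64))    # narrow table, ~64 B/row  -> 256 rows/leaf
print(rows_per_leaf(4000))  # wide table, ~4 KB/row    -> 4 rows/leaf
```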

  5. Secondary indexes point at primary keys, not rows. "For [secondary indexes], the key is the column(s) that the user selected the index to be built for. The values are the primary key of the associated row." A query that matches a secondary index does two B+tree lookups: one on the secondary B+tree to get the primary key(s), then one on the clustered index per matching primary key to fetch the row. Example:

    CREATE TABLE user (
      user_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
      username VARCHAR(256) NOT NULL,
      email VARCHAR(256) NOT NULL,
      PRIMARY KEY (user_id)
    );
    CREATE INDEX email_index ON user(email);
    
    SELECT username FROM user WHERE email = 'x@planetscale.com' first walks email_index to find the user_id, then walks the primary B+tree to fetch the username. This is why SELECT * through a secondary index always costs two lookups, and why covering indexes (which include all projected columns in the secondary index itself) are a distinct optimisation.
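The two-lookup pattern can be sketched with plain dictionaries standing in for the two B+trees (illustrative only; real lookups are logarithmic tree descents, and the sample rows are invented):

```python
# Clustered index: primary key -> full row (in InnoDB this IS the table).
clustered_index = {
    1: {"user_id": 1, "username": "alice", "email": "x@planetscale.com"},
    2: {"user_id": 2, "username": "bob",   "email": "y@planetscale.com"},
}
# Secondary index: indexed column -> primary key (the leaf values).
email_index = {
    "x@planetscale.com": 1,
    "y@planetscale.com": 2,
}

def select_username_by_email(email: str) -> str:
    user_id = email_index[email]      # lookup 1: secondary B+tree
    row = clustered_index[user_id]    # lookup 2: clustered B+tree
    return row["username"]

print(select_username_by_email("x@planetscale.com"))  # alice
```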

  6. UUIDv4 primary keys punish B+trees on three axes. A 128-bit UUIDv4 is mostly random, so inserts:

     • Visit unpredictable nodes per insert: "The nodes visited for an insert are unpredictable ahead of time." Over many inserts, the working set approaches the full tree — every page must either stay cache-resident or be evicted and re-fetched. On a cold cache, this multiplies I/O per insert vs the sequential case.
     • Land in unpredictable leaves: "The destination leaf node for an insert is unpredictable." Node splits happen anywhere in the tree rather than concentrated at the right edge, producing write amplification across many pages.
     • Store values out of insertion order: "The values in the leaves are not in order." A time-range query (WHERE created > $START AND created < $END) against a random primary key fans out across many non-adjacent leaves — effectively random I/O even though the user query looks sequential. UUIDv3 / v5 (hash-generated) exhibit the same pattern. UUIDv7 (time-ordered) sidesteps most of it — leaves grow on the right like a sequential integer, with random bits only for same-millisecond ordering. (Source: article §"UUIDv4 vs BIGINT primary key".)

  7. BIGINT AUTO_INCREMENT makes inserts right-edge-only. With a sequential integer PK: "We always follow the right-most path when inserting new values. Leaves only get added on the right side of the tree. At the leaf level, data is in sorted order based on when it was inserted." Consequence: "many insertions happening in sequence will revisit the same path of pages, leading to fewer I/O requests when inserting a lot of key/value pairs." The bar chart of "unique nodes visited for the previous 5 inserts" quantifies the asymmetry — random inserts visit more unique nodes per batch, so they require more cache-resident pages and more disk reads on a miss. See patterns/sequential-primary-key.

  8. Time-range queries only sequential-scan with a sequential key. "It's common to search for data from a database in time-sequenced order. Consider viewing the timeline on X, or a chat history in Slack." An ORDER BY sent DESC query between two datetimes on a sequentially-keyed table reads a small number of contiguous leaves — the B+tree leaf linked list makes this a sequential scan. With UUIDv4 the same query fans out; the leaf linked list doesn't help because the matching values aren't in neighbouring leaves.

  9. Key size matters structurally. "We could fit 4 UUIDs (plus 4 child pointers) in each node. … If we had used a BIGINT instead, we could fit 6 keys (and corresponding child pointers) in each node instead. This would lead to a shallower tree, better for performance." UUIDs are 16 bytes (128-bit); BIGINT is 8 bytes. Since node size is fixed (16 KB in InnoDB), halving key width raises fan-out (6 keys vs 4 in the article's 100-byte example; child pointers also take space, so fan-out doesn't quite double), and higher fan-out means a shallower tree for the same row count. Dicken gives three integer sizing rules: MEDIUMINT for up to 16M rows, INT for up to 4B rows, BIGINT for 18.4 quintillion — "big enough to never face exhaustion, small enough to not use excessive storage."

  10. The buffer pool is the load-bearing cache. "Whenever it needs to access a piece of data, [InnoDB] loads the entire associated page from disk." The buffer pool is an in-memory cache of InnoDB pages sitting between disk and query execution. "Without it, we'd end up doing significantly more disk I/O operations to handle a query workload." Even with the buffer pool, minimising pages-visited-per-query helps: "there's still a (small) cost to looking up a page in the buffer pool" plus "it helps reduce the number of buffer pool loads and evictions that need to take place." This is the mechanism by which shallow trees + clustered layout + sequential PKs pay off under memory pressure.
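The sequential-vs-random insert asymmetry above can be simulated with a toy leaf level: fixed-capacity pages over one sorted key list, counting how many distinct pages the last few inserts touch (a sketch only; page boundaries here are positional, not real InnoDB node splits, and the window of 5 mirrors the article's bar chart):

```python
import bisect
import random

PAGE_CAPACITY = 16   # keys per toy "page" (hypothetical, not 16 KB math)

def unique_pages_for_last_inserts(keys, window=5):
    """Insert keys one at a time into a sorted leaf level and report how
    many distinct pages the final `window` inserts landed in."""
    leaf_level = []
    touched = []
    for key in keys:
        pos = bisect.bisect(leaf_level, key)
        leaf_level.insert(pos, key)
        touched.append(pos // PAGE_CAPACITY)   # which page got the write
    return len(set(touched[-window:]))

random.seed(42)
sequential = list(range(1000))               # BIGINT AUTO_INCREMENT order
shuffled = random.sample(range(1000), 1000)  # UUIDv4-like arrival order

seq_pages = unique_pages_for_last_inserts(sequential)
rand_pages = unique_pages_for_last_inserts(shuffled)
print(seq_pages)   # 1  (right-most page only)
print(rand_pages)  # almost always > 1: inserts scatter across pages
```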

Operational numbers

  • Default InnoDB page size: 16 KB (long-standing default; configurable via innodb_page_size since MySQL 5.6).
  • Three-level B-tree with 16-KB nodes + 8-byte keys/values/pointers: 682³ ≈ 317 M key/value pairs. (Worked example from article.)
  • Fan-out difference at 100-byte node, 8-byte child pointer, 8-byte value: 4 UUIDs per node vs 6 BIGINTs per node — 50% more fan-out from halving key width.
  • Primary key sizes compared: MEDIUMINT 3 bytes → 16M values; INT 4 bytes → 4B values; BIGINT 8 bytes → 18.4 × 10¹⁸ values; UUID 16 bytes.
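The 100-byte-node comparison reproduces directly (each entry taken as one key plus an 8-byte child pointer, per the article's example):

```python
# UUID (16 B) vs BIGINT (8 B) keys in a hypothetical 100-byte node:
# 100 // (16 + 8) = 4 entries vs 100 // (8 + 8) = 6 entries.
NODE_BYTES = 100
POINTER_BYTES = 8

def entries_per_node(key_bytes: int) -> int:
    return NODE_BYTES // (key_bytes + POINTER_BYTES)

print(entries_per_node(16))  # UUID   -> 4 keys per node
print(entries_per_node(8))   # BIGINT -> 6 keys per node
```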

Caveats

  • Pedagogical / architecture post — no production retrospective numbers (no buffer-pool-hit-rate percentages, no split percentages observed in PlanetScale fleet, no UUID-vs-BIGINT insert-rate A/B, no latency histograms). The bar-chart datum ("unique nodes visited for the previous 5 inserts") is from the interactive visualisation in the post, not a production sample.
  • MySQL/InnoDB-centric. Postgres, MongoDB, and DynamoDB are named as "rely[ing] on B-trees" but their specific internal variations (e.g. Postgres heap-organised tables with separate PK index vs InnoDB clustered index; MongoDB's WiredTiger B-tree storage engine with document-level locking + MVCC) are not walked. Postgres does not cluster by primary key by default; its heap-organised layout means the primary-key argument here is InnoDB-specific.
  • UUIDv7 mentioned, not detailed. The post points at another PlanetScale article (the-problem-with-using-a-uuid-primary-key-in-mysql) for the UUIDv7 mitigation story; the current post treats it as a forward pointer only.
  • Secondary-index write amplification not quantified. Every INSERT must update every secondary index — each is its own B+tree with its own insert-path / split mechanics. The post names this structurally but doesn't analyse secondary-index proliferation cost.
  • Row-larger-than-16-KB case deferred. "InnoDB also supports rows being larger than a disk block, but we won't dig into that in this post." Off-page storage for large TEXT / BLOB columns is how InnoDB handles this.
  • "Node" vs "page" terminology. The post uses both interchangeably once InnoDB is introduced. Strictly, a node is the tree-structural unit and a page is the I/O unit; in InnoDB these coincide.
  • user.created_at and user.email_address as keys. The post extrapolates to non-PK indexes at the end: created_at behaves like a sequential integer (monotonic inserts), email behaves like random (alphabetical inserts don't match account-creation order). Useful framing for secondary-index choice but unsupported by numbers in this post.
