Skip to content

PLANETSCALE 2022-07-14

Read original ↗

PlanetScale — How do Database Indexes Work?

Summary

A pedagogical explainer by Justin Gage (July 2022) that walks the full mental model of database indexes from first principles — primarily targeting the MySQL / InnoDB reader but explicit on where Postgres diverges. The post develops four load-bearing framings the wiki has previously only captured partially: (1) indexes as "sort by an extra column" — because a table can only be physically sorted by one column (the primary key), indexes are how you buy binary-search efficiency on additional filter columns; (2) heap vs clustered storage as an explicit taxonomic contrast (MS SQL Server / Azure SQL Database even exposes a heap toggle) with the MySQL-vs-Postgres primary-key storage divergence named directly — "In MySQL, primary keys are stored with their data … In Postgres, primary keys are treated like other indexes and stored in separate data structures"; (3) covering index / index-only scan as a distinct read-path optimisation that avoids the clustered-index second fetch — terminology mapped (Postgres: index-only scan; MySQL: covering index); (4) the full taxonomy of MySQL index types — non-unique (KEY), UNIQUE, FULLTEXT (with the MATCH() AGAINST() search modes: natural language / boolean / query expansion), plus prefix and descending indexes as footnotes. Cross-cutting is a write-side cost discipline: indexes take space "precious … you'll need to keep an eye on how close you're inching towards your filesystem's limit", they slow INSERT queries, and they force the query planner to consider more execution plans. The article also names SHOW INDEX and EXPLAIN as the operational inspection tools — EXPLAIN's possible_keys / key output is the canonical way to verify an index is actually being used. Complements the 2024-09-09 Ben Dicken B-tree post (sources/2024-09-09-planetscale-b-trees-and-database-indexes) — same company, same topic, two years earlier, different angle: Gage focuses on what indexes are and how to use them; Dicken focuses on how B+trees actually store them and why primary-key choice matters. Together they form the canonical PlanetScale pedagogy for database indexes on the wiki.

Key takeaways

  1. Indexes buy binary-search efficiency on non-primary-key columns. The worked example is canonical: a users table with hundreds of thousands of rows, a query filtered on most_recent_activity takes 5–6 seconds because without an index the scan is "roughly O(N/2) on average". Creating an index on that column flips the complexity to "O log(N) efficient, and only takes a second or less to run." Gage's framing makes the value proposition quantitative and immediate.

  2. Indexes are "a way of letting your table be sorted by multiple columns." "Data can, of course, only be sorted by one column, which limits the 'binary' efficiency of sorting (with unique values) to a query that filters on that one, ordered column. An index is basically a way of letting your table be sorted by multiple columns, which lets you get binary search efficiency on multiple filter columns." This is the clearest verbal framing in the wiki corpus for why indexes exist — they are secondary sort orders ancillary to the primary physical storage order.

  3. Heap vs clustered storage is an explicit toggle in some engines. "Without an index, data in a database usually gets stored as a heap, basically a pile of unordered, unsorted rows. In fact, this is actually [a setting you can toggle in Microsoft SQL Server and Azure SQL Database]." Canonical taxonomic framing for the heap-organized-vs- clustered split — MS SQL's dual support is the citation that makes this a configurable engine concern rather than a vendor-specific quirk.

  4. MySQL vs Postgres diverge on primary-key storage. "MySQL and Postgres have a pretty major difference between them in how they handle primary key storage. In MySQL, primary keys are stored with their data, which means they effectively take up no extra storage space. In Postgres, primary keys are treated like other indexes and stored in separate data structures." The cleanest one-paragraph statement of the InnoDB-clustered-vs- Postgres-heap divergence that the 2024-09-09 Dicken post uses as its backdrop — Gage canonicalises it here with two short sentences.

  5. Index as a "new pseudo-table with two columns." The mental model: "MySQL goes and creates a new pseudo-table in the database with two columns: a value for most recent activity, and a pointer to the record in the users table. The trick here, though, is that the table gets ordered and stored as a binary tree, with ordered values for that most recent activity column." Narrower than the 2024-09-09 "every index is a separate B+tree" framing (Gage says "binary tree" loosely) but more accessible — the pseudo-table metaphor is useful for readers new to index internals.

  6. Pointer size is small: <5 bytes in MySQL. "Pointers in MySQL are pretty small, usually fewer than 5 bytes." Links to the legendary Stack Overflow post for the full block-count math. This is the index- compactness lever quantified from the pointer-width side — small pointers mean the width of the index pseudo- table is much smaller than the width of the full row, so traversing the index touches fewer disk blocks.

  7. Index-only scan (Postgres) = covering index (MySQL). "An [index only scan] ([covering index] in MySQL) refers to when the value your query is looking for is contained solely in the index, and thus doesn't require a table lookup. In PostgreSQL, not all index types support index-only scans." Canonical terminology-mapping for the same concept across the two engines, and the first wiki page to give the concept a dedicated treatment. The caveat — "not all index types support index-only scans" — is Postgres-specific and flagged as a limitation.

  8. Indexes are a tax on writes. "While they're beneficial for read queries (including JOINs and aggregates), they can negatively impact your database performance as a whole if you're not careful. Indexes take up a lot of space … they also make INSERT queries take longer and force the query engine to consider more options before choosing how to execute a query." Three-part write-side cost framing: storage (index size competes with data for filesystem capacity), write latency (every indexed column updated on every INSERT / UPDATE), planner overhead (each index is a candidate the planner must evaluate).

  9. Index types in MySQL: non-unique (KEY), UNIQUE, FULLTEXT, plus prefix/descending. "You may see 'key' used interchangeably with 'index.' A KEY in MySQL is a type of non-unique index. Because the index isn't unique, scanning through it won't be as efficient as a binary tree, but it will likely be more efficient than linear search. Many columns in your tables will likely fit this bill, e.g. something like 'first name' or 'location.'" Gage notes MySQL supports up to 16 columns per composite index. Unique indexes exist both for binary-search efficiency and as a uniqueness constraint enforcement mechanism — "trying to insert a non-unique value into a column with a unique index on it will throw an error."

  10. FULLTEXT indexes with MATCH() AGAINST() have three search modes. "FULLTEXT qualifier will create a text index in most popular relational databases. It can only be applied to text-type columns (CHAR, VARCHAR, or TEXT) in MySQL. Where things get interesting is using these indexes: MySQL provides a lot of functionality out of the box that starts to resemble what you'd expect from a modern text parsing / NLP library." The three modes: natural language search (default, uses stopwords, no special operators), boolean search ("uses a special query language that's roughly analogous to RegEx (but very different)"), query expansion search ("runs a regular natural language search, then another one using words from the returned rows"). First wiki canonicalisation of the FULLTEXT primitive in MySQL.

  11. CREATE INDEX syntax is feature-rich. Gage's reproduced template shows the full MySQL CREATE INDEX grammar: [UNIQUE | FULLTEXT | SPATIAL], index_type (USING BTREE | HASH), algorithm_option ({DEFAULT | INPLACE | COPY}), lock_option ({DEFAULT | NONE | SHARED | EXCLUSIVE}), key_part with optional ASC | DESC per column, KEY_BLOCK_SIZE, WITH PARSER, COMMENT, {VISIBLE | INVISIBLE}, ENGINE_ATTRIBUTE. Most users stick to a minimal CREATE INDEX name ON table(col) — the rest of the surface is available when needed.

  12. SHOW INDEX FROM table + EXPLAIN are the operational introspection primitives. "You can see any indexes that exist on a particular table with: SHOW INDEX FROM table_name FROM db_name;" For query-side verification: "If you run an EXPLAIN statement on your query, you should also see information about which indexes the query plans to use." Gage reproduces the full EXPLAIN column table, flagging possible_keys (candidate indexes) and key (the index actually chosen) as the fields to check. "This way of using EXPLAIN can also be useful for debugging when creating an index, i.e. verifying that a new one is working as intended."

Operational numbers

  • Pointer size in MySQL secondary indexes: < 5 bytes typically. (Gage, linking to the canonical Stack Overflow post for the full math.)
  • Row count at which a scan starts hurting: hundreds of thousands of users with 5–6-second un-indexed scans vs sub-second indexed scans in the post's worked example.
  • MySQL composite-index column limit: 16. "MySQL actually supports up to 16 columns in an index."
  • Complexity: linear scan O(N/2) on average, binary tree / B-tree scan O log(N).

Caveats

  • Pedagogical, not a production retrospective. No production QPS / latency / storage figures from the PlanetScale fleet. All numbers (O(N/2), 5–6 sec vs sub-second, <5-byte pointers, 16-column composite limit) are reference-level.
  • "Binary tree" is used loosely. Gage says "the table gets ordered and stored as a binary tree, with ordered values" — technically the data structure is a B-tree (or more accurately a B+tree in InnoDB), not a pure binary tree. The abstraction holds at the "O log(N) lookup" level but readers should mentally substitute B-tree / B+tree when reading the MySQL sections. The 2024-09-09 Dicken post (sources/2024-09-09-planetscale-b-trees-and-database-indexes) is the technically-precise counterpart.
  • MySQL/InnoDB-centric with Postgres-divergence call-outs; no coverage of other engines (MongoDB / DynamoDB / SQLite / TiDB). The B-tree → hash-index alternative in CREATE INDEX USING HASH is shown in the grammar but not discussed operationally.
  • FULLTEXT treatment is surface-level. The three search modes are named (natural language / boolean / query expansion), stopwords mentioned, MATCH()/AGAINST() syntax shown, but the underlying inverted-index structure that makes FULLTEXT different from a B-tree index is not walked. For depth on inverted indexes see systems/elasticsearch or systems/lucene.
  • Covering index / index-only scan not contrasted with the non-covering case. The post names the concept and the Postgres-vs-MySQL terminology but doesn't walk a worked example of how the secondary-index leaves are widened to include projected columns.
  • Write-amplification cost is named, not quantified. "Indexes take up a lot of space" and "make INSERT queries take longer" — no ratio (e.g. "N secondary indexes multiply INSERT cost by N+1"), no benchmark.
  • Prefix and descending indexes are mentioned, not walked. The two-paragraph footnote points at the MySQL docs for follow-up. Prefix indexes specifically are the BLOB / TEXT indexing workaround that is canonicalised elsewhere (concepts/blob-text-index-prefix-requirement).
  • Vintage: 2022-07-14. The post predates significant PlanetScale product evolution (Postgres support, Metal, Insights tagging) — treat the MySQL examples as canonical and the Postgres references as historical baseline.

Source

Last updated · 347 distilled / 1,201 read