CONCEPT Cited by 2 sources

Secondary index¶

A secondary index is a database index on a non-primary-key column (or column set). In a system with a clustered index like MySQL's InnoDB, a secondary index is itself a B+tree — keyed on the indexed column, with primary-key values stored in its leaves:

The key is the column(s) that the user selected the index to be built for. The values are the primary key of the associated row.

—

Two-step lookup¶

Every query that matches a secondary index does two B+tree walks in a clustered-index system:

Walk the secondary B+tree with the predicate value to find matching primary key(s).
Walk the clustered index (the table itself) with each primary key to fetch the full row.

Worked example from the article:

CREATE TABLE user (
  user_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  username VARCHAR(256) NOT NULL,
  email VARCHAR(256) NOT NULL,
  PRIMARY KEY (user_id)
);
CREATE INDEX email_index ON user(email);

-- Query:
SELECT username FROM user WHERE email = 'x@planetscale.com';

The query engine first walks email_index (a separate B+tree keyed on email with user_id values), then walks the primary-key B+tree with the user_id to fetch the row and project username. Two B+tree traversals per matching row.

Covering indexes avoid the second walk¶

A covering index is a secondary index that includes all columns the query projects, stored inside the secondary-index leaves. Queries satisfied entirely from the secondary index skip the clustered-index lookup entirely. The trade-off:

Read: faster — one B+tree walk instead of two.
Write: slower — every update to a covered column must update the secondary index.
Storage: larger — secondary index now holds more data.

InnoDB covering indexes are typically expressed as composite indexes: CREATE INDEX cover ON user(email, username).

Write amplification¶

Every INSERT / UPDATE of an indexed column writes to the clustered index and every secondary index. Each secondary index is a separate B+tree with its own insert path and split mechanics. Costs that grow with secondary-index count:

Insert I/O: each index walked + modified per row.
Buffer-pool pressure: each index's working set of hot pages competes for cache.
Lock contention: each index has its own locks.
Storage: each index is a separate B+tree.

Practical rule: secondary indexes aren't free — every one added is a tax on every write to that table.

Primary-key size cascades into secondary-index size¶

Because every secondary index leaf stores the primary-key value, the PK choice affects the size of every secondary index:

Primary key	Leaf entry size (index + PK pointer)
8-byte `BIGINT`	Indexed column + 8 bytes
16-byte UUID	Indexed column + 16 bytes

For a table with 5 secondary indexes, switching the PK from BIGINT to UUID adds 8 bytes × 5 indexes × row count to total storage — before counting fan-out reductions from larger key sizes.

Seen in¶

— Justin Gage (PlanetScale, 2022-07-14) canonicalises the full mental model of secondary indexes from first principles at pedagogy-101 altitude: indexes as "sort-by-additional-column" (because a table can physically be sorted by only one column, the PK, indexes buy binary-search efficiency on other filter columns), the "pseudo-table with two columns" mental model (indexed value + PK pointer), the heap-vs-clustered taxonomic split (MS SQL / Azure SQL Database even expose it as a toggle), the MySQL-vs-Postgres PK-storage divergence ("In MySQL, primary keys are stored with their data … In Postgres, primary keys are treated like other indexes"), the covering-index / index-only- scan optimisation mapped across both engines, and the three-part write-side cost discipline (storage / write latency / planner overhead). Pointer size quantified: <5 bytes in MySQL secondary indexes. Worked complexity numbers: linear scan O(N/2) → B-tree scan O log(N); worked example: 300K-row table, 5–6-sec unindexed scan → sub-second indexed scan. Full MySQL index-type taxonomy canonicalised — non-unique (KEY), UNIQUE, FULLTEXT (with three MATCH() AGAINST() modes: natural-language / boolean / query-expansion), plus prefix / descending as footnotes. Composite-index column limit: 16. Operational introspection canonicalised: SHOW INDEX FROM table (enumerate existing indexes) + EXPLAIN (verify index use via possible_keys / key) → new patterns/explain-driven-index-verification pattern. Aaron-Francis-style companion to Dicken's 2024-09-09 B-trees post — Gage (2022) teaches what indexes are and how to use them; Dicken (2024) teaches how B+trees actually store them. Together they are the canonical PlanetScale pedagogy for database indexes on the wiki.
— JD Lien (PlanetScale, 2023-02-17) canonicalises the pedagogy-altitude trade-off catalogue for secondary indexes: three first-order costs (storage, write amplification, query-planner complexity) paired with a two-step safe-drop workflow — (1) detect candidates via the information_schema.STATISTICS cardinality heuristic and (2) validate each candidate via ALTER INDEX ... INVISIBLE before the destructive DROP INDEX. Canonical pedagogical anchor "Every index takes up additional storage, can slow down write operations, and can complicate the query optimizer's job, so they aren't always guaranteed to improve performance." Worked operational number: a 1M-row bulk insert took 10-15s without indexes → ~2 minutes after adding several — the ~8-12× write-amplification tax quantified on a single production path. Composes with Dicken's 2024-09-09 (structural mechanism of write amplification) and Hazen's 2024-08-14

(production-telemetry approach) as the third altitude of the PlanetScale index-audit corpus — JD Lien is the pedagogy-altitude trade-off catalogue, Dicken is the structural-mechanism canon, Hazen is the production-telemetry canon.

— Savannah Longoria canonicalises EXPLAIN's key_len column as the only way to detect whether a composite secondary index is being used in full or only as a prefix — the column names the chosen index in key, but key_len (bytes) is the diagnostic that reveals whether later columns of the composite are binding the lookup or just post-filtering. The composite secondary index is canonicalised as a distinct shape from single-column secondary indexes via the (last_name, first_name) worked example (rows: 299,202 → 1); the post also names "two separate indexes" as the standard anti-pattern when a composite is needed. Canonical source for the composite-index for AND-predicate pattern — a composite secondary index is usually the right answer over Index Merge for WHERE A=x AND B=y workloads.
— Aaron Francis canonicalises the generated-hash-column pattern: an indexed BINARY(16) MD5 column is a compact fixed-width secondary index replacing a would-be-wide index on a TEXT column (which can only be prefix-indexed and therefore can't support strict equality). The composite variant (patterns/composite-hash-uniqueness-constraint) extends the shape: one 16-byte hash column with a single UNIQUE INDEX replaces a multi-column composite index as the multi-column uniqueness primitive. Canonical instance of the index-compactness lever — narrow index key = shallower B+tree = smaller hot working set = cheaper equality lookups, trading off any range/prefix/sort capability the original composite index would have provided.
— secondary-index structure in InnoDB; two-lookup query pattern; index creation syntax; primary-key size cascade.
systems/innodb — implementation substrate.
— Vitess's Lookup Vindex tables are literally relational tables with a secondary-index-like structure — (name, id) → keyspace_id or phone → keyspace_id stored as a separate MySQL table sharded by the lookup column. Canonical wiki example of the secondary-index shape escaping the storage-engine and becoming a sharding-layer routing primitive. The Consistent Lookup Vindex mechanism exists to keep these cross-shard "secondary indexes" consistent with their backing user table without 2PC.
sources/2026-04-21-planetscale-ai-powered-postgres-index-suggestions — production instance of the wiki's canonical discipline that secondary-index write-amplification is a real cost that has to be earned back by read-side wins before recommending a new index. PlanetScale Insights' AI index-suggestion feature encodes this as three pre-filter gates (rows-read:returned ratio, ≥ 0.1% runtime share, minimum frequency) because "indexes incur storage, memory and write overhead." Uses HypoPG + hypothetical-index evaluation to cost candidate indexes against Postgres's own planner before paying the write tax of actually creating them.
— Rafer Hazen (2024-08-14) canonicalises the production-observability counterpart to the static-time secondary-index decision: a per-query- pattern time-series surface (concepts/index-usage-time-series) showing which indexes are actually touched by the production workload, over time, in PlanetScale Insights' new Indexes tab. Captured via a patched InnoDB index_init() hook populating a per-query used-index set that rides on the final MySQL wire-protocol packet back to VTGate (patterns/handler-hook-sidecar-telemetry). Productises both directions of index attribution (patterns/bidirectional-index-query-attribution): per-query → indexes (the tab) + per-index → queries (the index:table.index predicate). Canonical drop-index signal: a secondary index showing 0% Insights hits across a week is a candidate for removal — but the SELECT-only caveat means operators must independently verify against UPDATE / DELETE paths (via EXPLAIN + performance_schema.table_io_waits_summary_by_index_usage) before acting. Canonical shift from the 2024-09-09 Dicken B-trees-altitude write-amplification framing (the structural cost of a secondary index) to the 2024-08-14 Hazen production-observability altitude (the measured return the workload gets from it) — the same decision surface, at two different altitudes.

Seen in — two-step lookup as deferred-join mechanism¶

— Mike Coutermarsh's FastPage post. The secondary index's two-step lookup property (walk the index B+tree to find primary keys, then walk the clustered index to hydrate full rows) is exactly the mechanism the deferred-join optimisation exploits. The rewrite forces the first query to run index-only (primary keys from the secondary-index leaves, no clustered-index descent) and reserves the full-row hydration cost for only the N primary keys that actually survive the LIMIT filter — collapsing the hydration count from M + N to N. Quantified gain on a 1M-row Rails Post table at offset 100: 1,228.7 ms → 457.3 ms (2.7×). Published in the fast_page Ruby gem (planetscale/fast_page). Cross-references High Performance MySQL for the pedagogical source: "It lets the server examine as little data as possible in an index without accessing rows."

Seen in — cost-driven externalisation at petabyte scale¶

sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb — Secondary index as a cost-bound query surface that can be priced out of the base store at petabyte scale. Twilio Segment's objects pipeline needed a secondary index on (item-id, modified-timestamp) for warehouse-integration consumers of a ~1 PB / ~958B-item DynamoDB table, and explicitly rejected the in-database answer (DynamoDB Global Secondary Index) on cost grounds. Verbatim: "Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index ... But due to the very large size of our table, creating a GSI for the table is not cost-efficient." Canonical wiki instance of the economic sign-flip: below some threshold, the in-database secondary index is operationally free; above it, the index must be externalised. Segment materialised the index as a CDC changelog in Bigtable (V1) then S3 (V2) — see concepts/changelog-as-secondary-index and patterns/object-store-as-cdc-log-store.