Skip to content

PATTERN Cited by 1 source

Generated hash column for equality lookup

Pattern: when a column is too large or too wide to index directly for strict-equality lookups, add a MySQL generated column that holds a compact hash of the value, index the hash column, and issue equality queries against the hash. The compact index is dramatically smaller than the original column, fits comfortably in a B+tree, and lives fully inside the storage engine so the application sees only normal SQL.

The canonical substrate is MySQL / InnoDB; the canonical disclosure is Aaron Francis's PlanetScale post.

(Source: sources/2026-04-21-planetscale-generated-hash-columns-in-mysql.)

When to use

  • The searched column is TEXT / BLOB / LONGTEXT / JSON and therefore can only be indexed via a prefix index — which doesn't support strict equality on the full value.
  • The searched column is a long VARCHAR (500+ bytes) that the engine technically can index but a hash index is much narrower and cheaper to maintain.
  • The searched column stores normalised data (URLs, canonical file paths, normalised addresses) where equality is the dominant access pattern and range / sort / prefix queries are rare.

When not to use

  • Range queries or prefix matches (LIKE 'foo%', ORDER BY) are also needed — hashing destroys all ordering, so an actual B-tree index on (a prefix of) the real column is required.
  • Full-text search is needed — use a FULLTEXT index or a dedicated search system (systems/opensearch, systems/elasticsearch).
  • The column is already narrow (e.g. a 32-byte UUID) — the original column is already a cheap index key; hashing it adds bytes, not saves them.

Shape

CREATE TABLE visits (
  url      TEXT,
  url_md5  BINARY(16) AS (UNHEX(MD5(url))),  -- generated hash
  -- [other columns...]
  KEY(url_md5)
);

-- Application query:
SELECT * FROM visits WHERE url_md5 = UNHEX(MD5('https://planetscale.com'));

Three design decisions visible in the shape:

  1. Binary, not character. BINARY(16) over CHAR(32) — MD5 output is 128 bits; stored as hex it doubles to 32 bytes and drags in a character set and collation that don't matter for byte-equality. UNHEX(MD5(...)) round-trips through raw bytes.
  2. Generated, not hand-maintained. The hash value is derived from the row's own column via a deterministic function; MySQL regenerates it on every write. No dual-write consistency problem, no backfill race during schema change, no forgotten UPDATE.
  3. Indexed, not just stored. The hash is useless without KEY(url_md5) — the whole point is to have something that fits in a B+tree.

Upgrading hash strength

Algorithm Output bytes Column type Collision profile
CRC32 4 INT UNSIGNED Collisions common; reliability bad
MD5 16 BINARY(16) Cryptographically broken but collision-free in practice for any reasonable schema
SHA2(... , 256) 32 BINARY(32) Collision probability beneath hardware failure rate

If the correctness cost of an MD5 collision is non-zero, either upgrade to SHA-256 or pair the hash equality with a redundant predicate on the original column — see patterns/redundant-hash-plus-value-predicate and the concepts/redundant-condition-query-hint framing.

Interaction with functional indexes

MySQL 8.0.13+ offers functional indexes — the equivalent of this pattern with the hash column hidden inside the index. Runtime behaviour is identical; the named-column form is preferred when the hash is referenced by name in application SQL.

Composite-column variant

For multi-column equality or uniqueness (enforce that an address tuple is unique across four fields), the single- column shape extends to the composite-hash pattern — see patterns/composite-hash-uniqueness-constraint — which adds CONCAT_WS + UNIQUE INDEX on the hash column.

Seen in

Last updated · 347 distilled / 1,201 read