Skip to content

CONCEPT Cited by 1 source

Generated hash column

A generated hash column is a MySQL generated column whose expression is a deterministic hash function (MD5, CRC32, SHA2(... , 256), etc.) applied to one or more other columns in the same row. Its purpose is to compress a wide or long value into a compact fixed-width token that can be efficiently indexed and used as an equality-lookup key.

The technique is not about security — passwords and secrets need cryptographic hashing + salting + KDFs, not this — and Aaron Francis flags the point explicitly: "this technique has nothing to do with securely storing passwords or other sensitive information. Securely storing sensitive information is an entirely separate topic. We're using hashing functions to create very small, deterministic results for speedy lookups, not to protect information."

(Source: sources/2026-04-21-planetscale-generated-hash-columns-in-mysql.)

Motivating problem

Two shapes of problem land on the same solution:

  1. A column too large to index directly. TEXT and BLOB columns can't carry a full B-tree index — MySQL requires a prefix index instead, which only indexes the first N bytes and therefore can't support strict-equality lookups on the whole value. Indexing a fixed-width hash of the column gives you the equality path that a prefix index can't.
  2. Uniqueness or equality over many columns. A composite B-tree index across 5 or 10 columns is wide (every level of the tree carries all those bytes in its keys), whereas a composite hash of the same tuple is one fixed-width column with one narrow index, and can carry a UNIQUE constraint enforced at commit time.

The underlying principle MySQL's own documentation spells out:

As an alternative to a composite index, you can introduce a column that is "hashed" based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a "wide" index on many columns. — MySQL 8 manual §8.3.6

Single-column shape

CREATE TABLE visits (
  url TEXT,
  url_md5 BINARY(16) AS (UNHEX(MD5(url))),  -- generated
  -- [other columns...]
  KEY(url_md5)
);

SELECT * FROM visits WHERE url_md5 = UNHEX(MD5('https://planetscale.com'));

Two architectural choices visible in the shape:

  • BINARY(16) over CHAR(32). An MD5 hash is 128 bits = 16 bytes. Stored as hex characters under a character- column type like CHAR(32) it takes 32 bytes and carries a character set and collation — neither of which matter for byte-equality comparison. BINARY(16) + UNHEX round-trip halves the storage and skips the collation layer entirely. The same argument gives BINARY(32) for SHA-256 (256 bits / 8 = 32 bytes).
  • AS (UNHEX(MD5(url))) is deterministic and cheap enough that either VIRTUAL or STORED works. The cost trade-off is Aaron Francis's standard guideline — hashing a long TEXT on every read is expensive, so STORED amortises it to write time.

Multi-column shape

For multi-column equality or uniqueness:

ALTER TABLE addresses ADD COLUMN address_hash BINARY(16)
  GENERATED ALWAYS AS (
    UNHEX(MD5(
      CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line)
    ))
  );
ALTER TABLE addresses ADD UNIQUE INDEX (address_hash);

The CONCAT_WS ("concat with separator") choice is load-bearing — see patterns/composite-hash-uniqueness-constraint — because plain CONCAT produces two distinct inputs ("a","b","c" vs "ab","c") that would collapse to the same hash, and CONCAT returns NULL if any argument is NULL, whereas CONCAT_WS skips NULLs and emits the separator between non-null arguments.

Collision handling

MD5 collisions are vanishingly unlikely in practice but are not zero, and CRC32 collisions are materially more likely because CRC32 outputs only 32 bits. Three production responses on a spectrum:

  1. Tolerate the risk. For most query-performance- oriented hash columns, MD5 collision probability is below hardware-failure rates; accept it and move on.
  2. Pair the hash equality with a full-value predicate — the patterns/redundant-hash-plus-value-predicate / redundant condition pattern. The query becomes WHERE url_md5 = UNHEX(MD5(?)) AND url = ?. The second predicate narrows the result set to the actual match in the (impossibly rare) collision case; the first predicate lets the planner use the index.
  3. Switch to a wider hash. SHA-256 (BINARY(32) via UNHEX(SHA2(... , 256))) has collision probability small enough to be treated as zero for any practical schema.

Write and storage cost

Every indexed generated hash column pays the concepts/secondary-index write-amplification tax: each INSERT/UPDATE that touches the underlying columns triggers a rehash and an index update. The trade is worth taking when the query workload is dominated by equality lookups on wide values; it's not worth taking for tables that are mostly-write-rarely-read or for columns whose equality-lookup traffic is negligible.

Seen in

  • sources/2026-04-21-planetscale-generated-hash-columns-in-mysql — canonical wiki introduction. Aaron Francis walks through the single-column URL case (url_md5) and the multi- column CASS-address case (address_hash), including the BINARY(16) vs CHAR(32) storage trade-off, UNHEX(MD5(...)) as the canonical shape, collision-safe redundant conditions, and SHA-256 upgrade path.
  • systems/mysql — the substrate.
  • systems/innodb — storage engine under the secondary index.
Last updated · 347 distilled / 1,201 read