CONCEPT Cited by 1 source
Generated hash column¶
A generated hash column is a
MySQL generated column
whose expression is a deterministic hash function (MD5,
CRC32, SHA2(... , 256), etc.) applied to one or more
other columns in the same row. Its purpose is to compress
a wide or long value into a compact fixed-width token that
can be efficiently indexed and used as an equality-lookup
key.
The technique is not about security — passwords and secrets need cryptographic hashing + salting + KDFs, not this — and Aaron Francis flags the point explicitly: "this technique has nothing to do with securely storing passwords or other sensitive information. Securely storing sensitive information is an entirely separate topic. We're using hashing functions to create very small, deterministic results for speedy lookups, not to protect information."
(Source: sources/2026-04-21-planetscale-generated-hash-columns-in-mysql.)
Motivating problem¶
Two shapes of problem land on the same solution:
- A column too large to index directly.
TEXTandBLOBcolumns can't carry a full B-tree index — MySQL requires a prefix index instead, which only indexes the first N bytes and therefore can't support strict-equality lookups on the whole value. Indexing a fixed-width hash of the column gives you the equality path that a prefix index can't. - Uniqueness or equality over many columns. A
composite B-tree index across 5 or 10 columns is wide
(every level of the tree carries all those bytes in its
keys), whereas a
composite
hash of the same tuple is one fixed-width column with
one narrow index, and can carry a
UNIQUEconstraint enforced at commit time.
The underlying principle MySQL's own documentation spells out:
As an alternative to a composite index, you can introduce a column that is "hashed" based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a "wide" index on many columns. — MySQL 8 manual §8.3.6
Single-column shape¶
CREATE TABLE visits (
url TEXT,
url_md5 BINARY(16) AS (UNHEX(MD5(url))), -- generated
-- [other columns...]
KEY(url_md5)
);
SELECT * FROM visits WHERE url_md5 = UNHEX(MD5('https://planetscale.com'));
Two architectural choices visible in the shape:
BINARY(16)overCHAR(32). An MD5 hash is 128 bits = 16 bytes. Stored as hex characters under a character- column type likeCHAR(32)it takes 32 bytes and carries a character set and collation — neither of which matter for byte-equality comparison.BINARY(16)+UNHEXround-trip halves the storage and skips the collation layer entirely. The same argument givesBINARY(32)for SHA-256 (256 bits / 8 = 32 bytes).AS (UNHEX(MD5(url)))is deterministic and cheap enough that eitherVIRTUALorSTOREDworks. The cost trade-off is Aaron Francis's standard guideline — hashing a longTEXTon every read is expensive, soSTOREDamortises it to write time.
Multi-column shape¶
For multi-column equality or uniqueness:
ALTER TABLE addresses ADD COLUMN address_hash BINARY(16)
GENERATED ALWAYS AS (
UNHEX(MD5(
CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line)
))
);
ALTER TABLE addresses ADD UNIQUE INDEX (address_hash);
The CONCAT_WS ("concat with separator") choice is
load-bearing — see
patterns/composite-hash-uniqueness-constraint — because
plain CONCAT produces two distinct inputs ("a","b","c"
vs "ab","c") that would collapse to the same hash, and
CONCAT returns NULL if any argument is NULL, whereas
CONCAT_WS skips NULLs and emits the separator between
non-null arguments.
Collision handling¶
MD5 collisions are vanishingly unlikely in practice but are not zero, and CRC32 collisions are materially more likely because CRC32 outputs only 32 bits. Three production responses on a spectrum:
- Tolerate the risk. For most query-performance- oriented hash columns, MD5 collision probability is below hardware-failure rates; accept it and move on.
- Pair the hash equality with a full-value predicate
— the
patterns/redundant-hash-plus-value-predicate /
redundant
condition pattern. The query becomes
WHERE url_md5 = UNHEX(MD5(?)) AND url = ?. The second predicate narrows the result set to the actual match in the (impossibly rare) collision case; the first predicate lets the planner use the index. - Switch to a wider hash. SHA-256 (
BINARY(32)viaUNHEX(SHA2(... , 256))) has collision probability small enough to be treated as zero for any practical schema.
Write and storage cost¶
Every indexed generated hash column pays the
concepts/secondary-index write-amplification tax: each
INSERT/UPDATE that touches the underlying columns
triggers a rehash and an index update. The trade is worth
taking when the query workload is dominated by equality
lookups on wide values; it's not worth taking for tables
that are mostly-write-rarely-read or for columns whose
equality-lookup traffic is negligible.
Seen in¶
- sources/2026-04-21-planetscale-generated-hash-columns-in-mysql
— canonical wiki introduction. Aaron Francis walks through
the single-column URL case (
url_md5) and the multi- column CASS-address case (address_hash), including theBINARY(16)vsCHAR(32)storage trade-off,UNHEX(MD5(...))as the canonical shape, collision-safe redundant conditions, and SHA-256 upgrade path. - systems/mysql — the substrate.
- systems/innodb — storage engine under the secondary index.
Related¶
- concepts/generated-column-mysql
- concepts/blob-text-index-prefix-requirement
- concepts/redundant-condition-query-hint
- concepts/functional-index-mysql
- concepts/secondary-index
- patterns/generated-hash-column-for-equality-lookup
- patterns/composite-hash-uniqueness-constraint
- patterns/redundant-hash-plus-value-predicate
- systems/mysql
- systems/innodb