PlanetScale — Generated Hash Columns in MySQL¶
Summary¶
Aaron Francis's PlanetScale pedagogical post (originally
published 2023-06-15, re-surfaced via the 2026-04-21 feed
snapshot) canonicalises a MySQL schema-design pattern the
MySQL 8 manual
buries in a single remark in §8.3.6: when a column is too
large to index directly or a tuple is too wide for a
composite index to pay off, introduce a
generated column holding
a compact hash of the value(s), index the hash, and use
the compact index for strict-equality lookups and
uniqueness enforcement. The post walks the single-column
case (patterns/generated-hash-column-for-equality-lookup,
URL stored in TEXT, url_md5 BINARY(16) AS (UNHEX(MD5(url)))
plus KEY(url_md5)), the multi-column case
(patterns/composite-hash-uniqueness-constraint, CASS-
normalised addresses hashed via CONCAT_WS('|', ...) with a
UNIQUE INDEX enforcing tuple uniqueness at commit), the
redundant-condition
technique for eliminating any residual MD5-collision risk
(patterns/redundant-hash-plus-value-predicate, WHERE
url_md5 = UNHEX(MD5(?)) AND url = ?), and the
MySQL 8.0.13+
functional-index
alternative that inlines the same mechanism with no named
column. Three structural arguments run through the piece:
(1) BINARY(N) over CHAR(2N) because byte-equality
comparison doesn't care about character-set or collation;
(2) generated-column-is-always-in-sync over hand-maintained
denormalisation; (3) the hash is a query-performance
primitive, not a security primitive — "this technique has
nothing to do with securely storing passwords or other
sensitive information." Aaron Francis joins the wiki's
PlanetScale named-voice roster with his second canonical
ingest after the 2023-05-18 charset + collation post.
Key takeaways¶
- MySQL generated columns express derived values the
engine keeps in sync. "A generated column can be
considered a calculated, computed, or derived column.
It is a column whose value results from an expression
rather than direct data input… The value in the
generated column is always up to date and can never fall
out of sync." Canonical syntax:
col_name data_type [GENERATED ALWAYS] AS (expr) [VIRTUAL | STORED]. The expression must be scalar and deterministic — noNOW(), noRAND(). Canonical generated column concept. VIRTUALvsSTOREDis a cheap-vs-expensive trade-off.VIRTUAL(the default) stores no bytes and recomputes on every read;STOREDwrites the computed value to disk and reads it directly. "Performance metrics ofVIRTUALorSTOREDare highly dependent on the situation, but a good rule of thumb is that if it's expensive to calculate the value, store it." Both shapes are indexable; a virtual column's index stores the computed value in its leaves even though the row doesn't.- The MySQL manual itself recommends the hash-column
pattern for long strings. "Rather than testing for
equality against a very long text string, you can store
a hash of the column value in a separate column, index
that column, and test the hashed value in queries. (Use
the
MD5()orCRC32()function to produce the hash value.)" — MySQL 8 manual, Optimising BLOB. The vendor endorses the pattern precisely becauseTEXT/BLOBcolumns can't carry a full B-tree index; they require a prefix index, which doesn't support strict equality on the full value. - Security disclaimer is load-bearing. "This
technique has nothing to do with securely storing
passwords or other sensitive information. Securely
storing sensitive information is an entirely separate
topic. We're using hashing functions to create very
small, deterministic results for speedy lookups, not to
protect information." Password storage needs salted +
iterated cryptographic hashing (bcrypt, scrypt, argon2);
this pattern uses
MD5/CRC32precisely for speed, not cryptographic properties. - Binary column over character column:
BINARY(16)overCHAR(32)for MD5. An MD5 hash is 128 bits = 16 bytes. Stored as hex under aCHAR(32)it takes 32 bytes and drags in character set + collation machinery that means nothing for byte-equality comparison. "Since [byte equality] is true, storing the data in a character representation is unnecessary. Storing the value of the hash as aBINARY(16)is more efficient than storing it as aCHAR(32)." The round-trip isUNHEX(MD5(value))on write,WHERE hash_col = UNHEX(MD5(?))on read. SHA-256 scales the shape toBINARY(32)viaUNHEX(SHA2(value, 256)). - Composite hash over
CONCAT_WS, neverCONCAT. PlainCONCATproduces two hazards: ambiguous tuples (CONCAT('a','b','c')andCONCAT('ab','c')both return'abc') andNULLpropagation (CONCAT('ab', NULL, 'c')returnsNULL).CONCAT_WS('|', ...)fixes both — the separator keeps tuple boundaries distinct, andNULLs are skipped instead of nuking the whole result. The canonical CASS-addresses example:Canonicalised as patterns/composite-hash-uniqueness-constraint.ALTER TABLE addresses ADD COLUMN address_hash BINARY(16) GENERATED ALWAYS AS ( UNHEX(MD5( CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line) )) ); ALTER TABLE addresses ADD UNIQUE INDEX (address_hash); - Redundant conditions steer the optimiser without
changing results. "A redundant condition is a
condition that does not change the results of the query
but does provide MySQL with the option to choose a
different execution path, usually an index-assisted
path." Worked deployment: eliminate MD5 collision risk
while keeping the index-seek shape by adding the
original-column predicate alongside the hash predicate:
Canonicalised as concepts/redundant-condition-query-hint + patterns/redundant-hash-plus-value-predicate.
SELECT * FROM visits WHERE url_md5 = UNHEX(MD5('https://planetscale.com')) AND url = 'https://planetscale.com'; - Functional indexes (MySQL 8.0.13+) inline the same
mechanism without a named column. "Functional indexes
are implemented as virtual generated columns by MySQL,
so there is no performance difference. It's merely a
preference!" Syntax is double-parenthesised to
distinguish an expression from a column reference:
Canonical concepts/functional-index-mysql page.
ALTER TABLE addresses ADD INDEX address_hash_functional (( UNHEX(SHA2( CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line), 256 )) ));
Systems / concepts / patterns extracted¶
- Systems: MySQL (substrate — the engine whose generated-column + functional-index features the post canonicalises), InnoDB (the default storage engine where the hash column's B+tree secondary index materialises), PlanetScale (publisher; no PlanetScale-specific internals disclosed — the pattern is engine-agnostic MySQL reference material).
- Concepts (new): concepts/generated-column-mysql, concepts/generated-hash-column, concepts/blob-text-index-prefix-requirement, concepts/redundant-condition-query-hint, concepts/functional-index-mysql.
- Patterns (new): patterns/generated-hash-column-for-equality-lookup, patterns/composite-hash-uniqueness-constraint, patterns/redundant-hash-plus-value-predicate.
- Related existing wiki pages:
concepts/secondary-index (every indexed hash column
is a secondary-index on the hash),
concepts/b-plus-tree (why compact keys matter —
narrow index = shallower tree = fewer I/Os),
concepts/character-set +
concepts/collation (why
BINARYoverCHARfor hashes), patterns/database-as-final-arbiter-of-uniqueness (theUNIQUE INDEXon composite hash as commit-time serialisation).
Operational numbers¶
None disclosed. Pedagogy-voice post — no production measurements of hash-column lookup latency, no before/after query-plan comparisons, no secondary-index size comparisons between composite-column and composite- hash shapes, no collision-rate observations from running fleets.
Caveats¶
- Pedagogy-voice, no production retrospective. Aaron Francis walks through the mechanism and the trade-offs but doesn't back any claim with measurements from a running deployment.
- Collision-probability math elided. The phrase "vanishingly small" for MD5 is not quantified, and the CRC32 contrast is stated as "much higher" without numbers. In practice: MD5 has 128-bit output, birthday collision probability ~50% at 2^64 rows (~10^19) — beyond any plausible table size; CRC32 has 32-bit output, birthday collision probability ~50% at ~65,500 rows — reachable in many production tables and therefore not a safe choice without the redundant-predicate pattern.
- Range / sort / prefix queries not supported. The pattern destroys ordering by design; any schema needing range or prefix access on the hashed column still needs a separate index on a prefix or derivative of the original column.
- No guidance on when hash strength upgrade is worth
migration cost. Post names SHA-256 as the upgrade path
but doesn't discuss the zero-downtime migration shape
(add new
BINARY(32)generated column, backfill, dual read, cut over, drop old), which a production adopter would need. VIRTUALvsSTOREDdecision rule is qualitative. "If it's expensive to calculate, store it" is directional but doesn't offer a benchmarked threshold.CONCAT_WSseparator-collision hazard not flagged. If any source column can contain the chosen separator byte (the post uses|), tuple boundaries re-ambiguate. Production schemas typically use a non-ASCII separator or an ASCII control byte (0x1Funit separator) — the post omits this guidance.- Index-write amplification tax implicit. Every indexed hash column pays a secondary-index write tax; the post doesn't quantify it or discuss when the read- side wins don't earn it back.
- PostgreSQL comparison absent. Postgres has expression indexes on arbitrary expressions (including hash functions) natively, without generated columns — the pattern is native to Postgres in a way that MySQL retrofits via the generated-column / functional-index machinery. Post is MySQL-specific by design; no acknowledgement of the Postgres contrast.
Source¶
- Original: https://planetscale.com/blog/generated-hash-columns-in-mysql
- Raw markdown:
raw/planetscale/2026-04-21-generated-hash-columns-in-mysql-7d29edfb.md
Related¶
- systems/mysql
- systems/innodb
- systems/planetscale
- concepts/generated-column-mysql
- concepts/generated-hash-column
- concepts/blob-text-index-prefix-requirement
- concepts/redundant-condition-query-hint
- concepts/functional-index-mysql
- concepts/secondary-index
- concepts/b-plus-tree
- patterns/generated-hash-column-for-equality-lookup
- patterns/composite-hash-uniqueness-constraint
- patterns/redundant-hash-plus-value-predicate
- companies/planetscale