Skip to content

PLANETSCALE 2023-06-15 Tier 3

Read original ↗

PlanetScale — Generated Hash Columns in MySQL

Summary

Aaron Francis's PlanetScale pedagogical post (originally published 2023-06-15, re-surfaced via the 2026-04-21 feed snapshot) canonicalises a MySQL schema-design pattern the MySQL 8 manual buries in a single remark in §8.3.6: when a column is too large to index directly or a tuple is too wide for a composite index to pay off, introduce a generated column holding a compact hash of the value(s), index the hash, and use the compact index for strict-equality lookups and uniqueness enforcement. The post walks the single-column case (patterns/generated-hash-column-for-equality-lookup, URL stored in TEXT, url_md5 BINARY(16) AS (UNHEX(MD5(url))) plus KEY(url_md5)), the multi-column case (patterns/composite-hash-uniqueness-constraint, CASS- normalised addresses hashed via CONCAT_WS('|', ...) with a UNIQUE INDEX enforcing tuple uniqueness at commit), the redundant-condition technique for eliminating any residual MD5-collision risk (patterns/redundant-hash-plus-value-predicate, WHERE url_md5 = UNHEX(MD5(?)) AND url = ?), and the MySQL 8.0.13+ functional-index alternative that inlines the same mechanism with no named column. Three structural arguments run through the piece: (1) BINARY(N) over CHAR(2N) because byte-equality comparison doesn't care about character-set or collation; (2) generated-column-is-always-in-sync over hand-maintained denormalisation; (3) the hash is a query-performance primitive, not a security primitive — "this technique has nothing to do with securely storing passwords or other sensitive information." Aaron Francis joins the wiki's PlanetScale named-voice roster with his second canonical ingest after the 2023-05-18 charset + collation post.

Key takeaways

  1. MySQL generated columns express derived values the engine keeps in sync. "A generated column can be considered a calculated, computed, or derived column. It is a column whose value results from an expression rather than direct data input… The value in the generated column is always up to date and can never fall out of sync." Canonical syntax: col_name data_type [GENERATED ALWAYS] AS (expr) [VIRTUAL | STORED]. The expression must be scalar and deterministic — no NOW(), no RAND(). Canonical generated column concept.
  2. VIRTUAL vs STORED is a cheap-vs-expensive trade-off. VIRTUAL (the default) stores no bytes and recomputes on every read; STORED writes the computed value to disk and reads it directly. "Performance metrics of VIRTUAL or STORED are highly dependent on the situation, but a good rule of thumb is that if it's expensive to calculate the value, store it." Both shapes are indexable; a virtual column's index stores the computed value in its leaves even though the row doesn't.
  3. The MySQL manual itself recommends the hash-column pattern for long strings. "Rather than testing for equality against a very long text string, you can store a hash of the column value in a separate column, index that column, and test the hashed value in queries. (Use the MD5() or CRC32() function to produce the hash value.)" — MySQL 8 manual, Optimising BLOB. The vendor endorses the pattern precisely because TEXT / BLOB columns can't carry a full B-tree index; they require a prefix index, which doesn't support strict equality on the full value.
  4. Security disclaimer is load-bearing. "This technique has nothing to do with securely storing passwords or other sensitive information. Securely storing sensitive information is an entirely separate topic. We're using hashing functions to create very small, deterministic results for speedy lookups, not to protect information." Password storage needs salted + iterated cryptographic hashing (bcrypt, scrypt, argon2); this pattern uses MD5 / CRC32 precisely for speed, not cryptographic properties.
  5. Binary column over character column: BINARY(16) over CHAR(32) for MD5. An MD5 hash is 128 bits = 16 bytes. Stored as hex under a CHAR(32) it takes 32 bytes and drags in character set + collation machinery that means nothing for byte-equality comparison. "Since [byte equality] is true, storing the data in a character representation is unnecessary. Storing the value of the hash as a BINARY(16) is more efficient than storing it as a CHAR(32)." The round-trip is UNHEX(MD5(value)) on write, WHERE hash_col = UNHEX(MD5(?)) on read. SHA-256 scales the shape to BINARY(32) via UNHEX(SHA2(value, 256)).
  6. Composite hash over CONCAT_WS, never CONCAT. Plain CONCAT produces two hazards: ambiguous tuples (CONCAT('a','b','c') and CONCAT('ab','c') both return 'abc') and NULL propagation (CONCAT('ab', NULL, 'c') returns NULL). CONCAT_WS('|', ...) fixes both — the separator keeps tuple boundaries distinct, and NULLs are skipped instead of nuking the whole result. The canonical CASS-addresses example:
    ALTER TABLE addresses ADD COLUMN address_hash BINARY(16)
      GENERATED ALWAYS AS (
        UNHEX(MD5(
          CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line)
        ))
      );
    ALTER TABLE addresses ADD UNIQUE INDEX (address_hash);
    
    Canonicalised as patterns/composite-hash-uniqueness-constraint.
  7. Redundant conditions steer the optimiser without changing results. "A redundant condition is a condition that does not change the results of the query but does provide MySQL with the option to choose a different execution path, usually an index-assisted path." Worked deployment: eliminate MD5 collision risk while keeping the index-seek shape by adding the original-column predicate alongside the hash predicate:
    SELECT * FROM visits
      WHERE url_md5 = UNHEX(MD5('https://planetscale.com'))
        AND url = 'https://planetscale.com';
    
    Canonicalised as concepts/redundant-condition-query-hint + patterns/redundant-hash-plus-value-predicate.
  8. Functional indexes (MySQL 8.0.13+) inline the same mechanism without a named column. "Functional indexes are implemented as virtual generated columns by MySQL, so there is no performance difference. It's merely a preference!" Syntax is double-parenthesised to distinguish an expression from a column reference:
    ALTER TABLE addresses ADD INDEX address_hash_functional ((
      UNHEX(SHA2(
        CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line),
        256
      ))
    ));
    
    Canonical concepts/functional-index-mysql page.

Systems / concepts / patterns extracted

Operational numbers

None disclosed. Pedagogy-voice post — no production measurements of hash-column lookup latency, no before/after query-plan comparisons, no secondary-index size comparisons between composite-column and composite- hash shapes, no collision-rate observations from running fleets.

Caveats

  • Pedagogy-voice, no production retrospective. Aaron Francis walks through the mechanism and the trade-offs but doesn't back any claim with measurements from a running deployment.
  • Collision-probability math elided. The phrase "vanishingly small" for MD5 is not quantified, and the CRC32 contrast is stated as "much higher" without numbers. In practice: MD5 has 128-bit output, birthday collision probability ~50% at 2^64 rows (~10^19) — beyond any plausible table size; CRC32 has 32-bit output, birthday collision probability ~50% at ~65,500 rows — reachable in many production tables and therefore not a safe choice without the redundant-predicate pattern.
  • Range / sort / prefix queries not supported. The pattern destroys ordering by design; any schema needing range or prefix access on the hashed column still needs a separate index on a prefix or derivative of the original column.
  • No guidance on when hash strength upgrade is worth migration cost. Post names SHA-256 as the upgrade path but doesn't discuss the zero-downtime migration shape (add new BINARY(32) generated column, backfill, dual read, cut over, drop old), which a production adopter would need.
  • VIRTUAL vs STORED decision rule is qualitative. "If it's expensive to calculate, store it" is directional but doesn't offer a benchmarked threshold.
  • CONCAT_WS separator-collision hazard not flagged. If any source column can contain the chosen separator byte (the post uses |), tuple boundaries re-ambiguate. Production schemas typically use a non-ASCII separator or an ASCII control byte (0x1F unit separator) — the post omits this guidance.
  • Index-write amplification tax implicit. Every indexed hash column pays a secondary-index write tax; the post doesn't quantify it or discuss when the read- side wins don't earn it back.
  • PostgreSQL comparison absent. Postgres has expression indexes on arbitrary expressions (including hash functions) natively, without generated columns — the pattern is native to Postgres in a way that MySQL retrofits via the generated-column / functional-index machinery. Post is MySQL-specific by design; no acknowledgement of the Postgres contrast.

Source

Last updated · 347 distilled / 1,201 read