PATTERN Cited by 1 source
Index JSON key via generated column¶
Problem¶
MySQL's JSON data type
lets you store arbitrary documents in a single column, but
the column itself cannot be indexed directly — MySQL
has no Generalised Inverted
Index
(GIN) equivalent to Postgres's jsonb + GIN combination.
Without help, any WHERE properties->>"$.request.email" = ?
query table-scans.
When the workload filters consistently on a small, fixed
set of JSON paths (e.g. email, formType, postcode on
a form-submission log), the pragmatic answer is not to
switch substrates but to indirectly index specific keys
by materialising the extracted value into a B-tree
index.
Solution¶
Two interchangeable shapes; pick the one that fits the schema-documentation style of the codebase.
Shape A — Named generated column (MySQL 5.7+)¶
ALTER TABLE activity_log
ADD COLUMN email VARCHAR(255)
GENERATED ALWAYS AS (properties->>"$.request.email");
ALTER TABLE activity_log
ADD INDEX email (email) USING BTREE;
- The generated column is
VIRTUALby default — zero extra bytes in the row, expression evaluated on read. Index leaves store the computed value, so indexed lookup works identically to a stored column. - The column appears in
SHOW CREATE TABLE, can be projected bySELECT *, and can be referenced by name in other queries. VARCHAR(255)is a human decision — directly indexable by the B-tree substrate without aCASTorCOLLATEdance.
Shape B — Functional index (MySQL 8.0.13+)¶
ALTER TABLE activity_log ADD INDEX email ((
CAST(properties->>"$.request.email" AS CHAR(255))
COLLATE utf8mb4_bin
)) USING BTREE;
- Declared as a single
ALTER TABLE— no named column; the server creates a hidden virtual generated column and indexes that. The on-disk shape is identical to Shape A. - Two required wrappings:
CAST(... AS CHAR(255))—JSON_UNQUOTEreturnsLONGTEXT, which isn't directly indexable. Without the cast,CREATE INDEXfails with "Cannot create a functional index on an expression that returns a BLOB or TEXT."COLLATE utf8mb4_bin— without it,CAST ... AS CHARdefaults toutf8mb4_0900_ai_ciwhileJSON_UNQUOTEreturnsutf8mb4_bin, producing silent functional- index collation mismatch (index is built but never used — no error, no warning, just a full table scan).- Logical schema stays narrow — no extra column to document or explain.
Why the pattern works¶
Optimizer rewrites raw expressions¶
The load-bearing ergonomic property is optimizer index-expression rewrite: a query written against the raw JSON-extraction expression
is rewritten by the planner to use the indexed column
(Shape A: WHERE activity_log.email = ...; Shape B: against
the hidden generated column). Callers that don't know the
index exists — third-party ORMs, reporting tools, query
builders — still hit the index.
->> vs -> matters¶
The JSON extraction
operator must be ->> (with JSON_UNQUOTE), not ->
(raw JSON). The unquoted form returns a plain SQL string that
compares correctly with = literals. The non-unquoted form
returns a JSON string literal with wrapping double-quotes,
which won't match a plain-string predicate.
Worked example (verbatim from Aaron Francis)¶
CREATE TABLE activity_log (
id int(11) NOT NULL AUTO_INCREMENT,
properties json NOT NULL,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
Row shape: { "request": { "email": "little.bobby@tables.com",
"formType": "vehicle-inquiry", ... }, "uuid": "..." }.
Verify the extraction first before indexing:
Shape A: key_len: 768 (3 × 255 + 3, VARCHAR(255) in
utf8mb4_0900_ai_ci). Shape B: key_len: 1023
(4 × 255 + 3, CHAR(255) COLLATE utf8mb4_bin). SHOW
WARNINGS after an EXPLAIN of a raw-expression query
reveals the rewrite in the Note message.
(Source: sources/2026-04-21-planetscale-indexing-json-in-mysql.)
When to use¶
- ✅ Equality,
IN, prefixLIKE,ORDER BYon a small, fixed set of JSON paths — each a separate generated column + index. - ✅ Third-party SQL: the optimizer rewrite means callers don't need to know about the derived column.
- ✅ Schema-flexibility trade-off is already made: the
column is already
JSON; this is the minimum mechanism to make it queryable fast.
When not to use¶
- ❌ Arbitrary-path predicates. If queries filter on
$.path_1 = ? OR $.path_2 = ?with paths varying per query, one index per path doesn't scale. Move to Postgresjsonb+ GIN, MongoDB, or OpenSearch — substrates with true whole-document indexability. - ❌ JSON arrays needing per-element lookup. Use MySQL
8.0.17+ multi-valued indexes (e.g. on
"$.tags[*]") rather than a scalar-extracting generated column. - ❌ Unanchored
LIKE '%suffix%'. B-tree indexes only accelerate left-anchored prefix matches; wildcard-prefix predicates still table-scan. - ❌ Per-row write cost matters. Every index on a
generated column costs a
secondary-index write on
every
INSERT/UPDATEthat touches the source JSON. Pick the hottest-path keys, not every key.
Shape-choice heuristic¶
| Choose Shape A (named column) when... | Choose Shape B (functional index) when... |
|---|---|
Extracted value is useful in its own right (projected by SELECT, displayed, joined on). |
Value is only a planner fingerprint — nobody references it. |
You want SHOW CREATE TABLE to document the derivation. |
You want to keep the logical schema narrow. |
You want to avoid CAST + COLLATE — VARCHAR(255) just works. |
You accept the collation-mismatch dance as one-time setup cost. |
Aaron Francis's closing pragmatism: "Remember that functional indexes use hidden generated columns under the hood. If you prefer to take control of the generated column yourself (even in MySQL 8.0.13 and later) that's a perfectly reasonable approach!" In practice the named-column form is more common.
Relationship to the generated-hash-column pattern¶
This is a JSON-specific specialisation of
patterns/generated-hash-column-for-equality-lookup — same
mechanism (GENERATED ALWAYS AS (...) + ADD INDEX),
different extraction expression. The hash-column variant
compresses a long TEXT value to a 16-byte hash and indexes
the hash for equality lookup; the JSON-key variant extracts
a single scalar from a JSON document and indexes the
extracted value. Both ultimately depend on the same
primitive: an indexed deterministic expression over row
data, whose match against query predicates is handled by the
optimizer.
Aaron Francis's framing: "Don't just stop with JSON, either! You can use generated columns and functional indexes across all types of common, hard to index patterns."
Seen in¶
- sources/2026-04-21-planetscale-indexing-json-in-mysql
— canonical wiki introduction. Aaron Francis walks through
both Shape A and Shape B end-to-end on a form-submission
log, including the
CAST+COLLATEgotchas for Shape B andEXPLAIN+SHOW WARNINGSevidence of optimizer rewrite for both shapes.
Related¶
- concepts/mysql-json-data-type
- concepts/mysql-json-extraction-operator
- concepts/generated-column-mysql
- concepts/functional-index-mysql
- concepts/functional-index-collation-mismatch
- concepts/optimizer-index-expression-rewrite
- concepts/secondary-index
- concepts/json-column-as-schema-escape-hatch
- concepts/collation
- patterns/generated-hash-column-for-equality-lookup
- systems/mysql
- systems/innodb
- systems/planetscale