PlanetScale — Indexing JSON in MySQL¶
Summary¶
Aaron Francis's pedagogical PlanetScale post (originally
2022-10-04, re-surfaced via the 2026-04-21 RSS snapshot)
canonicalises how to get a usable B-tree index on a key
inside a MySQL JSON column — a capability the JSON type
(introduced in MySQL 5.7.8, mid-2015) exposes via extraction
functions but not via any direct GIN-style index. MySQL
offers no Generalised Inverted Index
(GIN) equivalent to the one Postgres uses to index whole
documents, so the pragmatic path is to index parts of a
stored document by materialising the extracted value into an
indexable shape. Two shapes, one mechanism: (1) a named
generated column whose
expression is the JSON extraction
(properties->>"$.request.email"), then a standard
ALTER TABLE ... ADD INDEX email(email) on it — available
since MySQL 5.7; (2) a
functional index
(MySQL 8.0.13+) that inlines the same expression in
double-parentheses ((expr)) with no named column. Load-
bearing new wiki primitives: (a) the ->> JSON extraction
operator — shorthand for JSON_UNQUOTE(JSON_EXTRACT(col,
path)) — canonicalised as concepts/mysql-json-extraction-operator;
(b) the optimizer's expression-rewrite / index-substitution
capability — the planner "sees through" a query that uses
the raw JSON extraction expression and rewrites it to
reference the indexed generated column, so callers who don't
know the generated column exists still get the index — a
specific instance of a more general optimizer-index-
expression-rewrite behaviour worth canonicalising on its own
as concepts/optimizer-index-expression-rewrite; (c) the
functional-index collation-mismatch gotcha — JSON_UNQUOTE
returns utf8mb4_bin but CAST(... AS CHAR(N)) defaults to
utf8mb4_0900_ai_ci, so an otherwise-identical functional
index silently fails to be used until the cast explicitly
requests COLLATE utf8mb4_bin — canonicalised as
concepts/functional-index-collation-mismatch. The
specialised composition is the new
index-json-key-
via-generated-column pattern — a JSON-specific
specialisation of the already-canonical
patterns/generated-hash-column-for-equality-lookup
schema-design shape. Aaron Francis's third canonical wiki
ingest (after 2023-05-18 charset + collation and 2023-06-15
generated-hash-columns), consolidating his pedagogical-voice
MySQL-reference corpus on the wiki.
Key takeaways¶
- MySQL has no GIN-style whole-document JSON index — index parts instead. "In other databases, the best way to directly index a JSON column is usually through a type of index known as a Generalized Inverted Index, or GIN for short. Since MySQL doesn't offer GIN indexes, we're unable to directly index an entire stored JSON document. All is not lost though, because MySQL does give us a way to indirectly index parts of our stored JSON documents." This is the load-bearing structural claim: the pattern works because some specific JSON path is the frequent filter predicate, not because you need whole-document indexability. If you genuinely need the latter, MySQL is the wrong substrate and you should switch to a document store (Postgres + GIN, MongoDB, OpenSearch).
properties->>"$.request.email"is the canonical extraction expression. The->>operator is a shorthand forJSON_UNQUOTE(JSON_EXTRACT(column, path))— extract the value at a JSON path and strip the wrapping double-quotes from the returned JSON string so the result is a plain SQL string, not"email@example.com". Canonical->>operator page. WithoutJSON_UNQUOTEthe extraction returns the value as a JSON string literal with quotes, which compares correctly at the JSON level but doesn't match a plain-stringWHERE email = '...'predicate.- Generated-column path (MySQL 5.7+): one
ALTERadds a typed column; a secondALTERadds an index.ALTER TABLE activity_log ADD COLUMN email VARCHAR(255) GENERATED ALWAYS AS (properties->>"$.request.email");thenALTER TABLE activity_log ADD INDEX email(email) USING BTREE;. The developer choseVARCHAR(255)— a human decision — which is directly indexable by the B-tree substrate without special handling. concepts/generated-column-mysql canonical. - Optimizer rewrites queries to use the indexed
expression (named column path). Canonical verbatim:
"Even though we didn't explicitly address the column by
name, the optimizer understands that there is an index on
a generated column based on that expression and opts to
use the index."
EXPLAIN SELECT * FROM activity_log WHERE properties->>"$.request.email" = '...';reportspossible_keys: email,key: email,key_len: 768.SHOW WARNINGSreveals the rewrite: "selectactivity_log.email... where (activity_log.email= 'little.bobby@tables.com')" — the optimizer substitutes the indexed-column reference for the raw JSON-extraction expression. This is especially valuable "if you're unable to control the access pattern because the query is being issued from a 3rd party package in your codebase." Canonical optimizer index-expression rewrite page canonicalises this as a transferable property — any planner that matches expression signatures against indexed expressions can do this; examples beyond MySQL include Postgres expression-index matching and SQL Server persisted- computed-column matching. - Functional-index path (MySQL 8.0.13+) carries two
silent gotchas for JSON. Naive attempt
ALTER TABLE activity_log ADD INDEX email ((properties->>"$.request.email")) USING BTREE;fails with a nasty error: "Cannot create a functional index on an expression that returns a BLOB or TEXT. Please consider using CAST." The error is loud and points to the fix — add aCASTto a bounded type. But the fix-with-CAST introduces a silent second gotcha canonicalised as functional-index collation mismatch: the cast toCHAR(255)defaults toutf8mb4_0900_ai_ciwhileJSON_UNQUOTEreturnsutf8mb4_bin; collation mismatch means the index isn't used (EXPLAINshowstype: ALL,key: NULL) — no error, just silently worse performance. Fix:COLLATE utf8mb4_bininside theCAST. Canonical final syntax:ALTER TABLE activity_log ADD INDEX email ((CAST(properties->>"$.request.email" AS CHAR(255)) COLLATE utf8mb4_bin)) USING BTREE;. - Functional indexes are implemented as hidden virtual
generated columns. "A functional index is implemented
using a hidden generated column! We no longer have to
create the generated column, but a generated column is
still being created." The on-disk shape is identical;
the difference is purely ergonomic (syntactic). Because
the hidden column is auto-generated by the server, the
user doesn't get to pick its type — MySQL infers from
the expression's return type, which is why
JSON_UNQUOTEreturningLONGTEXTforces the manualCASTstep for JSON extraction. - Named generated column wins on readability and
optimizer access. The post closes with a pragmatic
preference: "Remember that functional indexes use hidden
generated columns under the hood. If you prefer to take
control of the generated column yourself (even in MySQL
8.0.13 and later) that's a perfectly reasonable
approach!" In practice the named-column form is more
common because it appears in
SHOW CREATE TABLE, can be projected bySELECT, and avoids theLONGTEXT/collation cast dance. Functional indexes win when the derived value has no use outside the index — you want to keep the logical schema narrow. - Generalises beyond JSON. "Don't just stop with
JSON, either! You can use generated columns and
functional indexes across all types of common, hard to
index patterns." The canonical generalisation on the
wiki is patterns/generated-hash-column-for-equality-lookup
(indexing long
TEXT/BLOBvalues via aBINARY(16)MD5 hash); this post adds JSON extraction as the third canonical application (alongside hashing a long value and normalising a case-insensitive lookup).
Worked example (verbatim)¶
Schema and payload¶
CREATE TABLE activity_log (
id int(11) NOT NULL AUTO_INCREMENT,
properties json NOT NULL,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
Row shape (one per user action):
{
"uuid": "e7af5df8-f477-4b9b-b074-ad72fe17f502",
"request": {
"email": "little.bobby@tables.com",
"firstName": "Little",
"formType": "vehicle-inquiry",
"lastName": "Bobby",
"message": "Hello, can you tell me what the specs are for this vehicle?",
"postcode": "75016",
"townCity": "Dallas"
}
}
Goal: fast WHERE email = '...' lookup on request.email.
Extraction — verify the expression first¶
SELECT properties->>"$.request.email" FROM activity_log;
-- → little.bobby@tables.com
-- Equivalent longhand:
SELECT JSON_UNQUOTE(JSON_EXTRACT(properties, "$.request.email"))
FROM activity_log;
-- → little.bobby@tables.com
Path A — Named generated column (MySQL 5.7+)¶
ALTER TABLE activity_log ADD COLUMN email VARCHAR(255)
GENERATED ALWAYS AS (properties->>"$.request.email");
ALTER TABLE activity_log ADD INDEX email (email) USING BTREE;
EXPLAIN on WHERE email = '...' → key: email,
key_len: 768. EXPLAIN on WHERE
properties->>"$.request.email" = '...' (raw expression) →
same plan; optimizer rewrites via
expression
rewrite. SHOW WARNINGS after the second EXPLAIN
reveals the rewrite in the Note message — the WHERE
clause has been replaced with activity_log.email = '...'.
Path B — Functional index (MySQL 8.0.13+)¶
Naive form fails:
ALTER TABLE activity_log ADD INDEX email
((properties->>"$.request.email")) USING BTREE;
-- ERROR: Cannot create a functional index on an expression
-- that returns a BLOB or TEXT. Please consider using CAST.
Add CAST — still doesn't work because of silent collation
mismatch:
ALTER TABLE activity_log ADD INDEX email
((CAST(properties->>"$.request.email" AS CHAR(255))))
USING BTREE;
-- No error, but EXPLAIN reports key: NULL, type: ALL
Final working syntax — explicit COLLATE utf8mb4_bin to
match JSON_UNQUOTE's output collation:
ALTER TABLE activity_log ADD INDEX email ((
CAST(properties->>"$.request.email" AS CHAR(255))
COLLATE utf8mb4_bin
)) USING BTREE;
-- EXPLAIN reports key: email, key_len: 1023.
key_len at 1023 vs 768 on the named-column path reflects
the difference between VARCHAR(255) in
utf8mb4_0900_ai_ci (≈ 3 × 255 = 765, +3 = 768) and
CHAR(255) COLLATE utf8mb4_bin (4 × 255 = 1020, +3 = 1023)
— the functional-index path stores slightly wider keys.
Architectural significance¶
Part-indexing over whole-indexing is MySQL's JSON story¶
The canonical disclaimer: "While direct JSON indexing may
not be available in MySQL, indirect indexing of specific
keys can cover a majority of use cases." This is the load-
bearing substrate-choice observation. If your workload filters
consistently on a small, fixed set of JSON paths (e.g. just
email + formType + postcode), indirect-indexing is
sufficient; index one generated column per hot path. If your
workload needs whole-document indexing — arbitrary
predicates on arbitrary paths — MySQL is wrong and you
should move to a substrate that supports it natively
(Postgres jsonb + GIN, MongoDB,
OpenSearch).
Optimizer rewrite is the ergonomic load-bearer¶
Most MySQL schema-design patterns that index a derived value
are leaky in one direction: callers need to know about
the derived column to hit the index. The optimizer's
expression-rewrite capability closes that leak: callers can
continue to write the natural JSON-extraction predicate
(WHERE properties->>"$.request.email" = ?) and the planner
substitutes the indexed-column reference silently. This is
what makes the pattern viable in a codebase where application
queries are generated by a third-party library — an ORM,
reporting tool, or query builder — that has no awareness of
the schema's generated columns. It's a concept worth
canonicalising on its own: generalises beyond JSON to any
expression-index case (hash columns, case-folded columns,
coerced numerics, derived booleans).
Collation-mismatch is a canonical silent-failure mode¶
The functional-index collation gotcha is silent and
unsignalled. No error, no warning — just an EXPLAIN
output that claims the index doesn't apply and a query that
table-scans. The diagnostic step is to notice
type: ALL + key: NULL + possible_keys: NULL and
realise the index isn't even a candidate. The fix is buried
several layers deep in the MySQL documentation. Canonical
enough failure mode to have its own concept page:
concepts/functional-index-collation-mismatch.
Generalises beyond JSON — any functional index whose
expression returns an explicit collation but whose query
predicate is cast to a different collation will hit the same
silent-failure shape.
Caveats¶
- No production numbers. Pedagogical post; no row counts, no before/after latency, no production deployment disclosure.
- Pattern scope is equality and prefix. The
generated-column-index shape supports
=,IN,ORDER BY, andLIKE 'prefix%'on the extracted value just like any regular string column; it does not support$.path = ? AND $.other_path = ?over arbitrary paths without one generated column per path — which doesn't scale if the access pattern is truly unbounded. - Write-amplification tax implicit. Every index on a
generated column costs a
secondary-index write on
every
INSERT/UPDATEthat touches the source JSON. Post doesn't quantify. VIRTUALvsSTOREDnot discussed for this case. The named-column example uses the defaultVIRTUAL(implicit inGENERATED ALWAYS AS); for deeply-nested JSON paths over large documents theSTOREDform can amortise extraction cost to write time. Post doesn't surface the choice — rule of thumb from the sibling Francis post applies: "if it's expensive to calculate the value, store it."- Collation gotcha is a footgun, not guidance. The
post walks through the failure-fix cycle but doesn't
offer a rule like "always
CAST ... COLLATE utf8mb4_binwhen the underlying expression returns a JSON string." Canonical rule would be: match collation to the source expression's collation whenever it differs from the session default. - No multi-path indexing discussion. Composite indexes
on multiple generated columns (e.g. hot-path filter on
email+formType) not walked through — but follow directly from normal composite-index rules applied to the generated columns. LIKE '%suffix'orLIKE '%mid%'not supported. The generated-column index is a left-anchored B-tree; unanchoredLIKEstill table-scans. No mitigation suggested (the Postgres-GIN / full-text-search approach isn't available via this mechanism).- Post-2022 MySQL evolution not re-audited. Post
targets MySQL 8.0 at 2022 status; MySQL 8.4 and later may
have added features (e.g. multi-valued indexes for JSON
arrays, added in MySQL 8.0.17, actually predates the
post — author chose not to cover them). Multi-valued
indexes are the canonical path for indexing JSON
arrays (e.g.
"$.tags[*]") which this post's scalar- value approach doesn't support. - 2022-era post. Published 2022-10-04; re-surfaced via 2026-04-21 RSS feed. Three-and-a-half years of MySQL evolution and PlanetScale product changes elided.
Source¶
- Original: https://planetscale.com/blog/indexing-json-in-mysql
- Raw markdown:
raw/planetscale/2026-04-21-indexing-json-in-mysql-815a75f5.md
Related¶
- systems/mysql
- systems/innodb
- systems/planetscale
- concepts/mysql-json-extraction-operator
- concepts/optimizer-index-expression-rewrite
- concepts/functional-index-collation-mismatch
- concepts/generated-column-mysql
- concepts/functional-index-mysql
- concepts/secondary-index
- concepts/collation
- concepts/utf8mb4-vs-utf8
- patterns/index-json-key-via-generated-column
- patterns/generated-hash-column-for-equality-lookup
- companies/planetscale