Skip to content

PLANETSCALE 2022-10-04 Tier 3

Read original ↗

PlanetScale — Indexing JSON in MySQL

Summary

Aaron Francis's pedagogical PlanetScale post (originally 2022-10-04, re-surfaced via the 2026-04-21 RSS snapshot) canonicalises how to get a usable B-tree index on a key inside a MySQL JSON column — a capability the JSON type (introduced in MySQL 5.7.8, mid-2015) exposes via extraction functions but not via any direct GIN-style index. MySQL offers no Generalised Inverted Index (GIN) equivalent to the one Postgres uses to index whole documents, so the pragmatic path is to index parts of a stored document by materialising the extracted value into an indexable shape. Two shapes, one mechanism: (1) a named generated column whose expression is the JSON extraction (properties->>"$.request.email"), then a standard ALTER TABLE ... ADD INDEX email(email) on it — available since MySQL 5.7; (2) a functional index (MySQL 8.0.13+) that inlines the same expression in double-parentheses ((expr)) with no named column. Load- bearing new wiki primitives: (a) the ->> JSON extraction operator — shorthand for JSON_UNQUOTE(JSON_EXTRACT(col, path)) — canonicalised as concepts/mysql-json-extraction-operator; (b) the optimizer's expression-rewrite / index-substitution capability — the planner "sees through" a query that uses the raw JSON extraction expression and rewrites it to reference the indexed generated column, so callers who don't know the generated column exists still get the index — a specific instance of a more general optimizer-index- expression-rewrite behaviour worth canonicalising on its own as concepts/optimizer-index-expression-rewrite; (c) the functional-index collation-mismatch gotchaJSON_UNQUOTE returns utf8mb4_bin but CAST(... AS CHAR(N)) defaults to utf8mb4_0900_ai_ci, so an otherwise-identical functional index silently fails to be used until the cast explicitly requests COLLATE utf8mb4_bin — canonicalised as concepts/functional-index-collation-mismatch. The specialised composition is the new index-json-key- via-generated-column pattern — a JSON-specific specialisation of the already-canonical patterns/generated-hash-column-for-equality-lookup schema-design shape. Aaron Francis's third canonical wiki ingest (after 2023-05-18 charset + collation and 2023-06-15 generated-hash-columns), consolidating his pedagogical-voice MySQL-reference corpus on the wiki.

Key takeaways

  1. MySQL has no GIN-style whole-document JSON index — index parts instead. "In other databases, the best way to directly index a JSON column is usually through a type of index known as a Generalized Inverted Index, or GIN for short. Since MySQL doesn't offer GIN indexes, we're unable to directly index an entire stored JSON document. All is not lost though, because MySQL does give us a way to indirectly index parts of our stored JSON documents." This is the load-bearing structural claim: the pattern works because some specific JSON path is the frequent filter predicate, not because you need whole-document indexability. If you genuinely need the latter, MySQL is the wrong substrate and you should switch to a document store (Postgres + GIN, MongoDB, OpenSearch).
  2. properties->>"$.request.email" is the canonical extraction expression. The ->> operator is a shorthand for JSON_UNQUOTE(JSON_EXTRACT(column, path)) — extract the value at a JSON path and strip the wrapping double-quotes from the returned JSON string so the result is a plain SQL string, not "email@example.com". Canonical ->> operator page. Without JSON_UNQUOTE the extraction returns the value as a JSON string literal with quotes, which compares correctly at the JSON level but doesn't match a plain-string WHERE email = '...' predicate.
  3. Generated-column path (MySQL 5.7+): one ALTER adds a typed column; a second ALTER adds an index. ALTER TABLE activity_log ADD COLUMN email VARCHAR(255) GENERATED ALWAYS AS (properties->>"$.request.email"); then ALTER TABLE activity_log ADD INDEX email(email) USING BTREE;. The developer chose VARCHAR(255) — a human decision — which is directly indexable by the B-tree substrate without special handling. concepts/generated-column-mysql canonical.
  4. Optimizer rewrites queries to use the indexed expression (named column path). Canonical verbatim: "Even though we didn't explicitly address the column by name, the optimizer understands that there is an index on a generated column based on that expression and opts to use the index." EXPLAIN SELECT * FROM activity_log WHERE properties->>"$.request.email" = '...'; reports possible_keys: email, key: email, key_len: 768. SHOW WARNINGS reveals the rewrite: "select activity_log.email ... where (activity_log.email = 'little.bobby@tables.com')" — the optimizer substitutes the indexed-column reference for the raw JSON-extraction expression. This is especially valuable "if you're unable to control the access pattern because the query is being issued from a 3rd party package in your codebase." Canonical optimizer index-expression rewrite page canonicalises this as a transferable property — any planner that matches expression signatures against indexed expressions can do this; examples beyond MySQL include Postgres expression-index matching and SQL Server persisted- computed-column matching.
  5. Functional-index path (MySQL 8.0.13+) carries two silent gotchas for JSON. Naive attempt ALTER TABLE activity_log ADD INDEX email ((properties->>"$.request.email")) USING BTREE; fails with a nasty error: "Cannot create a functional index on an expression that returns a BLOB or TEXT. Please consider using CAST." The error is loud and points to the fix — add a CAST to a bounded type. But the fix-with-CAST introduces a silent second gotcha canonicalised as functional-index collation mismatch: the cast to CHAR(255) defaults to utf8mb4_0900_ai_ci while JSON_UNQUOTE returns utf8mb4_bin; collation mismatch means the index isn't used (EXPLAIN shows type: ALL, key: NULL) — no error, just silently worse performance. Fix: COLLATE utf8mb4_bin inside the CAST. Canonical final syntax: ALTER TABLE activity_log ADD INDEX email ((CAST(properties->>"$.request.email" AS CHAR(255)) COLLATE utf8mb4_bin)) USING BTREE;.
  6. Functional indexes are implemented as hidden virtual generated columns. "A functional index is implemented using a hidden generated column! We no longer have to create the generated column, but a generated column is still being created." The on-disk shape is identical; the difference is purely ergonomic (syntactic). Because the hidden column is auto-generated by the server, the user doesn't get to pick its type — MySQL infers from the expression's return type, which is why JSON_UNQUOTE returning LONGTEXT forces the manual CAST step for JSON extraction.
  7. Named generated column wins on readability and optimizer access. The post closes with a pragmatic preference: "Remember that functional indexes use hidden generated columns under the hood. If you prefer to take control of the generated column yourself (even in MySQL 8.0.13 and later) that's a perfectly reasonable approach!" In practice the named-column form is more common because it appears in SHOW CREATE TABLE, can be projected by SELECT, and avoids the LONGTEXT/collation cast dance. Functional indexes win when the derived value has no use outside the index — you want to keep the logical schema narrow.
  8. Generalises beyond JSON. "Don't just stop with JSON, either! You can use generated columns and functional indexes across all types of common, hard to index patterns." The canonical generalisation on the wiki is patterns/generated-hash-column-for-equality-lookup (indexing long TEXT/BLOB values via a BINARY(16) MD5 hash); this post adds JSON extraction as the third canonical application (alongside hashing a long value and normalising a case-insensitive lookup).

Worked example (verbatim)

Schema and payload

CREATE TABLE activity_log (
  id int(11) NOT NULL AUTO_INCREMENT,
  properties json NOT NULL,
  created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
);

Row shape (one per user action):

{
  "uuid": "e7af5df8-f477-4b9b-b074-ad72fe17f502",
  "request": {
    "email": "little.bobby@tables.com",
    "firstName": "Little",
    "formType": "vehicle-inquiry",
    "lastName": "Bobby",
    "message": "Hello, can you tell me what the specs are for this vehicle?",
    "postcode": "75016",
    "townCity": "Dallas"
  }
}

Goal: fast WHERE email = '...' lookup on request.email.

Extraction — verify the expression first

SELECT properties->>"$.request.email" FROM activity_log;
-- → little.bobby@tables.com

-- Equivalent longhand:
SELECT JSON_UNQUOTE(JSON_EXTRACT(properties, "$.request.email"))
  FROM activity_log;
-- → little.bobby@tables.com

Path A — Named generated column (MySQL 5.7+)

ALTER TABLE activity_log ADD COLUMN email VARCHAR(255)
  GENERATED ALWAYS AS (properties->>"$.request.email");

ALTER TABLE activity_log ADD INDEX email (email) USING BTREE;

EXPLAIN on WHERE email = '...'key: email, key_len: 768. EXPLAIN on WHERE properties->>"$.request.email" = '...' (raw expression) → same plan; optimizer rewrites via expression rewrite. SHOW WARNINGS after the second EXPLAIN reveals the rewrite in the Note message — the WHERE clause has been replaced with activity_log.email = '...'.

Path B — Functional index (MySQL 8.0.13+)

Naive form fails:

ALTER TABLE activity_log ADD INDEX email
  ((properties->>"$.request.email")) USING BTREE;
-- ERROR: Cannot create a functional index on an expression
-- that returns a BLOB or TEXT. Please consider using CAST.

Add CAST — still doesn't work because of silent collation mismatch:

ALTER TABLE activity_log ADD INDEX email
  ((CAST(properties->>"$.request.email" AS CHAR(255))))
  USING BTREE;
-- No error, but EXPLAIN reports key: NULL, type: ALL

Final working syntax — explicit COLLATE utf8mb4_bin to match JSON_UNQUOTE's output collation:

ALTER TABLE activity_log ADD INDEX email ((
  CAST(properties->>"$.request.email" AS CHAR(255))
    COLLATE utf8mb4_bin
)) USING BTREE;
-- EXPLAIN reports key: email, key_len: 1023.

key_len at 1023 vs 768 on the named-column path reflects the difference between VARCHAR(255) in utf8mb4_0900_ai_ci (≈ 3 × 255 = 765, +3 = 768) and CHAR(255) COLLATE utf8mb4_bin (4 × 255 = 1020, +3 = 1023) — the functional-index path stores slightly wider keys.

Architectural significance

Part-indexing over whole-indexing is MySQL's JSON story

The canonical disclaimer: "While direct JSON indexing may not be available in MySQL, indirect indexing of specific keys can cover a majority of use cases." This is the load- bearing substrate-choice observation. If your workload filters consistently on a small, fixed set of JSON paths (e.g. just email + formType + postcode), indirect-indexing is sufficient; index one generated column per hot path. If your workload needs whole-document indexing — arbitrary predicates on arbitrary paths — MySQL is wrong and you should move to a substrate that supports it natively (Postgres jsonb + GIN, MongoDB, OpenSearch).

Optimizer rewrite is the ergonomic load-bearer

Most MySQL schema-design patterns that index a derived value are leaky in one direction: callers need to know about the derived column to hit the index. The optimizer's expression-rewrite capability closes that leak: callers can continue to write the natural JSON-extraction predicate (WHERE properties->>"$.request.email" = ?) and the planner substitutes the indexed-column reference silently. This is what makes the pattern viable in a codebase where application queries are generated by a third-party library — an ORM, reporting tool, or query builder — that has no awareness of the schema's generated columns. It's a concept worth canonicalising on its own: generalises beyond JSON to any expression-index case (hash columns, case-folded columns, coerced numerics, derived booleans).

Collation-mismatch is a canonical silent-failure mode

The functional-index collation gotcha is silent and unsignalled. No error, no warning — just an EXPLAIN output that claims the index doesn't apply and a query that table-scans. The diagnostic step is to notice type: ALL + key: NULL + possible_keys: NULL and realise the index isn't even a candidate. The fix is buried several layers deep in the MySQL documentation. Canonical enough failure mode to have its own concept page: concepts/functional-index-collation-mismatch. Generalises beyond JSON — any functional index whose expression returns an explicit collation but whose query predicate is cast to a different collation will hit the same silent-failure shape.

Caveats

  • No production numbers. Pedagogical post; no row counts, no before/after latency, no production deployment disclosure.
  • Pattern scope is equality and prefix. The generated-column-index shape supports =, IN, ORDER BY, and LIKE 'prefix%' on the extracted value just like any regular string column; it does not support $.path = ? AND $.other_path = ? over arbitrary paths without one generated column per path — which doesn't scale if the access pattern is truly unbounded.
  • Write-amplification tax implicit. Every index on a generated column costs a secondary-index write on every INSERT/UPDATE that touches the source JSON. Post doesn't quantify.
  • VIRTUAL vs STORED not discussed for this case. The named-column example uses the default VIRTUAL (implicit in GENERATED ALWAYS AS); for deeply-nested JSON paths over large documents the STORED form can amortise extraction cost to write time. Post doesn't surface the choice — rule of thumb from the sibling Francis post applies: "if it's expensive to calculate the value, store it."
  • Collation gotcha is a footgun, not guidance. The post walks through the failure-fix cycle but doesn't offer a rule like "always CAST ... COLLATE utf8mb4_bin when the underlying expression returns a JSON string." Canonical rule would be: match collation to the source expression's collation whenever it differs from the session default.
  • No multi-path indexing discussion. Composite indexes on multiple generated columns (e.g. hot-path filter on email + formType) not walked through — but follow directly from normal composite-index rules applied to the generated columns.
  • LIKE '%suffix' or LIKE '%mid%' not supported. The generated-column index is a left-anchored B-tree; unanchored LIKE still table-scans. No mitigation suggested (the Postgres-GIN / full-text-search approach isn't available via this mechanism).
  • Post-2022 MySQL evolution not re-audited. Post targets MySQL 8.0 at 2022 status; MySQL 8.4 and later may have added features (e.g. multi-valued indexes for JSON arrays, added in MySQL 8.0.17, actually predates the post — author chose not to cover them). Multi-valued indexes are the canonical path for indexing JSON arrays (e.g. "$.tags[*]") which this post's scalar- value approach doesn't support.
  • 2022-era post. Published 2022-10-04; re-surfaced via 2026-04-21 RSS feed. Three-and-a-half years of MySQL evolution and PlanetScale product changes elided.

Source

Last updated · 378 distilled / 1,213 read