Skip to content

PATTERN Cited by 1 source

Index JSON key via generated column

Problem

MySQL's JSON data type lets you store arbitrary documents in a single column, but the column itself cannot be indexed directly — MySQL has no Generalised Inverted Index (GIN) equivalent to Postgres's jsonb + GIN combination. Without help, any WHERE properties->>"$.request.email" = ? query table-scans.

When the workload filters consistently on a small, fixed set of JSON paths (e.g. email, formType, postcode on a form-submission log), the pragmatic answer is not to switch substrates but to indirectly index specific keys by materialising the extracted value into a B-tree index.

Solution

Two interchangeable shapes; pick the one that fits the schema-documentation style of the codebase.

Shape A — Named generated column (MySQL 5.7+)

ALTER TABLE activity_log
  ADD COLUMN email VARCHAR(255)
  GENERATED ALWAYS AS (properties->>"$.request.email");

ALTER TABLE activity_log
  ADD INDEX email (email) USING BTREE;
  • The generated column is VIRTUAL by default — zero extra bytes in the row, expression evaluated on read. Index leaves store the computed value, so indexed lookup works identically to a stored column.
  • The column appears in SHOW CREATE TABLE, can be projected by SELECT *, and can be referenced by name in other queries.
  • VARCHAR(255) is a human decision — directly indexable by the B-tree substrate without a CAST or COLLATE dance.

Shape B — Functional index (MySQL 8.0.13+)

ALTER TABLE activity_log ADD INDEX email ((
  CAST(properties->>"$.request.email" AS CHAR(255))
    COLLATE utf8mb4_bin
)) USING BTREE;
  • Declared as a single ALTER TABLE — no named column; the server creates a hidden virtual generated column and indexes that. The on-disk shape is identical to Shape A.
  • Two required wrappings:
  • CAST(... AS CHAR(255))JSON_UNQUOTE returns LONGTEXT, which isn't directly indexable. Without the cast, CREATE INDEX fails with "Cannot create a functional index on an expression that returns a BLOB or TEXT."
  • COLLATE utf8mb4_bin — without it, CAST ... AS CHAR defaults to utf8mb4_0900_ai_ci while JSON_UNQUOTE returns utf8mb4_bin, producing silent functional- index collation mismatch (index is built but never used — no error, no warning, just a full table scan).
  • Logical schema stays narrow — no extra column to document or explain.

Why the pattern works

Optimizer rewrites raw expressions

The load-bearing ergonomic property is optimizer index-expression rewrite: a query written against the raw JSON-extraction expression

SELECT * FROM activity_log
  WHERE properties->>"$.request.email" = 'x@example.com';

is rewritten by the planner to use the indexed column (Shape A: WHERE activity_log.email = ...; Shape B: against the hidden generated column). Callers that don't know the index exists — third-party ORMs, reporting tools, query builders — still hit the index.

->> vs -> matters

The JSON extraction operator must be ->> (with JSON_UNQUOTE), not -> (raw JSON). The unquoted form returns a plain SQL string that compares correctly with = literals. The non-unquoted form returns a JSON string literal with wrapping double-quotes, which won't match a plain-string predicate.

Worked example (verbatim from Aaron Francis)

CREATE TABLE activity_log (
  id int(11) NOT NULL AUTO_INCREMENT,
  properties json NOT NULL,
  created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
);

Row shape: { "request": { "email": "little.bobby@tables.com", "formType": "vehicle-inquiry", ... }, "uuid": "..." }.

Verify the extraction first before indexing:

SELECT properties->>"$.request.email" FROM activity_log;
-- → little.bobby@tables.com

Shape A: key_len: 768 (3 × 255 + 3, VARCHAR(255) in utf8mb4_0900_ai_ci). Shape B: key_len: 1023 (4 × 255 + 3, CHAR(255) COLLATE utf8mb4_bin). SHOW WARNINGS after an EXPLAIN of a raw-expression query reveals the rewrite in the Note message.

(Source: sources/2026-04-21-planetscale-indexing-json-in-mysql.)

When to use

  • Equality, IN, prefix LIKE, ORDER BY on a small, fixed set of JSON paths — each a separate generated column + index.
  • Third-party SQL: the optimizer rewrite means callers don't need to know about the derived column.
  • Schema-flexibility trade-off is already made: the column is already JSON; this is the minimum mechanism to make it queryable fast.

When not to use

  • Arbitrary-path predicates. If queries filter on $.path_1 = ? OR $.path_2 = ? with paths varying per query, one index per path doesn't scale. Move to Postgres jsonb + GIN, MongoDB, or OpenSearch — substrates with true whole-document indexability.
  • JSON arrays needing per-element lookup. Use MySQL 8.0.17+ multi-valued indexes (e.g. on "$.tags[*]") rather than a scalar-extracting generated column.
  • Unanchored LIKE '%suffix%'. B-tree indexes only accelerate left-anchored prefix matches; wildcard-prefix predicates still table-scan.
  • Per-row write cost matters. Every index on a generated column costs a secondary-index write on every INSERT/UPDATE that touches the source JSON. Pick the hottest-path keys, not every key.

Shape-choice heuristic

Choose Shape A (named column) when... Choose Shape B (functional index) when...
Extracted value is useful in its own right (projected by SELECT, displayed, joined on). Value is only a planner fingerprint — nobody references it.
You want SHOW CREATE TABLE to document the derivation. You want to keep the logical schema narrow.
You want to avoid CAST + COLLATEVARCHAR(255) just works. You accept the collation-mismatch dance as one-time setup cost.

Aaron Francis's closing pragmatism: "Remember that functional indexes use hidden generated columns under the hood. If you prefer to take control of the generated column yourself (even in MySQL 8.0.13 and later) that's a perfectly reasonable approach!" In practice the named-column form is more common.

Relationship to the generated-hash-column pattern

This is a JSON-specific specialisation of patterns/generated-hash-column-for-equality-lookup — same mechanism (GENERATED ALWAYS AS (...) + ADD INDEX), different extraction expression. The hash-column variant compresses a long TEXT value to a 16-byte hash and indexes the hash for equality lookup; the JSON-key variant extracts a single scalar from a JSON document and indexes the extracted value. Both ultimately depend on the same primitive: an indexed deterministic expression over row data, whose match against query predicates is handled by the optimizer.

Aaron Francis's framing: "Don't just stop with JSON, either! You can use generated columns and functional indexes across all types of common, hard to index patterns."

Seen in

  • sources/2026-04-21-planetscale-indexing-json-in-mysql — canonical wiki introduction. Aaron Francis walks through both Shape A and Shape B end-to-end on a form-submission log, including the CAST + COLLATE gotchas for Shape B and EXPLAIN + SHOW WARNINGS evidence of optimizer rewrite for both shapes.
Last updated · 550 distilled / 1,221 read