PATTERN

Index JSON key via generated column¶

Problem¶

MySQL's JSON data type lets you store arbitrary documents in a single column, but the column itself cannot be indexed directly — MySQL has no Generalised Inverted Index (GIN) equivalent to Postgres's jsonb + GIN combination. Without help, any WHERE properties->>"$.request.email" = ? query table-scans.

When the workload filters consistently on a small, fixed set of JSON paths (e.g. email, formType, postcode on a form-submission log), the pragmatic answer is not to switch substrates but to indirectly index specific keys by materialising the extracted value into a B-tree index.

Solution¶

Two interchangeable shapes; pick the one that fits the schema-documentation style of the codebase.

Shape A — Named generated column (MySQL 5.7+)¶

ALTER TABLE activity_log
  ADD COLUMN email VARCHAR(255)
  GENERATED ALWAYS AS (properties->>"$.request.email");

ALTER TABLE activity_log
  ADD INDEX email (email) USING BTREE;

The generated column is VIRTUAL by default — zero extra bytes in the row, expression evaluated on read. Index leaves store the computed value, so indexed lookup works identically to a stored column.
The column appears in SHOW CREATE TABLE, can be projected by SELECT *, and can be referenced by name in other queries.
VARCHAR(255) is a human decision — directly indexable by the B-tree substrate without a CAST or COLLATE dance.

Shape B — Functional index (MySQL 8.0.13+)¶

ALTER TABLE activity_log ADD INDEX email ((
  CAST(properties->>"$.request.email" AS CHAR(255))
    COLLATE utf8mb4_bin
)) USING BTREE;

Declared as a single ALTER TABLE — no named column; the server creates a hidden virtual generated column and indexes that. The on-disk shape is identical to Shape A.
Two required wrappings:
CAST(... AS CHAR(255)) — JSON_UNQUOTE returns LONGTEXT, which isn't directly indexable. Without the cast, CREATE INDEX fails with "Cannot create a functional index on an expression that returns a BLOB or TEXT."
COLLATE utf8mb4_bin — without it, CAST ... AS CHAR defaults to utf8mb4_0900_ai_ci while JSON_UNQUOTE returns utf8mb4_bin, producing silent functional- index collation mismatch (index is built but never used — no error, no warning, just a full table scan).
Logical schema stays narrow — no extra column to document or explain.

Why the pattern works¶

Optimizer rewrites raw expressions¶

The load-bearing ergonomic property is optimizer index-expression rewrite: a query written against the raw JSON-extraction expression

SELECT * FROM activity_log
  WHERE properties->>"$.request.email" = 'x@example.com';

is rewritten by the planner to use the indexed column (Shape A: WHERE activity_log.email = ...; Shape B: against the hidden generated column). Callers that don't know the index exists — third-party ORMs, reporting tools, query builders — still hit the index.

`->>` vs `->` matters¶

The JSON extraction operator must be ->> (with JSON_UNQUOTE), not -> (raw JSON). The unquoted form returns a plain SQL string that compares correctly with = literals. The non-unquoted form returns a JSON string literal with wrapping double-quotes, which won't match a plain-string predicate.

Worked example (verbatim from Aaron Francis)¶

CREATE TABLE activity_log (
  id int(11) NOT NULL AUTO_INCREMENT,
  properties json NOT NULL,
  created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
);

Row shape: { "request": { "email": "little.bobby@tables.com", "formType": "vehicle-inquiry", ... }, "uuid": "..." }.

Verify the extraction first before indexing:

SELECT properties->>"$.request.email" FROM activity_log;
-- → little.bobby@tables.com

Shape A: key_len: 768 (3 × 255 + 3, VARCHAR(255) in utf8mb4_0900_ai_ci). Shape B: key_len: 1023 (4 × 255 + 3, CHAR(255) COLLATE utf8mb4_bin). SHOW WARNINGS after an EXPLAIN of a raw-expression query reveals the rewrite in the Note message.

(Source: .)

When to use¶

✅ Equality, IN, prefix LIKE, ORDER BY on a small, fixed set of JSON paths — each a separate generated column + index.
✅ Third-party SQL: the optimizer rewrite means callers don't need to know about the derived column.
✅ Schema-flexibility trade-off is already made: the column is already JSON; this is the minimum mechanism to make it queryable fast.

When not to use¶

❌ Arbitrary-path predicates. If queries filter on $.path_1 = ? OR $.path_2 = ? with paths varying per query, one index per path doesn't scale. Move to Postgres jsonb + GIN, MongoDB, or OpenSearch — substrates with true whole-document indexability.
❌ JSON arrays needing per-element lookup. Use MySQL 8.0.17+ multi-valued indexes (e.g. on "$.tags[*]") rather than a scalar-extracting generated column.
❌ Unanchored LIKE '%suffix%'. B-tree indexes only accelerate left-anchored prefix matches; wildcard-prefix predicates still table-scan.
❌ Per-row write cost matters. Every index on a generated column costs a secondary-index write on every INSERT/UPDATE that touches the source JSON. Pick the hottest-path keys, not every key.

Shape-choice heuristic¶

Choose Shape A (named column) when...	Choose Shape B (functional index) when...
Extracted value is useful in its own right (projected by `SELECT`, displayed, joined on).	Value is only a planner fingerprint — nobody references it.
You want `SHOW CREATE TABLE` to document the derivation.	You want to keep the logical schema narrow.
You want to avoid `CAST` + `COLLATE` — `VARCHAR(255)` just works.	You accept the collation-mismatch dance as one-time setup cost.

Aaron Francis's closing pragmatism: "Remember that functional indexes use hidden generated columns under the hood. If you prefer to take control of the generated column yourself (even in MySQL 8.0.13 and later) that's a perfectly reasonable approach!" In practice the named-column form is more common.

Relationship to the generated-hash-column pattern¶

This is a JSON-specific specialisation of patterns/generated-hash-column-for-equality-lookup — same mechanism (GENERATED ALWAYS AS (...) + ADD INDEX), different extraction expression. The hash-column variant compresses a long TEXT value to a 16-byte hash and indexes the hash for equality lookup; the JSON-key variant extracts a single scalar from a JSON document and indexes the extracted value. Both ultimately depend on the same primitive: an indexed deterministic expression over row data, whose match against query predicates is handled by the optimizer.

Aaron Francis's framing: "Don't just stop with JSON, either! You can use generated columns and functional indexes across all types of common, hard to index patterns."

Seen in¶

— canonical wiki introduction. Aaron Francis walks through both Shape A and Shape B end-to-end on a form-submission log, including the CAST + COLLATE gotchas for Shape B and EXPLAIN + SHOW WARNINGS evidence of optimizer rewrite for both shapes.