Skip to content

PLANETSCALE 2023-05-18 Tier 3

Read original ↗

PlanetScale — Character sets and collations in MySQL

Summary

Short PlanetScale pedagogical post by Aaron Francis (originally published 2023-05-18, re-surfaced via the 2026-04-21 feed snapshot) canonicalising two fundamental but frequently-confused MySQL string-column settings: character sets (the set of legal characters for a column) and collations (the rules for comparing and sorting those characters). The post walks through information_schema.character_sets (41 charset rows with DEFAULT_COLLATE_NAME and MAXLEN byte-count per character), spotlights the canonical MySQL trap — the utf8 charset has MAXLEN=3 while utf8mb4 has MAXLEN=4, meaning MySQL's historical utf8 "was never actually UTF-8" (three bytes per character instead of the four-byte-maximum UTF-8 spec) and should no longer be used in MySQL 8 where utf8mb4 is the default — the collation naming scheme (utf8mb4_0900_ai_ci decomposes as charset + UCA-9.0.0 weight keys + accent-insensitive + case-insensitive; suffixes _ai / _as / _ci / _cs / _ks / _bin), and the inheritance hierarchy for both charset and collation: column > table > database > server default, with the most-specific level winning. The post demonstrates the hierarchy with worked CREATE TABLE examples that mix table-level CHARSET=latin1 with column-level CHARACTER SET latin1 overrides, and shows case-sensitivity as a direct consequence of collation choice via SELECT "MySQL" COLLATE utf8mb4_0900_ai_ci = "mysql" COLLATE utf8mb4_0900_ai_ci returning 1 (equal) vs SELECT "MySQL" COLLATE utf8mb4_0900_as_cs = "mysql" COLLATE utf8mb4_0900_as_cs returning 0 (unequal). Closing recommendation: when in doubt, the MySQL 8 defaults (utf8mb4 + utf8mb4_0900_ai_ci) are the right choice — full Unicode coverage, case-insensitive, accent-insensitive.

Key takeaways

  1. Character sets define legal characters; collations define comparison rules. Each character set has one default collation, and every collation belongs to exactly one character set. "A character set defines the characters allowed to go in a column. A collation is a set of rules for comparing those characters. Each character set can have multiple collations, but a collation may only belong to one character set." (Source: article §"Introduction".)
  2. MySQL's utf8 is a three-byte impostor; utf8mb4 is the real UTF-8. "According to the UTF-8 spec, each character is allowed four bytes, meaning MySQL's utf8 charset was never actually UTF-8 since it only supported three bytes per character. In MySQL 8, utf8mb4 is the default character set and the one you will use most often. utf8 is left for backwards compatibility and should no longer be used." The three-byte utf8 covers only the Unicode Basic Multilingual Plane (BMP) — which excludes all supplementary-plane code points including emoji, many CJK extension characters, and historical scripts. Attempting to store a 4-byte UTF-8 sequence in a utf8-typed column produces an Incorrect string value error or silent truncation depending on server mode. Canonical production gotcha.
  3. Collation suffixes encode comparison axes_ai accent-insensitive, _as accent-sensitive, _ci case-insensitive, _cs case-sensitive, _ks kana-sensitive (Japanese hiragana vs katakana), _bin binary byte-for-byte. MySQL 8's modern collation family is prefixed with UCA version keys like 0900 (Unicode Collation Algorithm 9.0.0) — the default collation for utf8mb4 is utf8mb4_0900_ai_ci (accent-insensitive + case-insensitive + UCA 9.0.0 weight keys). Accent insensitivity is what makes résumé = resume under the default.
  4. String comparison sensitivity is a collation property, not a SQL-standard property — the answer to "are string comparisons case-sensitive?" is "it depends on the collation." Demonstrated by the COLLATE cast operator:
    SELECT "MySQL" COLLATE utf8mb4_0900_ai_ci = "mysql" COLLATE utf8mb4_0900_ai_ci;  -- 1
    SELECT "MySQL" COLLATE utf8mb4_0900_as_cs = "mysql" COLLATE utf8mb4_0900_as_cs;  -- 0
    
    Same data, different answers, determined entirely by which collation governs the comparison.
  5. Column-level settings override table-level settings, which override database-level settings, which override server defaults. "A column-level specification will override a table-level specification, a table-level specification overrides the database default, and a database-level charset overrides the server default." Worked example: a table declared ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 with one column declared VARCHAR(255) CHARACTER SET latin1 produces a table whose implicitly_set column uses the table-level utf8mb4 but whose explicitly_set column uses the column-level latin1. Same hierarchy applies to collation: if a collation is not explicitly defined, MySQL uses the default collation of the character set at whichever level owns the charset declaration.
  6. SHOW CREATE TABLE reveals the resolved settings. Creating a table with no charset declaration and then running SHOW CREATE TABLE shows the server default has been applied: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci. This is the diagnostic primitive for "what charset/collation did I actually end up with?" — don't trust what you thought you declared; look at what the server resolved it to.
  7. Each character set has exactly one DEFAULT_COLLATE_NAME. utf8mb4's default is utf8mb4_0900_ai_ci in MySQL 8 (upgraded from the older utf8mb4_general_ci default in MySQL 5.7). Other character sets have language-named defaults (latin1_swedish_ci, big5_chinese_ci, gbk_chinese_ci, cp1250_general_ci) — reflecting the historical origin of each charset. These defaults are what you get if you specify the charset without specifying a collation.
  8. Safe-defaults recommendation: MySQL 8 defaults (utf8mb4 + utf8mb4_0900_ai_ci) are the right choice when in doubt — full Unicode coverage including emoji and supplementary planes, case-insensitive comparison (familiar to most application developers), and accent-insensitive comparison (useful for user-facing search). Explicit override only when requirements dictate (case-sensitive columns for code-like fields; binary collation for byte-exact comparison; language- specific collations for locale-accurate sort order).

Systems / concepts / patterns extracted

  • Systems: MySQL (the relational database whose charset + collation semantics the post canonicalises), InnoDB (the default storage engine — all SHOW CREATE TABLE outputs show ENGINE=InnoDB alongside the charset/collation; the storage engine honours whatever the parser resolved), PlanetScale (the publishing vendor; the post is a pedagogy piece without PlanetScale-specific machinery).
  • Concepts: character set (the set of legal characters for a string column, with MAXLEN byte-count per character), collation (the rules for comparing and sorting characters, with the _ai / _as / _ci / _cs / _ks / _bin suffix taxonomy), utf8mb4 vs utf8 (the canonical MySQL trap: utf8 is 3-byte and covers only the BMP; utf8mb4 is the real UTF-8; MySQL 8 fixed the default).
  • Patterns: None. The post is pedagogical reference material; the "most-specific level wins" inheritance rule is a schema-design convention documented in-page on the character-set + collation concepts rather than a reusable cross-system pattern.

Operational numbers

  • 41 character sets enumerated in information_schema.character_sets on MySQL 8. Notable rows: utf8mb4 (MAXLEN=4, default utf8mb4_0900_ai_ci); utf8 (MAXLEN=3, default utf8_general_ci — back-compat only); latin1 (MAXLEN=1, default latin1_swedish_ci); ascii (MAXLEN=1); binary (MAXLEN=1, no collation); four UTF variants for BMP-or-beyond storage (ucs2, utf16, utf16le, utf32).
  • Collation-suffix table: 6 canonical axes — _ai accent-insensitive, _as accent-sensitive, _ci case-insensitive, _cs case-sensitive, _ks kana-sensitive, _bin binary.
  • MySQL 8 default collation upgrade: utf8mb4_0900_ai_ci (MySQL 8.0+) vs utf8mb4_general_ci (MySQL 5.7). Upgrade paths require explicit collation handling for tables created under 5.7 that want 8.0 semantics — ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci re-sorts + rewrites.

Caveats

  • Pedagogy voice, not a production-retrospective or architecture-disclosure post — no incident narrative, no failure modes at scale, no PlanetScale-specific charset handling disclosed.
  • No performance numbers — the post doesn't quantify the overhead of charset conversion on CONVERT(... USING utf8mb4), buffer-pool cost of 4-byte vs 1-byte character columns, or index-size impact of wider charset choices.
  • Emoji / supplementary-plane coverage is asserted as the utf8mb4 selling point but not worked through with concrete examples (no INSERT ... VALUES ('💩') failure demo on a utf8 column).
  • Migration guidance is thin: the post asserts "utf8 should no longer be used" but doesn't specify how to migrate legacy utf8 tables safely — character-set changes are ALTER TABLE operations that can rewrite the entire table depending on implementation, which matters enormously at production scale (see the 2026-04-21 Announcing Vitess 21 disclosure that PlanetScale's Online DDL now does programmatic text conversion for charset changes rather than relying on CONVERT(... USING utf8mb4) for exactly this reason).
  • PostgreSQL comparison omitted — the post is MySQL-only; doesn't note that Postgres uses a fundamentally different model (database-level encoding
  • OS-collation or ICU-provided collation) where the MySQL column-level granularity doesn't exist.
  • UCA version history elidedutf8mb4_0900_ai_ci names UCA 9.0.0 without explaining what "weight keys" are, why UCA versions matter, or what changes between UCA versions.
  • binary(N) type vs binary charset distinction — the binary charset is described as a "pseudo charset" for byte-exact comparison; not clearly distinguished from the BINARY / VARBINARY column types which are orthogonal to the charset concept.
  • No TEXT / BLOB column discussion — all worked examples use VARCHAR(255); the article doesn't comment on whether TEXT columns (which are charset-aware) behave identically to VARCHAR.

Source

Last updated · 347 distilled / 1,201 read