PlanetScale — Character sets and collations in MySQL¶
Summary¶
Short PlanetScale pedagogical post by Aaron Francis
(originally published 2023-05-18, re-surfaced via the
2026-04-21 feed snapshot) canonicalising two fundamental
but frequently-confused MySQL string-column
settings: character sets (the set of legal characters
for a column) and collations (the rules for comparing
and sorting those characters). The post walks through
information_schema.character_sets (41 charset rows with
DEFAULT_COLLATE_NAME and MAXLEN byte-count per
character), spotlights the canonical MySQL trap — the
utf8 charset has MAXLEN=3 while utf8mb4 has
MAXLEN=4, meaning MySQL's historical utf8 "was never
actually UTF-8" (three bytes per character instead of the
four-byte-maximum UTF-8 spec) and should no longer be used
in MySQL 8 where utf8mb4 is the default — the collation
naming scheme (utf8mb4_0900_ai_ci decomposes as
charset + UCA-9.0.0 weight keys + accent-insensitive +
case-insensitive; suffixes _ai / _as / _ci / _cs /
_ks / _bin), and the inheritance hierarchy for
both charset and collation: column > table > database >
server default, with the most-specific level winning. The
post demonstrates the hierarchy with worked CREATE TABLE
examples that mix table-level CHARSET=latin1 with
column-level CHARACTER SET latin1 overrides, and shows
case-sensitivity as a direct consequence of collation
choice via SELECT "MySQL" COLLATE utf8mb4_0900_ai_ci =
"mysql" COLLATE utf8mb4_0900_ai_ci returning 1 (equal)
vs SELECT "MySQL" COLLATE utf8mb4_0900_as_cs = "mysql"
COLLATE utf8mb4_0900_as_cs returning 0 (unequal). Closing
recommendation: when in doubt, the MySQL 8 defaults
(utf8mb4 + utf8mb4_0900_ai_ci) are the right choice —
full Unicode coverage, case-insensitive, accent-insensitive.
Key takeaways¶
- Character sets define legal characters; collations define comparison rules. Each character set has one default collation, and every collation belongs to exactly one character set. "A character set defines the characters allowed to go in a column. A collation is a set of rules for comparing those characters. Each character set can have multiple collations, but a collation may only belong to one character set." (Source: article §"Introduction".)
- MySQL's
utf8is a three-byte impostor;utf8mb4is the real UTF-8. "According to the UTF-8 spec, each character is allowed four bytes, meaning MySQL'sutf8charset was never actually UTF-8 since it only supported three bytes per character. In MySQL 8,utf8mb4is the default character set and the one you will use most often.utf8is left for backwards compatibility and should no longer be used." The three-byteutf8covers only the Unicode Basic Multilingual Plane (BMP) — which excludes all supplementary-plane code points including emoji, many CJK extension characters, and historical scripts. Attempting to store a 4-byte UTF-8 sequence in autf8-typed column produces anIncorrect string valueerror or silent truncation depending on server mode. Canonical production gotcha. - Collation suffixes encode comparison axes —
_aiaccent-insensitive,_asaccent-sensitive,_cicase-insensitive,_cscase-sensitive,_kskana-sensitive (Japanese hiragana vs katakana),_binbinary byte-for-byte. MySQL 8's modern collation family is prefixed with UCA version keys like0900(Unicode Collation Algorithm 9.0.0) — the default collation forutf8mb4isutf8mb4_0900_ai_ci(accent-insensitive + case-insensitive + UCA 9.0.0 weight keys). Accent insensitivity is what makesrésumé = resumeunder the default. - String comparison sensitivity is a collation
property, not a SQL-standard property — the answer to
"are string comparisons case-sensitive?" is "it
depends on the collation." Demonstrated by the
COLLATEcast operator: Same data, different answers, determined entirely by which collation governs the comparison. - Column-level settings override table-level settings,
which override database-level settings, which override
server defaults. "A column-level specification will
override a table-level specification, a table-level
specification overrides the database default, and a
database-level charset overrides the server default."
Worked example: a table declared
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4with one column declaredVARCHAR(255) CHARACTER SET latin1produces a table whoseimplicitly_setcolumn uses the table-levelutf8mb4but whoseexplicitly_setcolumn uses the column-levellatin1. Same hierarchy applies to collation: if a collation is not explicitly defined, MySQL uses the default collation of the character set at whichever level owns the charset declaration. SHOW CREATE TABLEreveals the resolved settings. Creating a table with no charset declaration and then runningSHOW CREATE TABLEshows the server default has been applied:ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci. This is the diagnostic primitive for "what charset/collation did I actually end up with?" — don't trust what you thought you declared; look at what the server resolved it to.- Each character set has exactly one
DEFAULT_COLLATE_NAME.utf8mb4's default isutf8mb4_0900_ai_ciin MySQL 8 (upgraded from the olderutf8mb4_general_cidefault in MySQL 5.7). Other character sets have language-named defaults (latin1_swedish_ci,big5_chinese_ci,gbk_chinese_ci,cp1250_general_ci) — reflecting the historical origin of each charset. These defaults are what you get if you specify the charset without specifying a collation. - Safe-defaults recommendation: MySQL 8 defaults
(
utf8mb4+utf8mb4_0900_ai_ci) are the right choice when in doubt — full Unicode coverage including emoji and supplementary planes, case-insensitive comparison (familiar to most application developers), and accent-insensitive comparison (useful for user-facing search). Explicit override only when requirements dictate (case-sensitive columns for code-like fields; binary collation for byte-exact comparison; language- specific collations for locale-accurate sort order).
Systems / concepts / patterns extracted¶
- Systems: MySQL (the relational
database whose charset + collation semantics the post
canonicalises), InnoDB (the default
storage engine — all
SHOW CREATE TABLEoutputs showENGINE=InnoDBalongside the charset/collation; the storage engine honours whatever the parser resolved), PlanetScale (the publishing vendor; the post is a pedagogy piece without PlanetScale-specific machinery). - Concepts: character set
(the set of legal characters for a string column, with
MAXLENbyte-count per character), collation (the rules for comparing and sorting characters, with the_ai/_as/_ci/_cs/_ks/_binsuffix taxonomy), utf8mb4 vs utf8 (the canonical MySQL trap:utf8is 3-byte and covers only the BMP;utf8mb4is the real UTF-8; MySQL 8 fixed the default). - Patterns: None. The post is pedagogical reference material; the "most-specific level wins" inheritance rule is a schema-design convention documented in-page on the character-set + collation concepts rather than a reusable cross-system pattern.
Operational numbers¶
- 41 character sets enumerated in
information_schema.character_setson MySQL 8. Notable rows:utf8mb4(MAXLEN=4, defaultutf8mb4_0900_ai_ci);utf8(MAXLEN=3, defaultutf8_general_ci— back-compat only);latin1(MAXLEN=1, defaultlatin1_swedish_ci);ascii(MAXLEN=1);binary(MAXLEN=1, no collation); four UTF variants for BMP-or-beyond storage (ucs2,utf16,utf16le,utf32). - Collation-suffix table: 6 canonical axes —
_aiaccent-insensitive,_asaccent-sensitive,_cicase-insensitive,_cscase-sensitive,_kskana-sensitive,_binbinary. - MySQL 8 default collation upgrade:
utf8mb4_0900_ai_ci(MySQL 8.0+) vsutf8mb4_general_ci(MySQL 5.7). Upgrade paths require explicit collation handling for tables created under 5.7 that want 8.0 semantics —ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_cire-sorts + rewrites.
Caveats¶
- Pedagogy voice, not a production-retrospective or architecture-disclosure post — no incident narrative, no failure modes at scale, no PlanetScale-specific charset handling disclosed.
- No performance numbers — the post doesn't quantify
the overhead of charset conversion on
CONVERT(... USING utf8mb4), buffer-pool cost of 4-byte vs 1-byte character columns, or index-size impact of wider charset choices. - Emoji / supplementary-plane coverage is asserted as
the
utf8mb4selling point but not worked through with concrete examples (noINSERT ... VALUES ('💩')failure demo on autf8column). - Migration guidance is thin: the post asserts "
utf8should no longer be used" but doesn't specify how to migrate legacyutf8tables safely — character-set changes areALTER TABLEoperations that can rewrite the entire table depending on implementation, which matters enormously at production scale (see the 2026-04-21 Announcing Vitess 21 disclosure that PlanetScale's Online DDL now does programmatic text conversion for charset changes rather than relying onCONVERT(... USING utf8mb4)for exactly this reason). - PostgreSQL comparison omitted — the post is MySQL-only; doesn't note that Postgres uses a fundamentally different model (database-level encoding
- OS-collation or ICU-provided collation) where the MySQL column-level granularity doesn't exist.
- UCA version history elided —
utf8mb4_0900_ai_cinames UCA 9.0.0 without explaining what "weight keys" are, why UCA versions matter, or what changes between UCA versions. binary(N)type vsbinarycharset distinction — thebinarycharset is described as a "pseudo charset" for byte-exact comparison; not clearly distinguished from theBINARY/VARBINARYcolumn types which are orthogonal to the charset concept.- No
TEXT/BLOBcolumn discussion — all worked examples useVARCHAR(255); the article doesn't comment on whetherTEXTcolumns (which are charset-aware) behave identically toVARCHAR.
Source¶
- Original: https://planetscale.com/blog/mysql-charsets-collations
- Raw markdown:
raw/planetscale/2026-04-21-character-sets-and-collations-in-mysql-42321391.md