CONCEPT Cited by 1 source
Character set¶
A character set defines the set of characters that can
legally be stored in a string column, plus the byte-length
encoding for each character. In MySQL,
every string column (CHAR, VARCHAR, TEXT, ENUM,
SET) has exactly one character set assigned to it — either
explicitly declared or inherited from the table, database,
or server default.
Definition¶
From PlanetScale's Aaron Francis: "A character set defines
the characters allowed to go in a column. A collation is a
set of rules for comparing those characters." The two
settings are paired — every character set has at least one
collation, and one designated
DEFAULT_COLLATE_NAME — but they govern different
operations (storage vs comparison) and can be overridden
independently.
(Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)
The information_schema.character_sets catalogue¶
MySQL ships 41 character sets, each with a default
collation, a description, and a MAXLEN — the maximum
number of bytes per character:
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN |
|---|---|---|---|
| ascii | ascii_general_ci | US ASCII | 1 |
| binary | binary | Binary pseudo | 1 |
| latin1 | latin1_swedish_ci | cp1252 West Euro | 1 |
| utf8 | utf8_general_ci | UTF-8 Unicode | 3 |
| utf8mb4 | utf8mb4_0900_ai_ci | UTF-8 Unicode | 4 |
| ucs2 | ucs2_general_ci | UCS-2 Unicode | 2 |
| utf16 | utf16_general_ci | UTF-16 Unicode | 4 |
Two canonical groupings:
- Single-byte charsets (
ascii,latin1,cp1250, the various DOS and Mac codepages): MAXLEN=1 — storage is exactly one byte per character, so aVARCHAR(255)is exactly 255 bytes. - Unicode charsets (
utf8,utf8mb4,ucs2,utf16,utf32): MAXLEN ≥ 2 — storage is variable-length (utf8,utf8mb4) or fixed wider-than-byte (ucs22B,utf324B). AVARCHAR(255)inutf8mb4can need up to 1,020 bytes.
MAXLEN drives storage + index sizing¶
The MAXLEN is load-bearing for schema design because
MySQL's on-disk + index page-size limits are measured in
bytes, not characters. InnoDB's default max index key
length is 3,072 bytes; a VARCHAR(768) column in utf8mb4
(MAXLEN=4) hits that ceiling at 3,072 bytes and cannot be
indexed without a prefix, while the same column in latin1
(MAXLEN=1) consumes only 768 bytes and indexes fit easily.
This is why migration from latin1 → utf8mb4 sometimes
requires index-length adjustments (KEY (col(191)) instead
of KEY (col)) in older MySQL versions.
The inheritance hierarchy¶
Character sets are declared at four levels, with the most-specific winning:
- Column-level —
VARCHAR(255) CHARACTER SET latin1on an individual column. - Table-level —
CREATE TABLE t (...) DEFAULT CHARSET=latin1applies to all string columns without an explicit column-level charset. - Database-level —
ALTER DATABASE d DEFAULT CHARACTER SET utf8mb4applies to all tables in databasedwithout an explicit table-level charset. - Server default —
character_set_serversystem variable (defaultutf8mb4in MySQL 8).
"A column-level specification will override a table-level specification, a table-level specification overrides the database default, and a database-level charset overrides the server default." (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)
Worked example:
CREATE TABLE mixed_collations (
`explicitly_set` VARCHAR(255) CHARACTER SET latin1,
`implicitly_set` VARCHAR(255)
);
After SHOW CREATE TABLE:
CREATE TABLE `mixed_collations` (
`explicitly_set` VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
`implicitly_set` VARCHAR(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
The explicitly_set column carries latin1 (column-level
override); the implicitly_set column inherits the table
default (utf8mb4) which itself inherited from the server
default.
Canonical MySQL 8 default¶
utf8mb4 is the server default on MySQL 8.0+. This is a
deliberate fix: MySQL 5.7 and earlier defaulted to
latin1, which silently failed to store non-Latin
characters. For greenfield schemas on MySQL 8, the
PlanetScale recommendation (and widely-held community
consensus) is to accept the server default — full Unicode
coverage including supplementary-plane code points
(emoji, rare CJK, historical scripts). See
concepts/utf8mb4-vs-utf8 for the canonical trap on
why utf8 is not a safe Unicode choice in MySQL.
Diagnostic primitive: SHOW CREATE TABLE¶
The authoritative way to discover what charset and
collation a column actually has — as opposed to what you
think you declared — is SHOW CREATE TABLE. MySQL
resolves the inheritance hierarchy at table-creation time
and records the resolved values; SHOW CREATE TABLE
replays them verbatim. This is the first thing to check
when debugging charset-related bugs (encoding corruption,
unexpected equality comparison, collation-mismatch
errors during JOIN).
Seen in¶
- sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql
— canonical wiki introduction of MySQL's character-set
concept, the
information_schema.character_setscatalogue, theMAXLENbyte-count-per-character property, the four-level inheritance hierarchy with most-specific-wins, and theSHOW CREATE TABLEdiagnostic primitive.