Skip to content

CONCEPT Cited by 1 source

Character set

A character set defines the set of characters that can legally be stored in a string column, plus the byte-length encoding for each character. In MySQL, every string column (CHAR, VARCHAR, TEXT, ENUM, SET) has exactly one character set assigned to it — either explicitly declared or inherited from the table, database, or server default.

Definition

From PlanetScale's Aaron Francis: "A character set defines the characters allowed to go in a column. A collation is a set of rules for comparing those characters." The two settings are paired — every character set has at least one collation, and one designated DEFAULT_COLLATE_NAME — but they govern different operations (storage vs comparison) and can be overridden independently. (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)

The information_schema.character_sets catalogue

MySQL ships 41 character sets, each with a default collation, a description, and a MAXLEN — the maximum number of bytes per character:

SELECT * FROM information_schema.character_sets
ORDER BY character_set_name;
CHARACTER_SET_NAME DEFAULT_COLLATE_NAME DESCRIPTION MAXLEN
ascii ascii_general_ci US ASCII 1
binary binary Binary pseudo 1
latin1 latin1_swedish_ci cp1252 West Euro 1
utf8 utf8_general_ci UTF-8 Unicode 3
utf8mb4 utf8mb4_0900_ai_ci UTF-8 Unicode 4
ucs2 ucs2_general_ci UCS-2 Unicode 2
utf16 utf16_general_ci UTF-16 Unicode 4

Two canonical groupings:

  • Single-byte charsets (ascii, latin1, cp1250, the various DOS and Mac codepages): MAXLEN=1 — storage is exactly one byte per character, so a VARCHAR(255) is exactly 255 bytes.
  • Unicode charsets (utf8, utf8mb4, ucs2, utf16, utf32): MAXLEN ≥ 2 — storage is variable-length (utf8, utf8mb4) or fixed wider-than-byte (ucs2 2B, utf32 4B). A VARCHAR(255) in utf8mb4 can need up to 1,020 bytes.

MAXLEN drives storage + index sizing

The MAXLEN is load-bearing for schema design because MySQL's on-disk + index page-size limits are measured in bytes, not characters. InnoDB's default max index key length is 3,072 bytes; a VARCHAR(768) column in utf8mb4 (MAXLEN=4) hits that ceiling at 3,072 bytes and cannot be indexed without a prefix, while the same column in latin1 (MAXLEN=1) consumes only 768 bytes and indexes fit easily. This is why migration from latin1utf8mb4 sometimes requires index-length adjustments (KEY (col(191)) instead of KEY (col)) in older MySQL versions.

The inheritance hierarchy

Character sets are declared at four levels, with the most-specific winning:

  1. Column-levelVARCHAR(255) CHARACTER SET latin1 on an individual column.
  2. Table-levelCREATE TABLE t (...) DEFAULT CHARSET=latin1 applies to all string columns without an explicit column-level charset.
  3. Database-levelALTER DATABASE d DEFAULT CHARACTER SET utf8mb4 applies to all tables in database d without an explicit table-level charset.
  4. Server defaultcharacter_set_server system variable (default utf8mb4 in MySQL 8).

"A column-level specification will override a table-level specification, a table-level specification overrides the database default, and a database-level charset overrides the server default." (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)

Worked example:

CREATE TABLE mixed_collations (
    `explicitly_set` VARCHAR(255) CHARACTER SET latin1,
    `implicitly_set` VARCHAR(255)
);

After SHOW CREATE TABLE:

CREATE TABLE `mixed_collations` (
  `explicitly_set` VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `implicitly_set` VARCHAR(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

The explicitly_set column carries latin1 (column-level override); the implicitly_set column inherits the table default (utf8mb4) which itself inherited from the server default.

Canonical MySQL 8 default

utf8mb4 is the server default on MySQL 8.0+. This is a deliberate fix: MySQL 5.7 and earlier defaulted to latin1, which silently failed to store non-Latin characters. For greenfield schemas on MySQL 8, the PlanetScale recommendation (and widely-held community consensus) is to accept the server default — full Unicode coverage including supplementary-plane code points (emoji, rare CJK, historical scripts). See concepts/utf8mb4-vs-utf8 for the canonical trap on why utf8 is not a safe Unicode choice in MySQL.

Diagnostic primitive: SHOW CREATE TABLE

The authoritative way to discover what charset and collation a column actually has — as opposed to what you think you declared — is SHOW CREATE TABLE. MySQL resolves the inheritance hierarchy at table-creation time and records the resolved values; SHOW CREATE TABLE replays them verbatim. This is the first thing to check when debugging charset-related bugs (encoding corruption, unexpected equality comparison, collation-mismatch errors during JOIN).

Seen in

Last updated · 347 distilled / 1,201 read