
CONCEPT Cited by 1 source

utf8mb4 vs utf8 (the MySQL UTF-8 trap)

In MySQL, the character set named utf8 is not actually UTF-8. It is a three-byte-maximum encoding that covers only Unicode's Basic Multilingual Plane (BMP) and cannot store any code point requiring four UTF-8 bytes, which includes most emoji, many supplementary-plane CJK characters, and historical scripts. The real UTF-8 character set in MySQL is called utf8mb4 (the mb4 suffix standing for "maximum 4 bytes"), which is the MySQL 8 default.

The core claim

From PlanetScale's Aaron Francis: "According to the UTF-8 spec, each character is allowed four bytes, meaning MySQL's utf8 charset was never actually UTF-8 since it only supported three bytes per character. In MySQL 8, utf8mb4 is the default character set and the one you will use most often. utf8 is left for backwards compatibility and should no longer be used." (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)

MAXLEN comparison

information_schema.character_sets shows both entries side by side:

CHARACTER_SET_NAME  DEFAULT_COLLATE_NAME  DESCRIPTION    MAXLEN
utf8                utf8_general_ci       UTF-8 Unicode  3
utf8mb4             utf8mb4_0900_ai_ci    UTF-8 Unicode  4

Same DESCRIPTION, different MAXLEN. The MAXLEN=3 on utf8 is the entire bug: real UTF-8 requires up to 4 bytes per character to cover the full Unicode range (U+0000 through U+10FFFF); MySQL's utf8 stops at U+FFFF, the top of the Basic Multilingual Plane.
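
The byte counts behind MAXLEN can be checked directly on a server. The following is a minimal sketch, assuming a MySQL 8 session; the sample characters are built from hex bytes so the result should not depend on the client's connection charset, and the column aliases are illustrative names only.

```sql
-- List the UTF-8 family as the server reports it.
-- Assumption: newer 8.0 releases may show the 3-byte set as utf8mb3 instead of utf8.
SELECT character_set_name, default_collate_name, maxlen
FROM information_schema.character_sets
WHERE character_set_name IN ('utf8', 'utf8mb3', 'utf8mb4');

-- Encoded byte length of one character from each UTF-8 length class.
SELECT
  LENGTH(CONVERT(UNHEX('61')       USING utf8mb4)) AS ascii_bytes,    -- 'a', U+0061: 1 byte
  LENGTH(CONVERT(UNHEX('C3A9')     USING utf8mb4)) AS latin_bytes,    -- U+00E9: 2 bytes
  LENGTH(CONVERT(UNHEX('E8AA9E')   USING utf8mb4)) AS bmp_cjk_bytes,  -- U+8A9E: 3 bytes
  LENGTH(CONVERT(UNHEX('F09F9880') USING utf8mb4)) AS emoji_bytes;    -- U+1F600: 4 bytes
```

Only the last column needs the fourth byte, which is exactly the class of characters a MAXLEN=3 character set cannot represent.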

What utf8 silently excludes

The 4-byte UTF-8 range includes all of Unicode's supplementary planes (planes 1 through 16), which contain:

  • Most emoji: 😀, 🎉, 🚀, 💩, etc. (the bulk of emoji sit in the Supplementary Multilingual Plane at U+1F000 and above; a few older ones, such as U+263A, are in the BMP).
  • CJK Unified Ideographs Extension B through G (U+20000–U+3134F): tens of thousands of rare Chinese, Japanese, and Korean characters used in proper names, historical texts, and specialised publications.
  • Historical scripts: Egyptian Hieroglyphs (U+13000), Linear B (U+10000), Cuneiform (U+12000), Old Italic, etc.
  • Mathematical alphanumeric symbols (U+1D400+).
  • Music notation (U+1D100+).

An attempt to INSERT any of these into a utf8-typed column produces either an Incorrect string value error (strict SQL mode, the default since MySQL 5.7) or truncation at the first 4-byte character with only a warning (non-strict mode). In the non-strict case the data is corrupted on write.
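
The two outcomes can be reproduced on a scratch table. This is a sketch only: the table name charset_demo and its column names are hypothetical, and it assumes a UTF-8 terminal so that the emoji literal reaches the server intact.

```sql
-- Hypothetical scratch table with one 3-byte and one 4-byte column.
CREATE TABLE charset_demo (
  legacy VARCHAR(20) CHARACTER SET utf8mb3,
  modern VARCHAR(20) CHARACTER SET utf8mb4
);

-- Make the connection itself speak utf8mb4.
SET NAMES utf8mb4;

-- Check which behaviour to expect: strict modes reject, non-strict modes truncate.
SELECT @@sql_mode;

-- Expected to succeed: the column can hold 4-byte characters.
INSERT INTO charset_demo (modern) VALUES ('launch 🚀 day');

-- Expected to be rejected with an "Incorrect string value" error in strict mode,
-- or stored truncated before the rocket (with a warning) in non-strict mode.
INSERT INTO charset_demo (legacy) VALUES ('launch 🚀 day');

-- Inspect any warning raised by the previous statement, then the stored bytes.
SHOW WARNINGS;
SELECT legacy, HEX(legacy), modern, HEX(modern) FROM charset_demo;

DROP TABLE charset_demo;
```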

Why this happened (historical note)

MySQL's utf8 was implemented in 2002, when the characters in common use sat almost entirely in the BMP and 4-byte UTF-8 sequences were rare. MySQL committed to a 3-byte-maximum utf8 encoding for index-length budgeting and storage predictability. When the Unicode repertoire expanded and emoji took off, adding 4-byte support as a breaking change to utf8 would have invalidated every existing schema. Instead MySQL added a new character set named utf8mb4 in MySQL 5.5.3 (2010) and left utf8 as a three-byte encoding for backward compatibility. Changing the default from latin1 to utf8mb4 took until MySQL 8.0 (2018).

The MySQL 8 fix

On MySQL 8.0+, the server default (character_set_server) is utf8mb4 and its default collation is utf8mb4_0900_ai_ci. Creating a table without any charset declaration on MySQL 8 produces a safe, fully-Unicode table:

CREATE TABLE no_charset (my_column VARCHAR(255));
SHOW CREATE TABLE no_charset;
-- ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

The greenfield-schema recommendation is to accept the MySQL 8 default and not declare CHARSET=utf8 explicitly anywhere.
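
Before relying on the default, it may be worth confirming what a new table would actually inherit. A minimal sketch, assuming a MySQL 8 session with a database selected; note that a table created without a charset clause takes the database default, which on an upgraded server can still differ from the server default.

```sql
-- Server-wide defaults.
SELECT @@character_set_server, @@collation_server;

-- Defaults of the currently selected database, which new tables inherit.
SELECT schema_name, default_character_set_name, default_collation_name
FROM information_schema.schemata
WHERE schema_name = DATABASE();
```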

Migration hazard for legacy schemas

Tables created on MySQL 5.7 or earlier that declared CHARSET=utf8 retain that declaration across server upgrade: the 3-byte charset doesn't automatically upgrade when the server does. Migrating a legacy utf8 table to utf8mb4 requires ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci, which in a naive implementation rewrites the entire table, a migration cost proportional to table size and a source of production incidents at scale.
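
Spelled out for a single table, the conversion might look like the sketch below. The table name legacy_comments is hypothetical, and the statement is shown as plain MySQL DDL rather than as any particular online-migration tool's workflow.

```sql
-- Inspect the current declaration first.
SHOW CREATE TABLE legacy_comments;

-- Convert every character column and the table default in one statement.
-- This rewrites the table, so on large tables it is usually run through
-- an online schema-change path rather than directly.
ALTER TABLE legacy_comments
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Confirm the result.
SHOW CREATE TABLE legacy_comments;
```

Changing only the table default (ALTER TABLE ... DEFAULT CHARACTER SET utf8mb4) leaves existing columns on their old charset, so CONVERT TO is the form that actually migrates stored data.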

PlanetScale's Online DDL engine in Vitess 21 specifically added programmatic text conversion for charset changes "rather than MySQL's CONVERT(... USING utf8mb4)" to improve performance on primary-key / iteration-key columns, canonicalised in the 2026-04-21 Announcing Vitess 21 post as the shipping primitive for charset-change Online DDL. The PlanetScale engineering investment in this path is itself evidence that the utf8 → utf8mb4 migration is a common enough production operation to merit first-class tooling. (Source: sources/2026-04-21-planetscale-announcing-vitess-21.)

Index-size considerations

Moving from utf8 (MAXLEN=3) to utf8mb4 (MAXLEN=4) for indexed string columns changes the byte budget for indexes. On older MySQL versions with a 767-byte per-index-column limit (InnoDB ROW_FORMAT=COMPACT without innodb_large_prefix), utf8's MAXLEN=3 supports VARCHAR(255) as a full index (255 * 3 = 765 bytes, fits under 767). utf8mb4's MAXLEN=4 hits the ceiling at VARCHAR(191) (191 * 4 = 764 bytes), so migrating schemas often required shortening indexed columns to VARCHAR(191) or switching to ROW_FORMAT=DYNAMIC + innodb_large_prefix=1 to raise the limit to 3,072 bytes. Modern MySQL 8 uses ROW_FORMAT=DYNAMIC by default so the ceiling issue is gone on new schemas, but it's preserved on legacy schemas imported from older MySQL versions.
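
The arithmetic above, plus a way to find tables that may still carry the old limit, can be sketched as follows. The column aliases are illustrative, and the second query assumes the row formats are reported as 'Compact' and 'Redundant' in information_schema.tables.

```sql
-- Worst-case index bytes per column and the resulting character ceilings.
SELECT
  255 * 3          AS utf8mb3_varchar255_bytes,  -- 765, fits under 767
  255 * 4          AS utf8mb4_varchar255_bytes,  -- 1020, exceeds 767
  FLOOR(767 / 4)   AS max_utf8mb4_chars_at_767,  -- 191
  FLOOR(3072 / 4)  AS max_utf8mb4_chars_at_3072; -- 768

-- Legacy InnoDB tables whose row format still implies the 767-byte limit.
SELECT table_schema, table_name, row_format
FROM information_schema.tables
WHERE engine = 'InnoDB'
  AND row_format IN ('Compact', 'Redundant');
```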

The utf8mb3 alias

utf8mb3 is the explicit name for the 3-byte character set. It has been available as an alias since utf8mb4 was introduced, and from MySQL 8.0.28 onward the server reports utf8mb3 in place of utf8 in SHOW output and information_schema; the ambiguous utf8 name is deprecated. The eventual plan (not yet executed as of 2026) is for utf8 to become an alias for utf8mb4 instead, restoring the naming intuition. Applications relying on utf8 being 3-byte should be updated to explicitly use utf8mb3.

Production checklist

  • Greenfield MySQL 8 schema: accept the default (utf8mb4). Don't declare CHARSET=utf8 anywhere.
  • Any existing schema: audit with SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.columns WHERE character_set_name IN ('utf8', 'utf8mb3') (newer 8.0 releases report the 3-byte set as utf8mb3). Plan migration to utf8mb4 for any column that may receive user-generated content.
  • Before inserting emoji or supplementary-plane characters: verify the target column is utf8mb4, not utf8. In non-strict mode a truncating INSERT will destroy data before the application notices.
  • Connection charset: the client connection also has a charset (character_set_client, character_set_connection, character_set_results). Ensure all three are utf8mb4; a mismatched connection charset corrupts text even on a fully-utf8mb4 schema. Typical bug: server-side columns are utf8mb4 but the JDBC driver negotiated utf8, so emoji sent from the app arrive as ??? at the database.
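
The audit and connection checks from this list can be run as one short session. This is a sketch under assumptions: it filters out the system schemas by name and matches both spellings of the 3-byte charset, since which one appears depends on the server version.

```sql
-- Columns still declared with the 3-byte character set.
SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE character_set_name IN ('utf8', 'utf8mb3')
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY table_schema, table_name, ordinal_position;

-- Set all three connection charsets at once, then verify them.
SET NAMES utf8mb4;
SELECT @@character_set_client, @@character_set_connection, @@character_set_results;
```

SET NAMES only affects the current session, so application drivers typically need the equivalent setting in their own connection configuration.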

Classification

This is a canonical backward-compatibility failure mode at the vendor-naming altitude: utf8 carries the name of the underlying internet standard, but the implementation doesn't match the name. The cost of fixing the name (by redefining what utf8 means) would have silently broken every existing schema, so MySQL chose to leave the misnomer in place and introduce a differently-named replacement. The result is ~10 years of new MySQL users tripping over "Why doesn't my emoji save?" before they discover the correct charset.

Seen in

  • sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql: canonical wiki introduction of the utf8 vs utf8mb4 trap: MAXLEN=3 vs MAXLEN=4, the supplementary-plane coverage gap, MySQL 8 making utf8mb4 the default, and the explicit recommendation "utf8 should no longer be used."
  • sources/2026-04-21-planetscale-announcing-vitess-21: shipping primitive: Vitess 21's Online DDL engine adds programmatic text conversion for charset changes rather than relying on MySQL's CONVERT(... USING utf8mb4), specifically to improve performance in primary-key / iteration-key columns during utf8 → utf8mb4 migrations.