CONCEPT Cited by 1 source
utf8mb4 vs utf8 (the MySQL UTF-8 trap)¶
In MySQL, the
character set named utf8 is not actually UTF-8. It
is a three-byte-maximum encoding that covers only Unicode's
Basic Multilingual Plane (BMP) and cannot store any
code point requiring four UTF-8 bytes, which includes most
emoji, many supplementary-plane CJK characters, and
historical scripts. The real UTF-8 character set in MySQL
is called utf8mb4 (the mb4 suffix standing for
"maximum 4 bytes"), which is the MySQL 8 default.
The core claim¶
From PlanetScale's Aaron Francis: "According to the UTF-8
spec, each character is allowed four bytes, meaning MySQL's
utf8 charset was never actually UTF-8 since it only
supported three bytes per character. In MySQL 8, utf8mb4
is the default character set and the one you will use most
often. utf8 is left for backwards compatibility and
should no longer be used."
(Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)
MAXLEN comparison¶
information_schema.character_sets shows both entries side
by side:
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN |
|---|---|---|---|
| utf8 | utf8_general_ci | UTF-8 Unicode | 3 |
| utf8mb4 | utf8mb4_0900_ai_ci | UTF-8 Unicode | 4 |
Same DESCRIPTION; different MAXLEN. The MAXLEN=3 on
utf8 is the entire bug: real UTF-8 requires up to 4 bytes
per character to cover the full Unicode range (U+0000
through U+10FFFF); MySQL's utf8 stops at U+FFFF, the top
of the Basic Multilingual Plane.
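The byte counts behind MAXLEN can be checked outside the database. The sketch below is illustrative only: it uses Python's own UTF-8 codec, and the helper names `utf8_byte_length` and `fits_in_three_bytes` are made up for this example rather than taken from MySQL.

```python
def utf8_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in real UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every code point is at or below U+FFFF (the BMP),
    i.e. each character needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# U+0041, U+00E9, U+20AC, U+1F600: one sample per UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
byte_lengths = [utf8_byte_length(ch) for ch in samples]
print(byte_lengths)
```

Only the last sample, a supplementary-plane code point, needs the fourth byte that a MAXLEN=3 character set has no room for.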
What utf8 silently excludes¶
The 4-byte UTF-8 range includes all of Unicode's supplementary planes (planes 1 through 16), which contain:
- Most emoji, such as 😀 (U+1F600) and 💩 (U+1F4A9); the majority sit in the Supplementary Multilingual Plane at U+1F000 and above, although a few older symbols used as emoji are in the BMP.
- CJK Unified Ideographs Extension B through G (U+20000–U+3134F): thousands of rare Chinese, Japanese, and Korean characters used in proper names, historical texts, and specialised publications.
- Historical scripts: Egyptian Hieroglyphs (U+13000), Linear B (U+10000), Cuneiform (U+12000), Old Italic, etc.
- Mathematical alphanumeric symbols (U+1D400+).
- Music notation (U+1D100+).
An attempt to INSERT any of these into a utf8
column produces either an Incorrect string value error
(strict SQL mode, the default since MySQL 5.7) or, in
non-strict mode, a warning and truncation at the first
4-byte character, so the data is corrupted on write.
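An application can catch this before the write. The following is a minimal sketch, assuming the truncate-at-first-4-byte-character behaviour described above. `simulate_nonstrict_utf8mb3_store` is a hypothetical model of that behaviour for illustration, not MySQL code.

```python
from typing import Optional


def first_supplementary_index(text: str) -> Optional[int]:
    """Index of the first code point above U+FFFF, or None if all are in the BMP."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return None


def simulate_nonstrict_utf8mb3_store(text: str) -> str:
    """Hypothetical model: keep only the text before the first 4-byte character."""
    idx = first_supplementary_index(text)
    return text if idx is None else text[:idx]


message = "Launch day \U0001F680 is here"
stored = simulate_nonstrict_utf8mb3_store(message)
print(repr(stored))
```

Everything from the first supplementary-plane character onward is lost in this model, including the ordinary ASCII text that follows it.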
Why this happened (historical note)¶
MySQL's utf8 was implemented in 2002, when Unicode was
still largely in the BMP range and the 4-byte UTF-8
sequences were rare. MySQL committed to a fixed 3-byte
utf8 encoding for index-length budgeting and storage
predictability. When the Unicode repertoire expanded and
emoji took off, adding 4-byte support as a breaking change
to utf8 would have invalidated every existing schema.
Instead MySQL added a new character set named
utf8mb4 in MySQL 5.5.3 (2010) and left utf8 as a
three-byte encoding for backward compatibility. Changing
the default from latin1 to utf8mb4 took until MySQL 8.0
(2018).
The MySQL 8 fix¶
On MySQL 8.0+, the server default
(character_set_server) is utf8mb4 and its default
collation is utf8mb4_0900_ai_ci. Creating a table
without any charset declaration on MySQL 8 produces a
safe, fully-Unicode table:
```sql
CREATE TABLE no_charset (my_column VARCHAR(255));
SHOW CREATE TABLE no_charset;
-- ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
```
The greenfield-schema recommendation is to accept the
MySQL 8 default and not declare CHARSET=utf8 explicitly
anywhere.
Migration hazard for legacy schemas¶
Tables created on MySQL 5.7 or earlier that declared
CHARSET=utf8 retain that declaration across server
upgrades; the 3-byte charset doesn't automatically upgrade
when the server does. Migrating a legacy utf8 table to
utf8mb4 requires ALTER TABLE ... CONVERT TO CHARACTER
SET utf8mb4 COLLATE utf8mb4_0900_ai_ci, which in a naive
implementation rewrites the entire table: a migration
cost proportional to table size and a source of
production incidents at scale.
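When many tables need the same conversion, the statements are usually generated rather than typed by hand. Below is a minimal sketch that only builds the SQL text and never executes it. The schema and table names are hypothetical, and `quote_identifier` and `build_convert_statements` are illustrative helpers, not part of any MySQL client library.

```python
from typing import Iterable, List, Tuple


def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_convert_statements(
    tables: Iterable[Tuple[str, str]],
    collation: str = "utf8mb4_0900_ai_ci",
) -> List[str]:
    """Build one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table."""
    return [
        f"ALTER TABLE {quote_identifier(schema)}.{quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        for schema, table in tables
    ]


# Hypothetical legacy tables found by an audit.
statements = build_convert_statements([("shop", "customers"), ("shop", "orders")])
for stmt in statements:
    print(stmt)
```

Because each statement may rewrite its whole table, the generated list is best treated as input to a reviewed migration plan rather than something to run directly against production.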
PlanetScale's Online DDL engine in
Vitess 21 specifically added
programmatic text conversion for charset changes "rather
than MySQL's CONVERT(... USING utf8mb4)" to improve
performance on primary-key / iteration-key columns. The
2026-04-21 Announcing Vitess 21 post describes this as the
shipping primitive for charset-change Online DDL. The
PlanetScale engineering investment in this path
is itself evidence that the utf8 to utf8mb4 migration
is a common enough production operation to merit
first-class tooling.
(Source: sources/2026-04-21-planetscale-announcing-vitess-21.)
Index-size considerations¶
Moving from utf8 (MAXLEN=3) to utf8mb4 (MAXLEN=4) for
indexed string columns changes the byte budget for
indexes. On older MySQL versions with a 767-byte
per-index-column limit (InnoDB ROW_FORMAT=COMPACT
without innodb_large_prefix), utf8's MAXLEN=3
supports VARCHAR(255) as a full index (255 * 3 = 765
bytes, fits under 767). utf8mb4's MAXLEN=4 hits the
ceiling at VARCHAR(191) (191 * 4 = 764 bytes), so
migrating schemas often required shortening indexed
columns to VARCHAR(191) or switching to
ROW_FORMAT=DYNAMIC + innodb_large_prefix=1 to raise
the limit to 3,072 bytes. MySQL 8 uses
ROW_FORMAT=DYNAMIC by default, so the ceiling issue is
gone on new schemas, but it persists on legacy tables
that keep an older row format after being imported from
older MySQL versions.
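The VARCHAR(255) and VARCHAR(191) figures follow from integer division of the index byte limit by MAXLEN. A small sketch of that arithmetic, with `max_indexable_chars` as an illustrative helper name:

```python
def max_indexable_chars(limit_bytes: int, maxlen: int) -> int:
    """Largest VARCHAR length whose worst-case byte size fits the index limit."""
    return limit_bytes // maxlen


# Byte limits from the text: 767 (legacy) and 3072 (DYNAMIC row format).
for limit in (767, 3072):
    for charset, maxlen in (("utf8", 3), ("utf8mb4", 4)):
        print(limit, charset, max_indexable_chars(limit, maxlen))
```

Under the 767-byte limit this gives 255 characters for utf8 and 191 for utf8mb4. Under the 3,072-byte limit it gives 1,024 and 768, which is why VARCHAR(255) indexes stop being a problem once the larger limit applies.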
The utf8mb3 alias¶
utf8mb3 is the explicit name for the 3-byte
character set, and the ambiguous utf8 name is a
deprecated alias for it. Since MySQL 8.0.28, SHOW
statements and information_schema tables report utf8mb3
rather than utf8. The eventual plan (not yet
executed as of 2026) is for utf8 to become an alias
for utf8mb4 instead, restoring the naming intuition.
Applications relying on utf8 being 3-byte should be
updated to explicitly use utf8mb3.
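Because the same 3-byte character set can appear under either name, an audit script may want to treat both as equivalent. The sketch below assumes rows shaped like (table_schema, table_name, column_name, character_set_name). The sample rows are invented for illustration and `is_three_byte_utf8` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Both names refer to the same 3-byte character set.
THREE_BYTE_NAMES = {"utf8", "utf8mb3"}

Row = Tuple[str, str, str, Optional[str]]


def is_three_byte_utf8(charset_name: Optional[str]) -> bool:
    """True for either spelling of the 3-byte charset; None means a non-string column."""
    return charset_name is not None and charset_name.lower() in THREE_BYTE_NAMES


# Hypothetical audit rows, not real query output.
rows: List[Row] = [
    ("shop", "customers", "name", "utf8mb3"),
    ("shop", "customers", "bio", "utf8mb4"),
    ("shop", "orders", "note", "utf8"),
    ("shop", "orders", "id", None),
]
needs_migration = [row for row in rows if is_three_byte_utf8(row[3])]
print(needs_migration)
```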
Production checklist¶
- Greenfield MySQL 8 schema: accept the default (utf8mb4). Don't declare CHARSET=utf8 anywhere.
- Any existing schema: audit with SELECT table_schema, table_name, character_set_name FROM information_schema.columns WHERE character_set_name IN ('utf8', 'utf8mb3'). Both names are needed because newer MySQL 8.0 releases report the 3-byte charset as utf8mb3. Plan migration to utf8mb4 for any column that may receive user-generated content.
- Before inserting emoji or supplementary-plane characters: verify the target column is utf8mb4, not utf8. A silently-truncating INSERT will destroy data before the application notices.
- Connection charset: the client connection also has a charset (character_set_client, character_set_connection, character_set_results). Ensure all three are utf8mb4; a mismatched connection charset produces mojibake even on a fully-utf8mb4 schema. Typical bug: server-side columns are utf8mb4 but the JDBC driver negotiated utf8, so emoji sent from the app arrive as ??? at the database.
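The two connection-charset symptoms, mojibake and question marks, can be reproduced without a database. This is a rough sketch under a stated assumption: Python has no utf8mb3 codec, so latin-1 stands in for "a charset that disagrees with, or cannot represent, what was sent". It models the failure shape, not MySQL's actual conversion code.

```python
# Symptom 1: mojibake. The bytes are valid UTF-8 but the receiver
# interprets them under a different charset (latin-1 as a stand-in).
original = "caf\u00e9"
wire_bytes = original.encode("utf-8")
misread = wire_bytes.decode("latin-1")
print(misread)

# Symptom 2: question marks. A lossy conversion to a charset that cannot
# represent the character substitutes '?' for it.
emoji = "\U0001F600"
replaced = emoji.encode("latin-1", errors="replace").decode("latin-1")
print(replaced)
```

In both cases the stored value is already wrong by the time it reaches the column, which is why the column charset alone is not enough to check.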
Classification¶
This is a canonical
backward-compatibility failure mode at the vendor-naming
level: utf8 borrows the name of the
underlying internet standard, but the implementation
doesn't match the name. The cost of fixing the name (by
redefining what utf8 means) would have silently broken
every existing schema, so MySQL chose to leave the
misnomer in place and introduce a differently-named
replacement. The result is ~10 years of new MySQL users
tripping over "Why doesn't my emoji save?" before they
discover the correct charset.
Seen in¶
- sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql: canonical wiki introduction of the utf8 vs utf8mb4 trap, covering MAXLEN=3 vs MAXLEN=4, the supplementary-plane coverage gap, MySQL 8 making utf8mb4 the default, and the explicit recommendation "utf8 should no longer be used."
- sources/2026-04-21-planetscale-announcing-vitess-21: shipping primitive. Vitess 21's Online DDL engine adds programmatic text conversion for charset changes rather than relying on MySQL's CONVERT(... USING utf8mb4), specifically to improve performance in primary-key / iteration-key columns during utf8 to utf8mb4 migrations.