Skip to content

CONCEPT Cited by 1 source

Collation

A collation is a set of rules for comparing and sorting strings. Every string character set in MySQL has at least one collation, and one of those is designated the default collation — used when a charset is declared without an explicit collation. Collations govern equality (=), ordering (ORDER BY, <, >), pattern matching (LIKE), and deduplication (DISTINCT, GROUP BY) semantics on string data.

Definition

From PlanetScale's Aaron Francis: "While character sets define the legal characters that can be stored in a column, collations are rules that determine how string comparisons are made. If you are sorting or comparing strings, MySQL uses the collation to decide the order and whether the strings are the same." A single character set can have multiple collations (different comparison rules for the same character repertoire), but a collation belongs to exactly one character set. (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)

The suffix taxonomy

MySQL collations follow a naming scheme where the charset forms a prefix and the suffix encodes the comparison axes:

Suffix Meaning
_ai Accent-insensitive
_as Accent-sensitive
_ci Case-insensitive
_cs Case-sensitive
_ks Kana-sensitive (Japanese hiragana vs katakana distinction)
_bin Binary (byte-for-byte comparison)

These compose: utf8mb4_0900_ai_ci decomposes as utf8mb4 (character set) + 0900 (Unicode Collation Algorithm version 9.0.0 weight keys) + ai (accent-insensitive) + ci (case-insensitive). utf8mb4_0900_as_cs is the same UCA 9.0.0 weight keys but accent- and case-sensitive.

String comparison sensitivity is a collation property

The answer to "are string comparisons case-sensitive in MySQL?" is "it depends on the collation" — not on SQL, not on the charset, not on some server-wide mode. Demonstrated with the COLLATE cast operator:

-- Case-insensitive: "MySQL" = "mysql"
SELECT "MySQL" COLLATE utf8mb4_0900_ai_ci = "mysql" COLLATE utf8mb4_0900_ai_ci;  -- 1

-- Case-sensitive: "MySQL" != "mysql"
SELECT "MySQL" COLLATE utf8mb4_0900_as_cs = "mysql" COLLATE utf8mb4_0900_as_cs;  -- 0

Same strings, same characters, opposite answers, determined entirely by collation choice. The same logic applies to accents: under _ai collations, résumé = resume; under _as, they are distinct.

The inheritance hierarchy

Collations follow the same four-level hierarchy as character sets, with the most-specific winning: column > table > database > server default. If a collation is not explicitly declared at any level, MySQL falls back to the default collation of the character set at whichever level owns the charset declaration. So a table declared DEFAULT CHARSET=utf8mb4 with no explicit COLLATE clause will use utf8mb4_0900_ai_ci (the DEFAULT_COLLATE_NAME for utf8mb4 on MySQL 8). (Source: sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql.)

Table-level override:

CREATE TABLE table_with_collation (
    my_column VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

Column-level override + mixed inheritance:

CREATE TABLE table_with_collation (
    `explicitly_set` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    `implicitly_set` VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- explicitly_set: utf8mb4_general_ci (column override)
-- implicitly_set: utf8mb4_0900_ai_ci (table charset default collation)

Runtime COLLATE cast

Collations can be overridden per-query via the COLLATE operator, not just in schema declarations. This is useful for case-sensitive searches on case-insensitive columns, or vice versa, without having to add a separate column or change the schema:

SELECT * FROM users WHERE username COLLATE utf8mb4_0900_as_cs = 'Alice';

UCA versions

Modern MySQL 8 collations are prefixed with UCA (Unicode Collation Algorithm) version keys — 0900 refers to UCA 9.0.0 weight keys which govern the language-specific ordering and equivalence rules. Upgrading UCA version changes the definition of "equal" or "less than" for some code points; MySQL 5.7's default utf8mb4_general_ci pre-dates UCA-versioned collations and uses a simpler (and in places wrong) language-agnostic rule.

Language-specific UCA collations exist alongside the generic utf8mb4_0900_ai_ci — e.g. utf8mb4_cs_0900_ai_ci (Czech), utf8mb4_da_0900_ai_cs (Danish, case-sensitive), utf8mb4_zh_0900_as_cs (Chinese, accent + case sensitive). These implement locale-accurate sort orders for user-facing data.

Binary collation

utf8mb4_bin (and the older utf8mb4_bin variant with PAD_ATTRIBUTE=PAD SPACE) compares strings byte-for-byte. Useful when you want strict byte equality (e.g. storing tokens, hashes, or case-sensitive identifiers where any case-folding would be a bug) and want to opt out of Unicode normalisation / case-folding / accent-folding entirely.

Operational gotchas

  • JOIN between columns with different collations fails with ERROR 1267 (Illegal mix of collations). This is MySQL refusing to guess which collation should govern the comparison. Fix: COLLATE cast one side, or align collations at schema-design time.
  • GROUP BY deduplicates under collation rules — under _ai_ci, GROUP BY name collapses MySQL, mysql, and MYSQL into a single group. Under _bin, they are three distinct groups.
  • Index ordering is collation-aware — an index on a column with utf8mb4_0900_ai_ci cannot be used to accelerate comparisons under utf8mb4_bin, and vice versa. Collation mismatch between query and index forces a full scan.

Seen in

  • sources/2026-04-21-planetscale-character-sets-and-collations-in-mysql — canonical wiki introduction of MySQL collations, the _ai / _as / _ci / _cs / _ks / _bin suffix taxonomy, the COLLATE cast operator, the utf8mb4_0900_ai_ci default on MySQL 8 with UCA 9.0.0 weight keys, and demonstration that string-comparison sensitivity is a collation property rather than a SQL-standard property.
Last updated · 347 distilled / 1,201 read