Skip to content

CONCEPT Cited by 1 source

Skewed column selectivity

Skewed column selectivity is the condition where a column's selectivity computed on average (across the whole table) is a poor predictor of selectivity for specific predicate values. An index on a skewed column can be highly useful for one range of values and useless for another — a structural pitfall in both index design and query-planner reasoning.

(Source: sources/2026-04-21-planetscale-why-isnt-mysql-using-my-index.)

The canonical worked example

Consider a table of one million users, where 99% are of type = "user" and 1% are of type = "admin". In this case, an index on type might seem useless because it's not very selective on average. But when you're querying for admins, it is highly selective. So while checking average selectivity is a good rule of thumb, pay careful attention to unevenly distributed data.

— Aaron Francis, Why isn't MySQL using my index?

Running the numbers on a 1,000,000-row table with 2 distinct values:

  • Average selectivity: COUNT(DISTINCT type) / COUNT(*) = 2 / 1,000,000 = 0.000002. Essentially zero.
  • Selectivity for type = 'admin': the index walk returns ~10,000 rows (1% of 1M) out of 1M → narrowing ratio 0.01. Highly useful for filtering.
  • Selectivity for type = 'user': the index walk returns ~990,000 rows out of 1M → narrowing ratio 0.99. Worse than a table scan when you have to walk the index then walk the clustered index for each row.

Why the planner gets it wrong

The MySQL optimiser uses the stored index statistics (sampled cardinality) to estimate selectivity at plan time. The statistics are a single average number per index, not a histogram of value frequencies. So for a skewed column the optimiser either:

  • Underuses the index (thinking it's not selective enough on average, ignoring the rare-value win).
  • Overuses the index (thinking it's selective enough on average, losing on the common-value path).

Modern MySQL (≥8.0) supports histogram statistics via ANALYZE TABLE ... UPDATE HISTOGRAM ON which records a value-frequency distribution — addressing skewed-column planning specifically. Aaron Francis's 2023 post predates widespread histogram adoption and does not cite the mechanism.

Mitigation patterns

Partial index (Postgres; not native MySQL)

PostgreSQL supports CREATE INDEX ... WHERE type = 'admin' — index only the rare values:

-- PostgreSQL
CREATE INDEX users_admin_idx ON users(type) WHERE type = 'admin';

Storage-efficient (only 1% of rows indexed) and perfectly selective for admin lookups. MySQL has no native partial- index syntax; the workaround is a functional/generated- column trick.

Functional index on the narrowing condition (MySQL 8.0.13+)

-- Index only NOT NULL admin rows via a generated column
ALTER TABLE users
  ADD COLUMN admin_marker TINYINT
    GENERATED ALWAYS AS (CASE WHEN type = 'admin' THEN 1 END) STORED,
  ADD INDEX (admin_marker);

Rows where type = 'user' get NULL in admin_marker and (depending on MySQL version) are not indexed in the secondary B+tree — approximating the partial-index behaviour. See concepts/functional-index-mysql for the full mechanism.

Histogram statistics (MySQL 8.0+)

ANALYZE TABLE users UPDATE HISTOGRAM ON type WITH 16 BUCKETS;

Gives the optimiser a value-frequency distribution it can use to distinguish type = 'admin' (rare, use index) from type = 'user' (common, prefer scan). Independent of the index — histograms work even on unindexed columns.

Separate the rare table

If the skew is severe and persistent, a vertical split into users_admin and users_regular tables (with a union-view for general queries) trades write-path complexity for predictable query plans.

When to apply the caveat

Run a distribution query before trusting an index on a categorical column:

SELECT type, COUNT(*) AS cnt,
       COUNT(*) / (SELECT COUNT(*) FROM users) AS fraction
FROM users
GROUP BY type
ORDER BY cnt DESC;

Any column where the top-N values cover >80-90% of rows is a skew candidate. Feature flags, status enums (active / deleted / pending), rare-event markers (is_fraudulent, is_banned), tenant IDs in multi-tenant tables where one tenant dominates.

Seen in

Last updated · 470 distilled / 1,213 read