

LIKE predicate cardinality estimation failure

Definition

LIKE-predicate cardinality estimation failure is the canonical blind spot of histogram-based query-planner cardinality estimators: string-pattern predicates (LIKE '%foo%', LIKE 'bar%', regex) produce result-set sizes that equi-height histograms on the column cannot meaningfully predict, because a histogram captures the distribution of values, not the distribution of substring matches.

Why it happens

A column's value histogram answers "how many rows have value X?" or "how many rows have value in range [X, Y]?" It cannot answer "how many rows contain substring S?" without scanning or maintaining a fundamentally different structure (suffix trees, n-gram indexes, sampled distributions of substrings). Production optimizers do neither; they fall back to fixed selectivity heuristics like "LIKE '%foo%' selects 3% of rows" — a guess disconnected from reality.
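The gap between the two kinds of statistics can be sketched in a few lines. This is an illustrative toy, not any real optimizer's code: the 3% fixed guess and the synthetic notes column are assumptions for demonstration.

```python
import random
import string

random.seed(0)

# Hypothetical notes column: 10,000 short strings, of which roughly
# 1% contain the substring "1994" (illustrative data, not JOB data).
notes = []
for _ in range(10_000):
    s = "".join(random.choices(string.ascii_lowercase, k=8))
    if random.random() < 0.01:
        s += " 1994"
    notes.append(s)

# A value histogram can answer range questions like "value >= 'm'"...
actual_range = sum(1 for s in notes if s >= "m") / len(notes)

# ...but for LIKE '%1994%' a typical optimizer falls back to a fixed
# heuristic guess, disconnected from the data.
FIXED_LIKE_SELECTIVITY = 0.03  # the "3% of rows" style heuristic
actual_like = sum("1994" in s for s in notes) / len(notes)

print(f"fixed guess = {FIXED_LIKE_SELECTIVITY:.3f}, "
      f"actual substring selectivity = {actual_like:.3f}")
```

Here the fixed guess is off by roughly 3x on a benign distribution; on a column where the pattern is rare (or very common), the error can be arbitrarily large in either direction.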

Query 5b as canonical illustration

From Databricks' join-ordering experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):

Query 5b of the Join Order Benchmark is a 5-way join looking for "American production companies that released a post-2010 movie on VHS with a note referencing 1994." The VHS check and the 1994-reference check are LIKE predicates on notes fields.

  • Databricks default optimizer plan: First filter American VHS production companies — estimator says 12 rows (actually selective!). Then join outward.
  • LLM-agent plan: First filter VHS releases referencing 1994, then join outward. "Significantly faster."

The default plan looks right by the numbers — 12 rows is very selective. But the 1994-VHS predicate combination is even more selective than the default estimator can predict, because both predicates are LIKE patterns on notes fields. The agent's execute-and-measure loop finds the superior plan because it doesn't need to estimate — it observes.
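The agent's execute-and-measure loop can be sketched abstractly. This is a hypothetical skeleton of the idea, not Databricks' implementation: `run_plan` stands in for whatever executes a candidate join order and is assumed for illustration.

```python
import time

def choose_plan_by_measurement(candidate_plans, run_plan):
    """Pick a join order by executing each candidate and timing it,
    sidestepping cardinality estimation entirely.

    candidate_plans: iterable of opaque plan descriptions.
    run_plan: callable that executes one plan (hypothetical hook)."""
    best_plan, best_time = None, float("inf")
    for plan in candidate_plans:
        start = time.perf_counter()
        run_plan(plan)                        # observe, don't estimate
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan
```

The loop is trivially correct but expensive, which is why this style of tuning is an offline technique: you pay real execution time per candidate instead of trusting a (possibly wildly wrong) estimate.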

From the post verbatim:

"This is because the query uses LIKE predicates to identify VHS releases, which are exceptionally difficult for cardinality estimators."

Why it matters for system design

  1. It's a structurally unfixable estimator bug. Adding histograms doesn't help. String-pattern selectivity requires fundamentally different statistics (e.g. sampled substring distributions, full-text indexes).
  2. It's a common-enough case to matter. Log search, catalog notes, user comments, free-text metadata — the pattern appears wherever string columns hold semi-structured content.
  3. It's a canonical argument for agent-based offline tuning. The agent sidesteps cardinality estimation entirely by measuring real runtimes, and the biggest measured wins in the Databricks experiment are exactly on LIKE-heavy queries.
Related estimator blind spots

  • Correlated predicates. state='CA' AND city='SF' estimated under the independence assumption, multiplying selectivities that are not independent.
  • Skewed distributions. Heavy-tail values lose mass to histogram bucketing.
  • Multi-join compounding. Estimation errors multiply down the plan tree.
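Two of the failure modes above reduce to small arithmetic, shown here with made-up numbers chosen only to make the effect visible.

```python
# 1. Correlated predicates under the independence assumption:
#    the optimizer multiplies per-column selectivities.
sel_state_ca = 0.10          # P(state = 'CA')  (assumed)
sel_city_sf = 0.02           # P(city = 'SF')   (assumed)
estimated = sel_state_ca * sel_city_sf          # independence: 0.002
# But every SF row is a CA row, so the true joint selectivity is
# just P(city = 'SF') — the estimate is 10x too low.
actual = sel_city_sf
print(f"independence estimate {estimated:.4f} vs actual {actual:.4f}")

# 2. Multi-join compounding: a modest per-step error factor
#    multiplies down the plan tree.
per_step_error = 3.0         # each intermediate estimate off by 3x (assumed)
joins = 5
compounded = per_step_error ** joins            # 3^5 = 243
print(f"compounded error after {joins} joins: {compounded:.0f}x")
```

The second calculation is why a mediocre per-predicate estimator can still produce a catastrophic plan for a deep join tree: the errors don't average out, they multiply.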

Sibling concept

concepts/wildcard-prefix-non-sargable is the adjacent problem: a leading-wildcard pattern such as LIKE '%foo' cannot use a B-tree index at all, a different axis of failure. This page is about cost-model failure, not index-access failure.
