

LIKE predicate cardinality estimation failure

Definition

LIKE-predicate cardinality estimation failure is the canonical blind spot of histogram-based query-planner cardinality estimators: string-pattern predicates (LIKE '%foo%', LIKE 'bar%', regex) produce result-set sizes that equi-height histograms on the column cannot meaningfully predict, because a histogram captures the distribution of values, not the distribution of substring matches.

Why it happens

A column's value histogram answers "how many rows have value X?" or "how many rows have value in range [X, Y]?" It cannot answer "how many rows contain substring S?" without scanning or maintaining a fundamentally different structure (suffix trees, n-gram indexes, sampled distributions of substrings). Production optimizers do neither; they fall back to fixed selectivity heuristics like "LIKE '%foo%' selects 3% of rows" — a guess disconnected from reality.
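The gap between the two kinds of statistics can be sketched in a few lines. This is an illustrative toy, not any real optimizer's code: the 3% fixed guess and the synthetic notes column are assumptions for demonstration.

```python
import random
import string

random.seed(0)

# Hypothetical notes column: 10,000 short strings, of which roughly
# 1% contain the substring "1994" (illustrative data, not JOB data).
notes = []
for _ in range(10_000):
    s = "".join(random.choices(string.ascii_lowercase, k=8))
    if random.random() < 0.01:
        s += " 1994"
    notes.append(s)

# A value histogram can answer range questions like "value >= 'm'"...
actual_range = sum(1 for s in notes if s >= "m") / len(notes)

# ...but for LIKE '%1994%' a typical optimizer falls back to a fixed
# heuristic guess, disconnected from the data.
FIXED_LIKE_SELECTIVITY = 0.03  # the "3% of rows" style heuristic
actual_like = sum("1994" in s for s in notes) / len(notes)

print(f"fixed guess = {FIXED_LIKE_SELECTIVITY:.3f}, "
      f"actual substring selectivity = {actual_like:.3f}")
```

Here the fixed guess is off by roughly 3x on a benign distribution; on a column where the pattern is rare (or very common), the error can be arbitrarily large in either direction.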

Query 5b as canonical illustration

From Databricks' join-ordering experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):

Query 5b of the Join Order Benchmark is a 5-way join looking for "American production companies that released a post-2010 movie on VHS with a note referencing 1994." The VHS check and the 1994-reference check are LIKE predicates on notes fields.

  • Databricks default optimizer plan: First filter American VHS production companies — estimator says 12 rows (actually selective!). Then join outward.
  • LLM-agent plan: First filter VHS releases referencing 1994, then join outward. "Significantly faster."

The default plan looks right by the numbers — 12 rows is very selective. But the 1994-VHS predicate combination is even more selective than the default estimator can predict, because both predicates are LIKE patterns on notes fields. The agent's execute-and-measure loop finds the superior plan because it doesn't need to estimate — it observes.
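The agent's execute-and-measure loop can be sketched abstractly. This is a hypothetical skeleton of the idea, not Databricks' implementation: `run_plan` stands in for whatever executes a candidate join order and is assumed for illustration.

```python
import time

def choose_plan_by_measurement(candidate_plans, run_plan):
    """Pick a join order by executing each candidate and timing it,
    sidestepping cardinality estimation entirely.

    candidate_plans: iterable of opaque plan descriptions.
    run_plan: callable that executes one plan (hypothetical hook)."""
    best_plan, best_time = None, float("inf")
    for plan in candidate_plans:
        start = time.perf_counter()
        run_plan(plan)                        # observe, don't estimate
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan
```

The loop is trivially correct but expensive, which is why this style of tuning is an offline technique: you pay real execution time per candidate instead of trusting a (possibly wildly wrong) estimate.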

From the post verbatim:

"This is because the query uses LIKE predicates to identify VHS releases, which are exceptionally difficult for cardinality estimators."

Why it matters for system design

  1. It's a structurally unfixable estimator bug. Adding histograms doesn't help. String-pattern selectivity requires fundamentally different statistics (e.g. sampled substring distributions, full-text indexes).
  2. It's a common-enough case to matter. Log search, catalog notes, user comments, free-text metadata — the pattern appears wherever string columns hold semi-structured content.
  3. It's a canonical argument for agent-based offline tuning. The agent sidesteps cardinality estimation entirely by measuring real runtimes, and the biggest measured wins in the Databricks experiment are exactly on LIKE-heavy queries.
Related estimator blind spots

  • Correlated predicates. state='CA' AND city='SF' estimated under the independence assumption, multiplying selectivities that are not independent.
  • Skewed distributions. Heavy-tail values lose mass to histogram bucketing.
  • Multi-join compounding. Estimation errors multiply down the plan tree.
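Two of the failure modes above reduce to small arithmetic, shown here with made-up numbers chosen only to make the effect visible.

```python
# 1. Correlated predicates under the independence assumption:
#    the optimizer multiplies per-column selectivities.
sel_state_ca = 0.10          # P(state = 'CA')  (assumed)
sel_city_sf = 0.02           # P(city = 'SF')   (assumed)
estimated = sel_state_ca * sel_city_sf          # independence: 0.002
# But every SF row is a CA row, so the true joint selectivity is
# just P(city = 'SF') — the estimate is 10x too low.
actual = sel_city_sf
print(f"independence estimate {estimated:.4f} vs actual {actual:.4f}")

# 2. Multi-join compounding: a modest per-step error factor
#    multiplies down the plan tree.
per_step_error = 3.0         # each intermediate estimate off by 3x (assumed)
joins = 5
compounded = per_step_error ** joins            # 3^5 = 243
print(f"compounded error after {joins} joins: {compounded:.0f}x")
```

The second calculation is why a mediocre per-predicate estimator can still produce a catastrophic plan for a deep join tree: the errors don't average out, they multiply.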

Sibling concept

concepts/wildcard-prefix-non-sargable is the adjacent problem: a leading-wildcard pattern such as LIKE '%foo' cannot use a B-tree index at all, a different axis of failure. This page is about cost-model failure, not index-access failure.
