LIKE predicate cardinality estimation failure¶
Definition¶
LIKE-predicate cardinality estimation failure is the
canonical blind spot of histogram-based query-planner
cardinality estimators:
string-pattern predicates (LIKE '%foo%', LIKE 'bar%', regex)
produce result-set sizes that equi-height histograms on the
column cannot meaningfully predict, because the histogram
captures value distribution but not substring-match
distribution.
Why it happens¶
A column's value histogram answers "how many rows have value
X?" or "how many rows have value in range [X, Y]?" It
cannot answer "how many rows contain substring S?" without
scanning or maintaining a fundamentally different structure
(suffix trees, n-gram indexes, sampled distributions of
substrings). Production optimizers do neither; they fall back
to fixed selectivity heuristics like "LIKE '%foo%' selects
3% of rows" — a guess disconnected from reality.
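The fallback can be sketched numerically: an equi-height histogram answers a range predicate reasonably, but for a substring predicate it has nothing, so the estimator applies a fixed constant. All numbers below are illustrative (the 3% figure is a stand-in, not any particular engine's default):

```python
import random

random.seed(42)

# Hypothetical notes column: ~2% of rows mention "VHS" somewhere.
rows = [f"note {i:05d}" + (" (VHS)" if random.random() < 0.02 else "")
        for i in range(10_000)]

# An equi-height histogram keeps bucket boundaries over sorted values;
# it can estimate "value BETWEEN lo AND hi" by counting boundaries
# that fall in the range, but says nothing about substring matches.
def histogram_range_estimate(values, lo, hi, buckets=100):
    s = sorted(values)
    per_bucket = len(s) / buckets
    bounds = [s[int(i * len(s) / buckets)] for i in range(buckets)]
    covered = sum(lo <= b <= hi for b in bounds)
    return covered * per_bucket

# Range predicate: the histogram lands close to the truth.
est_range = histogram_range_estimate(rows, "note 01000", "note 02000")
act_range = sum("note 01000" <= r <= "note 02000" for r in rows)

# Substring predicate: no histogram structure helps, so the
# estimator falls back to a fixed guess (assumed 3% here).
est_like = 0.03 * len(rows)
act_like = sum("VHS" in r for r in rows)

print(f"range estimate {est_range:.0f} vs actual {act_range}")
print(f"LIKE  estimate {est_like:.0f} vs actual {act_like}")
```

The range estimate is within a bucket or two of the truth; the LIKE estimate is off by whatever the data happens to be, because it never consulted the data at all.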
Query 5b as canonical illustration¶
From Databricks' join-ordering experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):
Query 5b of the Join Order
Benchmark is a 5-way join looking for "American production
companies that released a post-2010 movie on VHS with a note
referencing 1994." The VHS check and the 1994-reference check
are LIKE predicates on notes fields.
- Databricks default optimizer plan: First filter American VHS production companies — estimator says 12 rows (actually selective!). Then join outward.
- LLM-agent plan: First filter VHS releases referencing 1994, then join outward. "Significantly faster."
The default plan looks right by the numbers — 12 rows is very
selective. But the 1994-VHS predicate combination is even more
selective than the default estimator can predict, because both
predicates are LIKE patterns on notes fields. The agent's
execute-and-measure loop finds the superior plan because it
doesn't need to estimate — it observes.
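An execute-and-measure loop can be sketched in a few lines. This is a hypothetical stand-in, not the Databricks agent's actual mechanics (which the post does not detail); `run_plan` represents actually timing a plan, and the row counts are made up:

```python
import itertools

def pick_best_join_order(tables, run_plan):
    # Enumerate candidate join orders, execute (or simulate) each,
    # and keep the fastest -- no cardinality estimates involved.
    best_order, best_time = None, float("inf")
    for order in itertools.permutations(tables):
        elapsed = run_plan(order)
        if elapsed < best_time:
            best_order, best_time = order, elapsed
    return best_order, best_time

# Stand-in cost function: cost grows with the intermediate-result
# sizes a left-deep plan materializes (post-filter row counts invented).
POST_FILTER_ROWS = {"movie_info": 50, "company": 800, "title": 5000}

def fake_runtime(order):
    cost, acc = 0.0, 1
    for t in order:
        acc *= POST_FILTER_ROWS[t]
        cost += acc
    return cost

best, _ = pick_best_join_order(list(POST_FILTER_ROWS), fake_runtime)
print(best)  # starts with the genuinely most selective table
```

Because the loop measures rather than predicts, it finds the order that filters the LIKE-heavy table first even when every estimate about that table is wrong. The cost is exhaustive enumeration, which is why this works as offline tuning rather than per-query planning.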
From the post verbatim:
"This is because the query uses LIKE predicates to identify VHS releases, which are exceptionally difficult for cardinality estimators."
Why it matters for system design¶
- It's a structurally unfixable estimator bug. Adding histograms doesn't help. String-pattern selectivity requires fundamentally different statistics (e.g. sampled substring distributions, full-text indexes).
- It's a common-enough case to matter. Log-search, catalog notes, user comments, free-text metadata — the pattern appears wherever string columns hold semi-structured content.
- It's a canonical argument for
agent-based offline
tuning. The agent sidesteps cardinality estimation
entirely by measuring real runtimes, and the biggest
measured wins in the Databricks experiment are exactly on
LIKE-heavy queries.
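The "fundamentally different statistics" point can be made concrete with a toy n-gram sketch, loosely in the spirit of trigram-based selectivity estimation. Everything below — the functions, the sample, the bound — is invented for illustration, not any engine's implementation:

```python
from collections import Counter

def trigrams(s):
    # Distinct 3-character substrings of s.
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_trigram_stats(sample):
    # Count, per trigram, how many sampled rows contain it.
    counts = Counter()
    for row in sample:
        counts.update(trigrams(row))
    return counts, len(sample)

def estimate_like_selectivity(pattern, stats):
    # A row matching LIKE '%pattern%' must contain every trigram of
    # the pattern, so the rarest trigram's frequency is an upper bound.
    counts, n = stats
    grams = trigrams(pattern)
    if not grams:            # pattern shorter than a trigram
        return 1.0
    return min(counts.get(g, 0) for g in grams) / n

sample = ["VHS release", "DVD release", "VHS (1994)", "Blu-ray"]
stats = build_trigram_stats(sample)
print(estimate_like_selectivity("VHS", stats))  # 2 of 4 rows -> 0.5
```

Unlike a value histogram, this structure is built from substrings, which is exactly why it can say something about substring predicates — and why no amount of extra histogram resolution can.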
Related estimator failure modes¶
- Correlated predicates. state='CA' AND city='SF' is estimated under an independence assumption, multiplying selectivities that are anything but independent.
- Skewed distributions. Tail mass lost to histogram bucketing.
- Multi-join compounding. Estimation errors multiply down the plan tree.
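The first and third failure modes reduce to simple arithmetic. Illustrative numbers only (the selectivities are made up):

```python
# Independence assumption: the estimator multiplies per-predicate
# selectivities; for correlated columns the product is far off.
n = 1_000_000
sel_state, sel_city = 0.12, 0.01      # P(state='CA'), P(city='SF')
est = n * sel_state * sel_city        # independence: ~1,200 rows
act = n * sel_city                    # every 'SF' row is in 'CA': ~10,000

# Multi-join compounding: a per-join error factor e grows as e**k
# over k joins, so a modest 3x error becomes 243x after 5 joins.
e, k = 3, 5
compounded = e ** k

print(f"estimated {est:.0f}, actual {act:.0f}, "
      f"{e}x error over {k} joins -> {compounded}x")
```

An 8x underestimate from one correlated pair, fed through a few joins, is how a plan that "looks right by the numbers" ends up ordering its joins around the wrong table.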
Sibling concept¶
concepts/wildcard-prefix-non-sargable is the adjacent
problem: a leading-wildcard pattern (LIKE '%foo') can't use a
B-tree index at all, a different axis of failure. This page is
about cost-model failure, not index-access failure.
Seen in¶
- sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization — Canonical query 5b illustration. Explicitly named as "exceptionally difficult for cardinality estimators."
Related¶
- concepts/cardinality-estimation — the broader problem
- concepts/join-order-optimization — where this failure propagates
- concepts/query-planner — the containing system
- concepts/wildcard-prefix-non-sargable — sibling index-access failure with LIKE
- systems/join-order-benchmark-job — the benchmark where 5b lives
- systems/databricks-join-order-agent — prototype that sidesteps the failure mode