Cardinality estimation¶
Definition¶
Cardinality estimation is the query-optimizer task of predicting the number of rows a sub-plan will produce before executing it. The estimate feeds the cost model, which feeds plan selection: get the estimate wrong and the optimizer picks a plan that looks cheap but runs slowly.
Why it's the weakest link¶
From Databricks' join-ordering experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):
"Estimating this quantity [subquery size] is as difficult as executing the query itself (in general). … Cardinality estimation is especially difficult, and has led to a wide range of research seeking to improve estimation accuracy."
The key phrase is "as difficult as executing the query itself." Estimators trade accuracy for speed: they must deliver an answer within the optimizer's sub-second budget, so they rely on pre-computed statistics (histograms, distinct-value counts, join samples) that capture only approximate facts about the data.
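To make the "pre-computed statistics" concrete, here is a minimal sketch (not any engine's actual code) of an equi-width histogram answering a range-selectivity question from bucket counts alone, with no access to the underlying rows:

```python
class EquiWidthHistogram:
    """Toy equi-width histogram: the kind of pre-computed statistic an
    optimizer consults instead of scanning the table."""

    def __init__(self, values, buckets=10):
        lo, hi = min(values), max(values)
        width = (hi - lo) / buckets or 1.0  # guard against all-equal values
        self.edges = [lo + i * width for i in range(buckets + 1)]
        self.counts = [0] * buckets
        for v in values:
            i = min(int((v - lo) / width), buckets - 1)
            self.counts[i] += 1
        self.total = len(values)

    def selectivity_lt(self, x):
        """Estimated fraction of rows with value < x."""
        rows = 0.0
        for i, c in enumerate(self.counts):
            left, right = self.edges[i], self.edges[i + 1]
            if x >= right:   # bucket lies entirely below the cutoff
                rows += c
            elif x > left:   # partial bucket: assume uniform spread inside it
                rows += c * (x - left) / (right - left)
        return rows / self.total
```

The uniform-spread assumption inside each bucket is exactly where the approximation lives: the estimate is cheap and usually close, but it knows nothing about how values actually cluster within a bucket.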
Approaches research has explored¶
All cited by the Databricks post:
- Adaptive feedback. Re-estimate after seeing actual intermediate sizes during execution.
- Deep learning. Learn the distribution conditional on query features.
- Distribution modelling. Parametric families fit to data.
- Database theory. Bound-based approaches from combinatorial structure.
- Learning theory. PAC-style guarantees on estimation error.
- Factor decompositions. Matrix-style decompositions of joint distributions.
Adjacent work replaces estimation with search over plans: reinforcement learning over plans, multi-armed bandits for plan selection, Bayesian optimization over plans (see systems/bayesqo).
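Of the directions above, adaptive feedback is the easiest to sketch: cache observed cardinalities keyed by subexpression, and let measurements from past executions override the statistics-based guess next time. This is an illustrative toy, not any particular system's design:

```python
class FeedbackEstimator:
    """Adaptive-feedback sketch: observed intermediate sizes from past
    executions take precedence over the static, statistics-based estimate."""

    def __init__(self, static_estimate):
        self.static_estimate = static_estimate  # fallback: stats-based guess
        self.observed = {}                      # subexpression -> actual rows

    def estimate(self, subexpr):
        return self.observed.get(subexpr, self.static_estimate(subexpr))

    def record(self, subexpr, actual_rows):
        """Called after execution with the measured intermediate size."""
        self.observed[subexpr] = actual_rows
```

The limitation is visible in the sketch: feedback only corrects subexpressions the system has already executed, so novel query shapes still fall back to the static estimator.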
Canonical failure modes¶
- `LIKE` predicates. String-pattern selectivity is fundamentally hard for histogram-based estimators.
- Correlated predicates. `state='CA' AND city='SF'`: the naive independence assumption is wildly wrong.
- Multi-way joins. Errors compound exponentially down the plan tree.
- Skewed distributions. Histogram bucketing loses tail mass.
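The correlated-predicates failure is easy to reproduce on synthetic data. In this hypothetical dataset, the city is strongly determined by the state, so multiplying the two single-column selectivities (the independence assumption) badly underestimates the joint selectivity:

```python
import random

random.seed(0)
rows = []
for _ in range(10_000):
    state = random.choice(["CA", "NY", "TX", "WA"])
    # Correlation by construction: "SF" occurs only in CA rows.
    city = "SF" if state == "CA" and random.random() < 0.5 else "other"
    rows.append((state, city))

sel_state = sum(s == "CA" for s, _ in rows) / len(rows)   # ~0.25
sel_city = sum(c == "SF" for _, c in rows) / len(rows)    # ~0.125
actual = sum(s == "CA" and c == "SF" for s, c in rows) / len(rows)

naive = sel_state * sel_city  # independence assumption: ~0.031
# actual is ~0.125, roughly 4x the naive estimate, because
# P(city='SF' | state='CA') is ~0.5, not the marginal P(city='SF').
```

A 4x error on one predicate pair is mild; the multi-way-join bullet above is the same effect applied repeatedly, which is why errors compound down the plan tree.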
Why LLM-agent-as-optimizer sidesteps the problem¶
Databricks' experiment demonstrates that if you can execute candidate plans offline and measure actual runtimes, you can skip cardinality estimation entirely:
"When using a frontier model, the agent was able to improve query latency by a factor of 1.288 (geomean). This outperforms using perfect cardinality estimates (intractable in practice)."
The agent's execute-and-measure loop beats even an oracle with perfect cardinalities, which the cost model only ever approximates. The estimator layer becomes unnecessary when the search procedure can observe ground truth.
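The execute-and-measure idea can be sketched as follows. `run_plan` and the exhaustive candidate enumeration are hypothetical stand-ins for illustration, not the Databricks agent's actual interface or search strategy:

```python
import itertools
import time

def timed(fn, arg):
    """Wall-clock one execution of fn(arg)."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

def cheapest_join_order(tables, run_plan, trials=3):
    """Execute every candidate join order offline; pick the fastest
    by measured runtime, taking the best of `trials` runs to damp noise.
    No cardinality estimates are consulted anywhere."""
    best_order, best_time = None, float("inf")
    for order in itertools.permutations(tables):
        elapsed = min(timed(run_plan, order) for _ in range(trials))
        if elapsed < best_time:
            best_order, best_time = order, elapsed
    return best_order, best_time
```

The trade-off named below is visible here: this loop pays for real executions of every candidate, so it only makes sense offline, where latency can be traded for ground truth.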
This is the key insight underneath concepts/llm-agent-as-query-optimizer and concepts/offline-query-tuning-loop: trade latency for correctness, use direct measurement, accept that it can't ship in the optimizer hot path.
Seen in¶
- sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization — Names cardinality estimation as the hardest of three query-optimization components; canonicalises "as difficult as executing the query itself." Demonstrates that offline execute-and-measure beats even perfect estimates.
Related¶
- concepts/query-planner — estimation feeds plan selection
- concepts/join-order-optimization — most sensitive to estimation errors
- concepts/like-predicate-cardinality-estimation-failure — canonical hard case
- concepts/index-cardinality — adjacent statistic at the index level
- concepts/llm-agent-as-query-optimizer — sidesteps estimation by measuring directly