

Cardinality estimation

Definition

Cardinality estimation is the query-optimizer task of predicting the number of rows a sub-plan will produce before executing it. The estimate feeds the cost model, which feeds plan selection: get the estimate wrong and the optimizer picks a plan that looks cheap but runs slowly.

From Databricks' join-ordering experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):

"Estimating this quantity [subquery size] is as difficult as executing the query itself (in general). … Cardinality estimation is especially difficult, and has led to a wide range of research seeking to improve estimation accuracy."

The key phrase is "as difficult as executing the query itself." Estimators trade accuracy for speed: they must deliver an answer within the optimizer's sub-second budget, so they rely on pre-computed statistics (histograms, distinct-value counts, samples) that capture only approximate facts about the data.
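
To make the "approximate facts" concrete, here is a minimal equi-depth-histogram sketch (the function names and structure are my own illustration, not any engine's actual implementation):

```python
def build_equi_depth_histogram(values, num_buckets):
    """Boundary i is the largest value in bucket i; each bucket
    holds roughly len(values) / num_buckets rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i + 1) * n // num_buckets - 1] for i in range(num_buckets)]

def estimate_selectivity_le(boundaries, constant):
    """Estimate the fraction of rows with col <= constant by counting
    fully covered buckets; partial buckets are ignored for brevity."""
    covered = sum(1 for b in boundaries if b <= constant)
    return covered / len(boundaries)

# On uniform data the estimate is close; on skewed data the fixed
# boundaries lose information about the tail.
boundaries = build_equi_depth_histogram(list(range(1000)), 10)
rows_estimate = estimate_selectivity_le(boundaries, 499) * 1000  # ~500 rows
```

The whole column is compressed into ten numbers, which is exactly why the estimate is fast and exactly why it can be badly wrong.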

Approaches research has explored

All cited by the Databricks post:

  • Adaptive feedback. Re-estimate after seeing actual intermediate sizes during execution.
  • Deep learning. Learn the distribution conditional on query features.
  • Distribution modelling. Parametric families fit to data.
  • Database theory. Bound-based approaches from combinatorial structure.
  • Learning theory. PAC-style guarantees on estimation error.
  • Factor decompositions. Matrix-style decompositions of joint distributions.
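
As one concrete flavor, the adaptive-feedback idea can be caricatured in a few lines: scale future estimates by the error ratios observed on past executions. (The function name and the geometric-mean choice are mine, purely illustrative.)

```python
import math

def corrected_estimate(raw_estimate, feedback):
    """feedback: (estimated, actual) cardinality pairs from prior runs.
    Multiply the raw estimate by the geometric mean of the observed
    actual/estimated ratios."""
    if not feedback:
        return raw_estimate
    mean_log_ratio = sum(math.log(actual / est)
                         for est, actual in feedback) / len(feedback)
    return raw_estimate * math.exp(mean_log_ratio)
```

For example, if the estimator has consistently guessed half the true size, this doubles its next guess. Real adaptive systems condition the correction on query features rather than applying one global ratio.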

Adjacent work replaces estimation outright rather than improving it: reinforcement learning over plans, multi-armed bandits for plan selection, Bayesian optimization over plans (see systems/bayesqo).

Canonical failure modes

  • LIKE predicates. String-pattern selectivity is fundamentally hard for histogram-based estimators.
  • Correlated predicates. state='CA' AND city='SF': the naive independence assumption multiplies the individual selectivities and is wildly wrong when predicates co-occur.
  • Multi-way joins. Estimation errors compound exponentially with the number of joins, propagating up the plan tree.
  • Skewed distributions. Histogram bucketing loses tail mass.
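
The correlated-predicates failure is easy to demonstrate with a toy table (synthetic data, chosen so the two predicates are perfectly correlated):

```python
# 1,000 rows of (state, city); every SF row is also a CA row.
rows = [("CA", "SF")] * 200 + [("CA", "LA")] * 300 + [("NY", "NYC")] * 500

sel_state = sum(r[0] == "CA" for r in rows) / len(rows)   # 0.5
sel_city = sum(r[1] == "SF" for r in rows) / len(rows)    # 0.2
independence_estimate = sel_state * sel_city * len(rows)  # ~100 rows
actual = sum(r == ("CA", "SF") for r in rows)             # 200 rows
```

Independence predicts about 100 rows; the true answer is 200, a 2x error from just two predicates. Real queries stack many such predicates, and the errors multiply.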

Why LLM-agent-as-optimizer sidesteps the problem

Databricks' experiment demonstrates that if you can execute candidate plans offline and measure actual runtimes, you can skip cardinality estimation entirely:

"When using a frontier model, the agent was able to improve query latency by a factor of 1.288 (geomean). This outperforms using perfect cardinality estimates (intractable in practice)."

The agent's execute-and-measure loop beats even the perfect-cardinality oracle, because that oracle still flows through a cost model that only approximates runtime. The estimator layer becomes unnecessary when the search procedure can observe ground truth directly.
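
A minimal sketch of that execute-and-measure loop, assuming a hypothetical execute(plan) callback that runs a candidate plan to completion (this is not Databricks' code, just the shape of the idea):

```python
import time

def pick_best_plan(candidate_plans, execute):
    """Run every candidate offline and keep the fastest. No cardinality
    estimates anywhere: runtime is observed, not predicted."""
    best_plan, best_time = None, float("inf")
    for plan in candidate_plans:
        start = time.perf_counter()
        execute(plan)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan, best_time
```

The cost is obvious: every candidate is executed in full, which is why this loop lives offline rather than in the optimizer's per-query hot path.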

This is the key insight underneath concepts/llm-agent-as-query-optimizer and concepts/offline-query-tuning-loop: trade optimization-time latency for correctness, measure instead of estimate, and accept that the loop can't ship in the optimizer's hot path.
