CONCEPT Cited by 2 sources

Join order optimization

Definition

Join-order optimization is the query-optimizer subproblem of choosing the order in which a sequence of logical joins is executed. Because joins are commutative and associative, the same logical query — e.g. Actor ⋈ Stars ⋈ Movie ⋈ Produces ⋈ Company — admits many physically distinct execution plans. Which one is fastest depends entirely on intermediate-result sizes, which in turn depend on the data.

Why it's hard

Three compounding factors (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):

  1. Exponential search space. "The number of possible plans grows exponentially with the number of tables — and analytics queries regularly join 20-30 different tables." A 10-way linear join has 10! ≈ 3.6M orderings; bushy plans multiply this further.
  2. Cardinality estimation is nearly as hard as execution itself. "Estimating this quantity [subquery size] is as difficult as executing the query itself (in general)." Get the estimate wrong and the cost model picks a plan that looks cheap but isn't.
  3. Three fragile components must work together. Optimizers compose (a) a cardinality estimator, (b) a cost model comparing candidate plans, (c) a search procedure navigating the exponential space. Each can fail independently; errors compound.

The three components

Per the Databricks post:

"Traditional query optimizers solve this problem with three components: a cardinality estimator designed to quickly guess the size of subqueries (e.g., to guess how many movies Sony has produced), a cost model to compare different potential plans, and a search procedure that navigates the exponentially-large space."

This is the canonical three-leg decomposition. See concepts/query-planner for the broader compiler-analogy framing that contains join ordering as one optimization sub-phase.
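A minimal sketch of how the three components fit together — a Selinger-style subset DP scored by a C_out-style cost model. The cardinalities and selectivities are invented toy numbers, not IMDb statistics:

```python
from functools import cache
from itertools import combinations

# Toy base cardinalities and pairwise join selectivities (illustrative only).
CARD = {"Actor": 1_000_000, "Stars": 5_000_000, "Movie": 2_000_000}
SEL = {frozenset({"Actor", "Stars"}): 1e-6,
       frozenset({"Stars", "Movie"}): 5e-7}   # missing pair => cross product

def est_card(tables: frozenset) -> float:
    # (a) Cardinality estimator: independence assumption — product of
    # base cardinalities times every applicable join selectivity.
    size = 1.0
    for t in tables:
        size *= CARD[t]
    for pair in combinations(tables, 2):
        size *= SEL.get(frozenset(pair), 1.0)
    return size

@cache
def best_plan(tables: frozenset):
    # (c) Search procedure: dynamic programming over subsets, scored by
    # (b) a cost model summing estimated intermediate-result sizes.
    if len(tables) == 1:
        return 0.0, next(iter(tables))
    best_cost, best_tree = float("inf"), None
    for k in range(1, len(tables)):
        for left in map(frozenset, combinations(tables, k)):
            lc, lt = best_plan(left)
            rc, rt = best_plan(tables - left)
            cost = lc + rc + est_card(tables)
            if cost < best_cost:
                best_cost, best_tree = cost, (lt, rt)
    return best_cost, best_tree

cost, tree = best_plan(frozenset(CARD))
```

With these numbers the DP avoids starting from Actor ⋈ Movie (a 2-trillion-row cross product) and joins a selective pair first — but note the whole result is only as good as the estimates fed into `est_card`.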

Why it matters

Query 5b of the Join Order Benchmark — a simple 5-way join over IMDb — illustrates the stakes: the default Databricks optimizer picks one join order; an LLM agent finds a different one that is significantly faster because the default estimator underestimates LIKE-predicate selectivity (see concepts/like-predicate-cardinality-estimation-failure). Most production-query slowness on complex joins is an ordering problem, not an execution-engine problem.
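A hypothetical arithmetic sketch of how a selectivity underestimate propagates — the 0.1% and 40% figures are invented for illustration, not taken from the benchmark:

```python
# Invented numbers: an estimator assumes a LIKE '%...%' predicate keeps
# ~0.1% of Movie rows, when the pattern actually matches 40% of them.
movies = 2_000_000
est_sel, true_sel = 0.001, 0.40

est_rows = movies * est_sel    # plan-time belief: ~2,000 rows
true_rows = movies * true_sel  # runtime reality: ~800,000 rows

# The optimizer joins the "tiny" filtered Movie relation early; every
# downstream join then processes ~400x more rows than the plan assumed.
error_factor = true_rows / est_rows
print(round(error_factor))  # → 400
```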

Classical approaches

Decades of research: adaptive feedback loops, deep-learning estimators, distribution modelling, database-theory bounds, learning theory, factor decompositions. Alternative architectures replace the whole optimizer with reinforcement learning, multi-armed bandits, Bayesian optimization (see systems/bayesqo), or specialised join algorithms. "All of these solutions add significant complexity to a query optimizer, and require significant engineering effort to integrate, maintain, and productionize."

The LLM-agent alternative

The Databricks/UPenn experiment skips the estimator-and-cost-model scaffolding entirely: give a frontier LLM agent a single tool (execute this join order, return runtime + subplan sizes), a rollout budget, and structured-output constraints that admit only valid orderings. Measurement replaces estimation. See concepts/llm-agent-as-query-optimizer and patterns/llm-agent-offline-query-plan-tuner.
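The shape of that loop can be sketched with the LLM's proposal step replaced by a random swap. Everything here — the tool stub, the fabricated runtime function, the swap heuristic — is an illustrative stand-in, not the Databricks agent:

```python
import random

TABLES = ["Actor", "Stars", "Movie", "Produces", "Company"]

def execute(order):
    """Tool stub: 'run' the join order and return a runtime in seconds.
    A real deployment would execute the plan on the engine (and also
    return subplan sizes); here the runtime is a fabricated function."""
    return sum((pos + 1) * TABLES.index(t) for pos, t in enumerate(order)) / 10

def tune(budget: int, seed: int = 0):
    # Measurement replaces estimation: keep the best *measured*
    # ordering found within the rollout budget.
    rng = random.Random(seed)
    best = list(TABLES)
    best_time = execute(best)
    for _ in range(budget):
        cand = best[:]
        i, j = rng.sample(range(len(cand)), 2)   # propose: swap two tables
        cand[i], cand[j] = cand[j], cand[i]
        t = execute(cand)                        # measure, don't estimate
        if t < best_time:
            best, best_time = cand, t
    return best, best_time

order, runtime = tune(budget=50)
```

The rollout budget is the key knob: each iteration costs a real query execution, which is exactly why this only makes sense offline.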

Crucially this works only offline — LLM latency is too high for the query-optimizer hot path — but the offline niche is exactly where human DBA experts spend hours tuning bad plans today. See concepts/offline-query-tuning-loop.
