Databricks — Are LLM agents good at join order optimization?
Summary
Databricks (with UPenn collaborators) ran an experiment applying a frontier LLM agent to the decades-old join-ordering problem in relational query optimizers. Because single-shot LLM calls are too slow for the hot path of a query optimizer (sub-second budget), they instead built an offline agentic tuner: the agent gets one tool — execute a candidate join order and return its runtime plus subplan sizes — and a rollout budget of up to 50 per query (the evaluation used 15). Each tool call uses structured output constrained by a grammar that admits only valid join reorderings (patterns/structured-output-grammar-for-valid-plans).
Evaluated on all 113 queries of the Join Order Benchmark (JOB) over a 10× scaled IMDb dataset, with 15 rollouts per query, the agent improved workload latency by 1.288× (geomean) and dropped P90 latency by 41%, beating both perfect cardinality estimates (intractable in practice) and the prior BayesQO offline optimizer (designed for Postgres, not Databricks). The biggest wins were at the distribution tail, where cardinality estimators fail hardest — e.g. query 5b, a 5-way join with a LIKE predicate on VHS release notes that the default Databricks optimizer misjudges.
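The loop just described (one execute tool, a rollout budget, best-so-far tracking) can be sketched as follows. This is a toy illustration, not the post's actual implementation: `tune`, `execute`, and `TABLE_COST` are hypothetical names, an exhaustive permutation generator stands in for the LLM proposer, and a per-table cost table stands in for real Databricks execution.

```python
import itertools

# Hypothetical per-table work; a stand-in for real query execution.
TABLE_COST = {"a": 5.0, "b": 1.0, "c": 3.0}

def execute(order, timeout=float("inf")):
    """The agent's single tool: run a candidate join order and return
    its runtime (capped at `timeout`) plus EXPLAIN-style subplan sizes."""
    runtime = sum(TABLE_COST[t] * (i + 1) for i, t in enumerate(order))
    return min(runtime, timeout), [TABLE_COST[t] for t in order]

def tune(tables, budget=6):
    """Offline anytime tuning loop: keep the best plan seen so far;
    each later rollout is timed out at the current best runtime."""
    best_order = tuple(tables)
    best_runtime, _ = execute(best_order)  # the default optimizer's plan
    for order in itertools.islice(itertools.permutations(tables), budget):
        runtime, _subplan_sizes = execute(order, timeout=best_runtime)
        if runtime < best_runtime:  # monotone best-so-far (anytime property)
            best_order, best_runtime = order, runtime
    return best_order, best_runtime
```

With an LLM call in place of the exhaustive proposer, `budget` becomes the 15- or 50-rollout knob the post describes, and stopping early at any point still yields the best plan found so far.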
Key takeaways
- Join ordering is an exponentially-large search guided by three fragile components. Traditional optimizers assemble plans from a cardinality estimator (guesses subquery sizes), a cost model (compares candidate plans), and a search procedure (navigates the plan space). "The number of possible plans grows exponentially with the number of tables — and analytics queries regularly join 20-30 different tables." Cardinality estimation is the weakest link and has spawned decades of research across adaptive feedback, deep learning, distribution modelling, database theory, and learning theory. (Source: this post)
- LLMs can't go in the query optimizer hot path. "Query optimizers typically need to pick join orders in a few hundred milliseconds, so integrating an LLM into the hot path of the query optimizer, while potentially promising, is not possible today." This rules out the straightforward approach and motivates the offline tuning framing — mirroring how human DBA experts already debug bad join orders today (iterate, re-execute, inspect). See concepts/offline-query-tuning-loop.
- The agent has exactly one tool: execute a candidate join order. Each tool call returns the join-order runtime (capped by the original query's runtime as a timeout) and per-subplan sizes (equivalent to Databricks' EXPLAIN EXTENDED output). The agent is free to use the 50-rollout budget as it sees fit: pure exploitation (refine a promising plan), pure exploration (try different shapes), or a mix. See concepts/exploration-exploitation-tradeoff-in-agent-search and patterns/llm-agent-offline-query-plan-tuner. "The LLM gets to act like an offline experimenter that tries many candidate plans and learns from the observed outcomes — just like a human tuning a join order by hand!"
- Structured output enforces plan validity. "Each tool call generates a join ordering using structured model outputs, which forces the model's output to match a grammar we specify to only admit valid join reorderings." This is a canonical instance of the patterns/structured-output-grammar-for-valid-plans pattern — use the grammar/schema surface of structured decoding not just for parseability (concepts/structured-output-reliability) but to prune semantically invalid candidates before generation, converting an open-ended reasoning task into a constrained search over a known-legal space.
- Evaluation: JOB on 10× IMDb, 15 rollouts × 113 queries. "We used the Join Order Benchmark (JOB), a set of queries that were designed to be difficult to optimize. Since the dataset used by JOB, the IMDb dataset, is only around 2GB (and therefore Databricks could process even poor join orderings fairly quickly), we scaled up the dataset by duplicating each row 10 times." This is a deliberate methodological choice: too-small datasets make even bad plans fast enough to hide optimizer differences.
- Result: 1.288× geomean speedup; 41% P90 drop. The aggregate number understates the result — the real gains live in the tail of the distribution where the default optimizer's cardinality estimates are worst. "When using a frontier model, the agent was able to improve query latency by a factor of 1.288 (geomean). This outperforms using perfect cardinality estimates (intractable in practice), smaller models, and the recent BayesQO offline optimizer." Note the striking phrase "outperforms perfect cardinality estimates" — the agent's direct measurement + iteration loop beats an oracle that traditional cost-model-based optimizers merely approximate.
- Anytime-algorithm property. "Our agent progressively improves the workload with each tested plan (sometimes called a rollout), creating a simple anytime algorithm where larger time budgets can be translated into further query performance." See concepts/anytime-optimization-algorithm and patterns/rollout-budget-anytime-plan-search — turning an unbounded search into a budgeted one with monotonic improvement guarantees.
- Query 5b: the canonical LIKE-predicate failure case. The biggest single improvement found by the agent was on query 5b of JOB: "a simple 5-way join which looks for American production companies that released a post-2010 movie on VHS with a note referencing 1994. The Databricks optimizer focused first on finding American VHS production companies (which is indeed selective, producing only 12 rows). The agent finds a plan that first looks for VHS releases referencing 1994, which turns out to be significantly faster. This is because the query uses LIKE predicates to identify VHS releases, which are exceptionally difficult for cardinality estimators." This is the canonical illustration of concepts/like-predicate-cardinality-estimation-failure — string-pattern selectivity is a known blind spot for histogram-based estimators, and the agent's execute-and-measure loop sidesteps it entirely.
- Beats BayesQO (Bayesian optimization) on Databricks. "The recent BayesQO offline optimizer (although BayesQO was designed for PostgreSQL, not Databricks)" — the comparison is asymmetric (BayesQO wasn't tuned for the Databricks engine) but the result still frames Bayesian-optimization-over-plans as a weaker baseline than LLM-directed search for this class of problem. See systems/bayesqo and concepts/bayesian-optimization-over-parameter-space for contrast.
- Open questions the authors name. The post ends with three architectural questions that map directly onto productionization tensions: (a) What tools should the agent have? Beyond executing a plan, could the agent issue targeted cardinality queries or data-assumption probes (e.g. "are there any pre-1995 DVD releases?")? (b) When should tuning be triggered? User-flagged queries are obvious; can the optimizer itself detect "optimization potential" automatically? (c) Can the agent's wins teach the underlying optimizer? A better plan is a proof the default optimizer made a systematic error — harvest the errors and fix the estimator. This is a hint at the agent-as-training-signal loop that didn't ship in this post but is the obvious next move.
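The plan-space growth in the first takeaway is easy to make concrete. The standard counting result (not taken from the post) is that n tables admit n! left-deep join orders, multiplied by the number of binary tree shapes with n leaves, Catalan(n-1), when bushy plans are allowed:

```python
from math import comb, factorial

def catalan(n):
    """Catalan number C(n) = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def num_join_orders(n_tables, bushy=True):
    """Distinct join orders for n tables: n! leaf labelings, times the
    number of binary tree shapes (Catalan(n-1)) if bushy plans count;
    left-deep-only search keeps just the n! orderings."""
    shapes = catalan(n_tables - 1) if bushy else 1
    return factorial(n_tables) * shapes
```

At the 20-30 tables the post mentions, the count already exceeds 10^27 even before considering physical operator choices, which is why optimizers lean on heuristics and dynamic programming rather than enumeration.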
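The validity-by-construction idea behind the structured-output takeaway can be sketched as masked decoding: at each step the "grammar" restricts the legal vocabulary to tables not yet emitted, so whatever the model prefers, the output is always a valid reordering. This toy version uses a greedy scoring function in place of a real model and is not Databricks' actual grammar:

```python
def constrained_decode(tables, score):
    """Grammar-constrained decoding sketch: mask the vocabulary down to
    not-yet-emitted tables at every step, so any decoded sequence is a
    valid permutation (a legal join reordering) by construction."""
    remaining = set(tables)
    order = []
    while remaining:
        token = max(remaining, key=score)  # greedy pick among legal tokens only
        order.append(token)
        remaining.remove(token)
    return order
```

The same masking generalizes to richer grammars (e.g. admitting only bushy trees over the query's join graph); the point is that invalid candidates are pruned before generation rather than rejected after parsing.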
Systems extracted
| System | Role |
|---|---|
| systems/databricks-join-order-agent | Prototype LLM agent with single execute-plan tool, 50-rollout budget, grammar-constrained structured output |
| systems/join-order-benchmark-job | 113-query JOB benchmark over the IMDb dataset, scaled 10× for this experiment |
| systems/bayesqo | Prior-art Bayesian-optimization-based offline query optimizer for Postgres; the closest baseline |
| systems/databricks | Execution substrate; the agent operates on Databricks' query engine |
Concepts extracted
| Concept | Summary |
|---|---|
| concepts/join-order-optimization | Choose one of N! (or more, with bushy variants) join orderings to minimise total cost; exponentially large, dominated by cardinality estimation |
| concepts/cardinality-estimation | Guess subquery result sizes to compare candidate plans; the hardest subproblem in optimization |
| concepts/llm-agent-as-query-optimizer | Apply a frontier LLM as an offline iterative tuner over executable plan variants |
| concepts/anytime-optimization-algorithm | Iterative search with monotonic best-so-far; larger budgets → better plans |
| concepts/like-predicate-cardinality-estimation-failure | String-pattern predicates are a canonical blind spot for histogram-based estimators |
| concepts/exploration-exploitation-tradeoff-in-agent-search | Rollout budget allocation: try known-good variants vs. risky-but-informative alternatives |
| concepts/offline-query-tuning-loop | Long-horizon, out-of-hot-path query tuning mirroring human DBA workflow |
Patterns extracted
| Pattern | Summary |
|---|---|
| patterns/llm-agent-offline-query-plan-tuner | Agent with single execute-plan tool + rollout budget + best-of-N selection |
| patterns/structured-output-grammar-for-valid-plans | Grammar-constrained structured output prunes invalid candidates before generation |
| patterns/rollout-budget-anytime-plan-search | Bound search by rollout count; monotonic improvement gives a time-budget knob |
Operational numbers
| Metric | Value |
|---|---|
| Agent rollout budget (evaluation) | 15 per query |
| Agent rollout budget (max in prototype) | 50 per query |
| Benchmark | JOB (113 queries) |
| Dataset | IMDb, 10× row-duplication (base ~2 GB) |
| Geomean query-latency speedup | 1.288× |
| P90 query latency drop | 41% |
| Beats perfect cardinality estimates | ✅ |
| Beats smaller LLM models | ✅ |
| Beats BayesQO (note: designed for Postgres) | ✅ |
Caveats
- Research prototype, not production. This is an experimental result from a UPenn collaboration, not a shipping Databricks feature. The 50-rollout budget implies 50 executions of candidate plans per tuned query — viable offline, not online.
- Comparison asymmetry with BayesQO. BayesQO was built for Postgres; running it against Databricks isn't an apples-to-apples test. The post acknowledges this.
- Cost not disclosed. Frontier-model inference cost per tuned query (50 tool calls, each carrying plan context) is not reported — critical for deciding when the tuning pays off.
- Trigger policy unsolved. "When should this agentic optimization be triggered?" is an open question. The obvious answer (user-flagged slow queries) works for operators but doesn't give automatic optimization. Detecting "optimization potential" from telemetry is explicitly named as unsolved.
- Tool-surface minimalism. The agent has one tool. The authors themselves ask whether giving it cardinality-probe tools or data-assumption probes would help. Ingested as the single-tool baseline, not the final shape.
- Tier-3 source with borderline inclusion. Databricks' blog is Tier-3 and this is a research-result post (ML/LLM applied to DB internals), but the content is substantively about database-engine optimizer architecture — join ordering, cardinality estimation, cost models, plan search — which is core system-design material, not ML methodology. Ingested for the architectural framing, not the ML details.
Source
- Original: https://www.databricks.com/blog/are-llm-agents-good-join-order-optimization
- Raw markdown:
raw/databricks/2026-04-22-are-llm-agents-good-at-join-order-optimization-6037dcf3.md
Related
- companies/databricks
- systems/databricks
- concepts/query-planner — the broader optimizer architecture this post's agent augments
- concepts/structured-output-reliability — the sibling use of grammar-constrained decoding (for parseability); this post uses grammars for semantic validity of plans
- concepts/bayesian-optimization-over-parameter-space — BayesQO's foundational method, used here as a baseline
- patterns/agent-spawn-parallel-exploration — contrasting shape: here the agent runs sequentially within one budget, not parallel sub-agents