
Databricks Join Order Agent

What it is

Databricks Join Order Agent is a research prototype (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization) developed by Databricks in collaboration with UPenn that applies a frontier-model LLM agent to the join-order optimization problem on the Databricks query engine.

The agent is the canonical first wiki instance of LLM-agent-as-query-optimizer and demonstrates the offline query-plan tuner pattern.

Architecture

Minimal agent loop:

┌───────────────────────────────────────────────┐
│            Frontier LLM (agent)               │
│                                               │
│  system: grammar-constrained structured       │
│    output admitting only valid join orderings │
│  context: query, schema, prior-rollout        │
│    results (runtime + subplan sizes)          │
└───────────────────────────────────────────────┘
                     │ tool call
┌───────────────────────────────────────────────┐
│     execute_plan(candidate_join_order)        │
│  →  Databricks query engine executes the plan │
│     (timeout = original query's runtime)      │
│  →  returns (runtime_ms, per-subplan sizes)   │
└───────────────────────────────────────────────┘
              [loop N rollouts]
       return best plan observed
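The loop above can be sketched in a few lines of Python. `propose_join_order` stands in for the grammar-constrained LLM call and `execute_plan` for the engine tool; both names and signatures are assumptions for illustration, not the prototype's actual API.

```python
from typing import Callable

def tune_query(
    propose_join_order: Callable[[dict], list[str]],         # LLM agent call (hypothetical)
    execute_plan: Callable[[list[str]], tuple[float, dict]],  # engine tool (hypothetical)
    query_context: dict,
    n_rollouts: int = 15,
) -> tuple[list[str], float]:
    """Minimal rollout loop: propose, execute, feed results back, keep the best."""
    best_order, best_runtime = None, float("inf")
    history: list[dict] = []  # prior-rollout results fed back into the agent's context
    for _ in range(n_rollouts):
        candidate = propose_join_order({**query_context, "history": history})
        runtime_ms, subplan_sizes = execute_plan(candidate)
        history.append({"order": candidate, "runtime_ms": runtime_ms,
                        "subplan_sizes": subplan_sizes})
        if runtime_ms < best_runtime:
            best_order, best_runtime = candidate, runtime_ms
    return best_order, best_runtime
```

The agent's only feedback channel is `history`: each rollout's measured runtime and subplan sizes become context for the next proposal.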

Tool surface

One tool. execute_plan(candidate_order) runs the specified join order through Databricks' execution engine (with the original query's runtime as the timeout cap) and returns:

  • Runtime in milliseconds.
  • Per-subplan sizes, equivalent to Databricks' EXPLAIN EXTENDED output — the size of each intermediate join's result set.
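A sketch of the tool's contract, under assumptions: `run_engine` is a hypothetical stand-in for the Databricks execution engine, and the return shape is inferred from the description above rather than taken from the prototype.

```python
from dataclasses import dataclass

@dataclass
class PlanResult:
    timed_out: bool
    runtime_ms: float               # wall-clock runtime of the candidate plan
    subplan_sizes: dict[str, int]   # intermediate result-set size per join subplan

def execute_plan(candidate_order: list[str], baseline_runtime_ms: float,
                 run_engine) -> PlanResult:
    """Run a candidate join order, capped at the original query's runtime."""
    runtime_ms, sizes = run_engine(candidate_order, timeout_ms=baseline_runtime_ms)
    if runtime_ms >= baseline_runtime_ms:
        # Hit the cap: this candidate cannot beat the original plan, so there
        # is no reason to let it keep running.
        return PlanResult(True, baseline_runtime_ms, sizes)
    return PlanResult(False, runtime_ms, sizes)
```

The timeout cap is what keeps exploration cheap: a bad candidate costs at most one baseline runtime.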

Structured output constraint

Each tool call's input (the candidate join order) is generated via grammar-constrained structured output that admits only semantically-valid join reorderings. This eliminates the failure mode where the agent proposes a syntactically-valid but semantically-invalid plan (e.g. missing a table, duplicating one). See patterns/structured-output-grammar-for-valid-plans.
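The invariant the grammar enforces can be stated as a predicate. Note the difference in mechanism: the real system constrains decoding so invalid candidates are never generated, whereas this sketch merely checks a candidate after the fact.

```python
def is_valid_join_order(candidate: list[str], tables: set[str]) -> bool:
    """A candidate must use every table in the query exactly once —
    no omissions, no duplicates. The grammar bakes this into decoding,
    so it is generate-only-valid rather than generate-then-reject."""
    return len(candidate) == len(tables) and set(candidate) == tables
```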

Budget

Rollout count is the budget, not wall-clock:

  • 50 rollouts in the prototype (maximum the agent can use).
  • 15 rollouts in evaluation (the reported results use this tighter budget to amortise across 113 queries).

This makes the agent a concrete instance of patterns/rollout-budget-anytime-plan-search — larger budgets buy better plans monotonically.

Selection

Best-of-N. After exhausting the budget, return the plan with the lowest observed runtime. The agent can waste rollouts on exploratory variants; only the best is surfaced.
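Selection reduces to a single `min` over the recorded rollouts. A minimal sketch, assuming each rollout is logged as a dict like the loop's `history` entries:

```python
def best_of_n(rollouts: list[dict]) -> dict:
    """Return the rollout with the lowest observed runtime. Exploratory
    rollouts that performed badly cost budget but never hurt the answer,
    which is what makes the search anytime: any prefix of the budget
    yields a usable best-so-far plan."""
    return min(rollouts, key=lambda r: r["runtime_ms"])
```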

Evaluation

Variable                   Value
Benchmark                  systems/join-order-benchmark-job (JOB)
Query count                113
Dataset                    IMDb (~2 GB base), scaled 10× via row duplication
Rollouts per query         15
Model                      Frontier (unnamed in post)
Geomean workload speedup   1.288×
P90 query-latency drop     41%
Baseline 1                 Databricks default optimizer
Baseline 2                 Perfect cardinality estimates — beaten
Baseline 3                 Smaller LLMs — beaten
Baseline 4                 systems/bayesqo (Postgres-tuned) — beaten

The striking claim is "outperforms perfect cardinality estimates" — direct execute-and-measure beats the theoretical oracle that traditional cost-model-based optimizers merely approximate.
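For clarity on the headline number: a geometric mean is the standard aggregate for per-query speedup ratios, because one outsized win cannot dominate the workload-level figure the way an arithmetic mean would. A sketch with hypothetical per-query speedups (not the actual JOB numbers):

```python
import math

def geomean_speedup(per_query_speedups: list[float]) -> float:
    """Geometric mean of per-query speedup ratios."""
    return math.exp(sum(math.log(s) for s in per_query_speedups)
                    / len(per_query_speedups))

# Illustrative only: one 2x win across four queries gives a geomean of
# 2 ** 0.25 ≈ 1.19, versus an arithmetic mean of 1.25.
```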

Canonical win: query 5b

JOB's query 5b (5-way join looking for American VHS production companies with post-2010 movies referencing 1994) is the post's illustrated case study. The Databricks default optimizer first finds "American VHS production companies" (12 rows, selective by the estimator's numbers) then joins outward. The agent instead starts from VHS-releases-referencing-1994, which is even more selective — and the default estimator can't tell because the predicates are LIKE patterns (see concepts/like-predicate-cardinality-estimation-failure).

Open architectural questions

The post explicitly names three:

  1. Tool surface expansion. Should the agent have cardinality-probe or data-assumption tools beyond execute_plan?
  2. Automatic triggering. Move from user-flagged slow queries to proactive detection of "optimization potential" from telemetry.
  3. Agent-as-training-signal. Every better plan the agent finds is proof of a systematic estimator error. Harvest these corrections to improve the native optimizer.

Production status

Research prototype from the collaboration with UPenn; not a shipping Databricks feature. Cost per tuned query (frontier inference × N rollouts × plan context) is not disclosed.
