

BayesQO

What it is

BayesQO is an offline query optimizer that applies Bayesian optimization to the join-order search problem. Given a query and a fixed iteration budget, it proposes candidate join orders via an acquisition function over a surrogate model trained on previously observed (plan, runtime) pairs, aiming to find a better plan than the native optimizer's choice.

Originally built for PostgreSQL.

Architectural shape

1. Propose candidate plan (initial: random or optimizer's plan)
2. Execute plan → observe runtime
3. Update surrogate model with (plan, runtime)
4. Acquisition function selects next candidate balancing
   exploitation (refine known-good) vs exploration (try
   uncertain)
5. Repeat until budget exhausted
6. Return best plan observed
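The loop above can be sketched in pure Python. Everything here is illustrative: `plan_runtime` is a synthetic stand-in for actually executing a plan, and the surrogate is a kernel-weighted average with a distance-based uncertainty bonus, not BayesQO's real Gaussian-process / tree-based model — but the propose → execute → update → acquire cycle is the same shape.

```python
import itertools
import math
import random

def plan_runtime(order):
    # Stand-in for executing a join order and timing it
    # (assumption: a real system measures wall-clock runtime).
    target = (0, 2, 1, 3)  # synthetic "best" join order
    return 1.0 + sum(abs(a - b) for a, b in zip(order, target))

def distance(a, b):
    # Positions at which two join orders disagree.
    return sum(x != y for x, y in zip(a, b))

def surrogate(candidate, observed):
    # Kernel-weighted mean of observed runtimes, plus a
    # distance-to-nearest-observation uncertainty proxy: a crude
    # stand-in for a Gaussian-process posterior (mean, stddev).
    weights = [(math.exp(-distance(candidate, p)), r) for p, r in observed]
    total = sum(w for w, _ in weights)
    mean = sum(w * r for w, r in weights) / total
    nearest = min(distance(candidate, p) for p, _ in observed)
    return mean, nearest

def acquisition(candidate, observed, kappa=1.0):
    # Lower-confidence bound: low predicted runtime (exploitation)
    # minus an uncertainty bonus (exploration). We minimise this.
    mean, unc = surrogate(candidate, observed)
    return mean - kappa * unc

random.seed(0)
tables = (0, 1, 2, 3)
candidates = list(itertools.permutations(tables))

# Step 1: initial proposal (random, per the loop above).
first = random.choice(candidates)
observed = [(first, plan_runtime(first))]

# Steps 2-5: execute, update, acquire, repeat until budget exhausted.
for _ in range(10):  # fixed iteration budget
    tried = {p for p, _ in observed}
    pool = [c for c in candidates if c not in tried]
    nxt = min(pool, key=lambda c: acquisition(c, observed))
    observed.append((nxt, plan_runtime(nxt)))

# Step 6: return best plan observed.
best_plan, best_time = min(observed, key=lambda x: x[1])
print(best_plan, best_time)
```

Note the anytime property: `best_time` is non-increasing in the budget, because the answer is simply the minimum over everything observed so far.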

BayesQO shares the anytime optimizer shape with systems/databricks-join-order-agent — both converge monotonically as budget grows — but differs in the candidate-proposal mechanism:

| Axis | BayesQO | Databricks LLM agent |
| --- | --- | --- |
| Proposal mechanism | Gaussian-process / tree-based surrogate + acquisition function | Frontier LLM with grammar-constrained structured output |
| Domain knowledge | None (learned from rollouts only) | Learned priors from training corpus |
| Inspection of intermediate results | Scalar runtime only | Runtime + per-subplan sizes |
| Target engine | PostgreSQL | Databricks |
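The "grammar-constrained structured output" row enforces a simple invariant that can be illustrated with a post-hoc check — a hypothetical sketch, not the Databricks implementation, which constrains decoding itself: a proposal only counts as well-formed if it is a permutation of exactly the query's tables.

```python
def is_valid_join_order(proposal, tables):
    # A join order is well-formed iff it names each of the
    # query's tables exactly once (i.e. it is a permutation).
    return sorted(proposal) == sorted(tables)

# Hypothetical table names for illustration.
tables = ["movie", "cast_info", "name", "title"]
print(is_valid_join_order(["title", "movie", "name", "cast_info"], tables))  # True
print(is_valid_join_order(["title", "movie", "movie", "name"], tables))      # False (duplicate, missing table)
```

Grammar-constrained decoding guarantees this invariant by construction, which is why the bullet below can say correctness is "guaranteed only by the grammar and execution timeout" — semantic quality is still left to search.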

Why it appears on the wiki

BayesQO is the prior-art baseline that Databricks' LLM-agent experiment compares itself against (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization). The post's framing:

"This outperforms using perfect cardinality estimates (intractable in practice), smaller models, and the recent BayesQO offline optimizer (although BayesQO was designed for PostgreSQL, not Databricks)."

The parenthetical is important: BayesQO wasn't tuned for the Databricks engine, so the comparison is asymmetric. The result frames Bayesian-optimization-over-plans as a weaker baseline than LLM-directed search for this class of problem, at least on the Databricks execution engine and the JOB benchmark.

Reference

Project link from the source: https://rm.cab/bayesqo

Contrast with LLM-agent approach

The core architectural disagreement: what does "propose the next candidate" entail?

  • BayesQO: a scalar-objective statistical model with an acquisition function — formally principled, domain-agnostic, no transfer of knowledge from prior databases or plan literature.
  • LLM agent: a pattern-matcher against its training corpus — informally principled (correctness guaranteed only by the grammar and execution timeout), domain-aware, implicitly transfers knowledge.

The Databricks result is evidence that, at least for join-ordering on a modern query engine, the LLM's domain-knowledge advantage outweighs Bayesian optimization's statistical rigour. It is an instance of a broader pattern: where an LLM has pattern-matching coverage of a domain's solution space, agent search often beats statistical search.
