CONCEPT
Exploration-exploitation tradeoff in agent search¶
Definition¶
The exploration-exploitation tradeoff in LLM-agent search is the per-rollout decision: should the agent refine a known-promising candidate (exploit) or try a risky-but-informative alternative (explore)? With a fixed rollout budget, every rollout spent on exploitation is one not spent on exploration, and vice versa.
Why it's a first-class design concern¶
Classical optimization exposes the tradeoff via acquisition functions (Expected Improvement, Upper Confidence Bound in Bayesian optimization). LLM-agent search has no explicit acquisition function — the allocation is implicit in the agent's reasoning. This makes it harder to reason about and harder to tune, but also more adaptive when the agent has relevant domain knowledge from its training data.
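To make the contrast concrete, here is a minimal sketch of the classical explicit alternative: an Upper Confidence Bound score over candidate plan families, treated as bandit arms. The reward values and pull counts are made up for illustration.

```python
import math

def ucb_score(mean_reward: float, arm_pulls: int, total_pulls: int, c: float = 1.0) -> float:
    """UCB acquisition: an exploit term (observed mean) plus an explore
    bonus that shrinks as an arm accumulates samples."""
    if arm_pulls == 0:
        return float("inf")  # an untested arm is always tried first
    return mean_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)

# A well-tested plan family vs. a barely-tried one: the explore bonus
# can make the less-tested arm win the next rollout despite a lower mean.
tested = ucb_score(mean_reward=0.9, arm_pulls=40, total_pulls=50)
novel = ucb_score(mean_reward=0.7, arm_pulls=2, total_pulls=50)
```

The point of the sketch is what the LLM agent lacks: there is no such scoring rule anywhere in the loop, so the explore/exploit mix emerges from the prompt and the agent's reasoning instead.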
The Databricks join-order formulation¶
From the Databricks experiment (Source: sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization):
"We let the agent run for 50 iterations, allowing the agent to freely try out different join orders. The agent is free to use these 50 iterations to test out promising plans (exploitation), or to explore risky-but-informative alternatives (exploration). Afterwards, we collect the best performing join order tested by the agent, which becomes our final result."
Three load-bearing choices:
- Budget is fixed at the rollout level (50 prototype, 15 eval), not wall-clock. This isolates the search from model-latency noise and gives a clean anytime-algorithm knob. See concepts/anytime-optimization-algorithm.
- The agent controls the mix itself — no outer scheduler forces exploration quotas. The agent's prompt and its reading of prior rollout results decide the allocation.
- Best-of-N selection means a single brilliant exploration wins even if 49 rollouts were wasted. This tilts the optimal policy toward more exploration than classical regret-minimising bandits would recommend.
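The best-of-N selection rule from the last bullet can be sketched in a few lines. The join orders and runtimes below are hypothetical, purely to show why one winning exploration pays for many wasted rollouts.

```python
# Each rollout returns (join_order, measured_runtime_ms); lower is better.
# Best-of-N keeps only the single best plan ever tested, so the cost of
# failed explorations never shows up in the final result.
rollouts = [
    ("A⋈B⋈C⋈D", 420.0),  # exploitation: micro-variant of a known plan
    ("A⋈B⋈D⋈C", 410.0),
    ("D⋈C⋈A⋈B", 95.0),   # one risky exploration that happened to win
    ("A⋈C⋈B⋈D", 430.0),
]
best_plan, best_runtime = min(rollouts, key=lambda r: r[1])
```

Because the objective is a max (not cumulative reward), regret accumulated on bad rollouts is free, which is exactly why the optimal policy tilts toward more exploration than a regret-minimising bandit would choose.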
Why LLM agents may be good at this¶
Traditional optimization algorithms are forced into balance-by-heuristic because they have no domain knowledge. An LLM agent brings (a) learned priors about what plan shapes tend to work for what query shapes, and (b) the ability to read the subplan-size output from each rollout and hypothesise which part of the prior plan was wrong. In principle this converts the explore decision from "try something random" to "try something that pattern-matches to prior successful fixes."
Whether this actually helps over plain best-of-N sampling is the empirical question, and the Databricks result answers it favourably for join ordering: the agent with grammar-constrained structured output beats both smaller models and the classical BayesQO baseline.
Failure modes to watch¶
- Exploitation collapse. Agent latches onto one plan family and wastes rollouts on micro-variations.
- Exploration thrash. Agent cycles through unrelated plan shapes without converging.
- Context rot. Accumulated rollout history grows past the context window; earlier-discovered good plans get dropped.
Production systems typically wrap the agent with an outer loop enforcing minimum exploration (diversity) and maximum exploitation (prevent re-testing the same plan) — converting the implicit tradeoff into a partially-explicit one.
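A minimal sketch of such an outer loop, with hypothetical `agent_propose` and `evaluate` callables standing in for the LLM agent and the rollout harness:

```python
def run_search(agent_propose, evaluate, budget: int = 50, min_explore: int = 5):
    """Outer loop that makes the implicit tradeoff partially explicit:
    - dedup: a plan the agent already tested is never re-evaluated,
      capping exploitation collapse;
    - forced exploration: until `min_explore` distinct plans have been
      tried, the agent is asked for a novel plan.
    Both callables are assumptions, not part of the Databricks setup:
    agent_propose(history, force_novel) -> plan, evaluate(plan) -> cost."""
    history = {}  # plan -> measured cost (lower is better)
    for _ in range(budget):
        force_novel = len(history) < min_explore
        plan = agent_propose(history, force_novel)
        if plan in history:
            continue  # dedup: burn the turn rather than re-test
        history[plan] = evaluate(plan)
    return min(history, key=history.get)  # best-of-N selection
```

Note the selection step is unchanged: the wrapper only constrains *which* rollouts happen, not how the final plan is picked.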
Seen in¶
- sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization — Canonical first wiki instance. Explicit naming of exploration vs exploitation within rollout budget; best-of-N selection; 50 iterations prototype / 15 eval.
Related¶
- concepts/llm-agent-as-query-optimizer — the containing architecture
- concepts/anytime-optimization-algorithm — the broader algorithmic shape
- concepts/bayesian-optimization-over-parameter-space — the classical tradeoff-by-acquisition-function alternative
- patterns/rollout-budget-anytime-plan-search — the budget knob that structures the tradeoff
- patterns/llm-agent-offline-query-plan-tuner — full pattern with single tool, rollout budget, grammar-constrained output