
PATTERN

Tree Search over LLM Candidates

Problem

One-shot LLM code generation produces a single candidate and stops. For problems where the search space is combinatorial (kernel authoring, query plans, agent plans) and where correctness and performance are verifiable but only at non-trivial cost, a single shot is brittle: the first candidate often fails, and without structured search the agent has no mechanism to explore alternatives, learn from failures, or combine insights across attempts.

Shape

Frame the LLM as a candidate generator inside a tree-search engine (a minimal loop sketch follows this list):

  • Each candidate = tree node. The LLM emits the node's content (kernel source, query plan, agent-action trace, etc.).
  • Search algorithms: Monte Carlo tree search (MCTS) + evolutionary strategies. MCTS balances exploitation of known-good branches against exploration of novel directions; evolutionary operators (mutation, crossover) diversify the frontier.
  • Evaluation harness grades each node — correctness, performance, structured diagnostic metrics (see evaluation harness in agent loop).
  • Node expansion policy uses the grading to decide which nodes to expand next and what transformation to apply.
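
A minimal sketch of this loop, with stub generate_candidate and evaluate functions standing in for the LLM synthesizer and the evaluation harness (both names are hypothetical, and the UCT constant is a common default, not a value from the source):

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    content: str                      # e.g. kernel source emitted by the LLM
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)
    score: float = 0.0                # cumulative reward from the harness
    visits: int = 0

def generate_candidate(context: str) -> str:
    # Stand-in for the LLM synthesizer (hypothetical).
    return f"variant of [{context[:30]}] seed={random.random():.3f}"

def evaluate(content: str) -> float:
    # Stand-in for the correctness + performance harness (hypothetical).
    return random.random()

def uct(node: Node, c: float = 1.4) -> float:
    # Upper confidence bound: exploit high mean reward, explore rare visits.
    if node.visits == 0:
        return float("inf")
    return (node.score / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def search(root: Node, budget: int) -> Node:
    # One iteration per unit of budget: select, expand, evaluate, backpropagate.
    # Evolutionary operators (mutation, crossover) would slot into expansion.
    best_node, best_reward = root, float("-inf")
    for _ in range(budget):
        node = root
        while node.children:                      # selection: descend by UCT
            node = max(node.children, key=uct)
        child = Node(content=generate_candidate(node.content), parent=node)
        node.children.append(child)               # expansion: new LLM candidate
        reward = evaluate(child.content)          # evaluation: harness grading
        cur = child
        while cur is not None:                    # backpropagation to the root
            cur.visits += 1
            cur.score += reward
            cur = cur.parent
        if reward > best_reward:
            best_node, best_reward = child, reward
    return best_node
```

Calling search(Node(content="naive baseline kernel"), budget=200) returns the best-graded candidate found within the budget.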

The configurable memory operator primitive is the key extension over naive MCTS. Each node carries a memory mode determining how it draws context from the search tree when generating the next round of children:

  • Inherit parent's trajectory: refine a promising direction
  • Compare against siblings: learn what differentiates high-performing variants
  • Combine parent + siblings: merge insights
  • Clean-slate restart: escape local optima; inject diversity

This takes the search "beyond simple independent sampling" — sibling nodes collaborate by surfacing complementary strategies, parent-child chains preserve successful paths, and clean-slate restarts unstick the search.
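
As a sketch, the four modes reduce to different ways of assembling prompt context from the tree, reusing the Node class from the loop sketch above (the top-3 sibling cutoff and the join format are assumptions, not KernelEvolve's):

```python
def build_context(node: Node, mode: str) -> str:
    # Configurable memory operator: how a node draws context from the search
    # tree before the LLM generates its children. Modes mirror the list above.
    if mode == "inherit":
        # Refine a promising direction: replay the root-to-node trajectory.
        chain, cur = [], node
        while cur is not None:
            chain.append(cur.content)
            cur = cur.parent
        return "\n".join(reversed(chain))
    if mode == "siblings":
        # Learn what differentiates high-performing variants.
        sibs = [c for c in node.parent.children if c is not node] if node.parent else []
        sibs.sort(key=lambda s: s.score / max(s.visits, 1), reverse=True)
        return "\n".join(s.content for s in sibs[:3])  # top-3 is an assumption
    if mode == "combine":
        # Merge insights: parent trajectory plus best siblings.
        return build_context(node, "inherit") + "\n" + build_context(node, "siblings")
    if mode == "clean_slate":
        # Escape local optima: no tree context at all.
        return ""
    raise ValueError(f"unknown memory mode: {mode}")
```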

Canonical instance — Meta KernelEvolve (2026-04-02)

KernelEvolve's Tree Search Engine is the canonical wiki instance. Integration with the rest of the agent:

  • LLM Synthesizer produces candidate kernels in Triton / CUDA / HIP / MTIA C++ / etc.
  • Tree Search Engine wraps the synthesizer with MCTS + evolutionary expansion + configurable memory operator per node.
  • Retrieval-Augmented Knowledge Base feeds hardware docs + distilled skills into each node's prompt.
  • Evaluation Harness produces structured diagnostic signal (memory-bound / compute-bound / occupancy-limited).
  • Structured diagnostic signal flows back to the Tree Search Engine to guide the next expansion (see the routing sketch after this list).
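
A hypothetical sketch of that feedback routing: the three diagnostic labels come from the harness description above, while the transformation hints and prompt wording are illustrative assumptions.

```python
# Route structured diagnostics into the next expansion prompt.
# Hint text is illustrative, not from the post.
TRANSFORM_HINTS = {
    "memory-bound": "improve coalescing, tiling, and shared-memory reuse",
    "compute-bound": "cut redundant instructions or use tensor-core paths",
    "occupancy-limited": "reduce register pressure and shared-memory footprint",
}

def next_expansion_prompt(kernel_src: str, diagnostic: str) -> str:
    hint = TRANSFORM_HINTS.get(diagnostic, "try a structurally different approach")
    return f"The kernel below is {diagnostic}. Rewrite it to {hint}.\n\n{kernel_src}"
```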

Meta's production results: hundreds of candidates per optimization session; a >60% improvement in Andromeda inference throughput over the torch.compile + vendor-library baseline; a 100% pass rate on KernelBench; and kernels serving "trillions of daily inference requests".

Distinct from

  • One-shot LLM code generation — no search, no iteration, no structured feedback.
  • Compiler autotuning (TVM, Halide, AutoTVM) — picks parameters in a fixed template space; doesn't synthesize new source code.
  • Flat beam search over LLM outputs — has multiple candidates but no tree structure, no memory operator, no iteration with evaluation feedback.

The configurable memory operator is the specific wiki primitive distinguishing this pattern from naive MCTS-over-LLM setups.

Consequences

Positive:

  • Candidate diversity with principled convergence — exploration vs exploitation is the balance MCTS is built to manage.
  • Parallelizable — candidates can be evaluated in parallel; Meta runs "hundreds of candidates in parallel" on its distributed evaluation infrastructure (a minimal pool sketch follows this list).
  • Structured-feedback compatible — every candidate receives a multi-dimensional grading (not just a scalar loss) that feeds both the search engine and the LLM synthesizer's next prompt.
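
A minimal sketch of batch evaluation with a local process pool, reusing the stub evaluate function from the loop sketch; Meta's distributed, fault-tolerant harness is far more involved:

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_batch(candidates: list[str], workers: int = 8) -> list[float]:
    # Grade candidates concurrently; each evaluation (build + benchmark)
    # is independent of the others, so the pattern parallelizes naturally.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, candidates))
```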

Negative / care required:

  • Evaluation harness cost — multi-minute build cycles + infrastructure failures at scale. Meta built a "purpose-built long-running job harness" to absorb this; naive implementations will struggle.
  • Search budget management — kernel-optimization sessions are bounded by time + compute. Termination criteria (target hit / budget exhausted / progress stalled) need explicit tuning (see the stopping-rule sketch after this list).
  • Memory operator choice per node — if all nodes default to "inherit parent", the search collapses to depth-first refinement. If all default to "clean-slate", it degenerates to independent sampling. Meta gives no explicit tuning guidance in the post.
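
One possible stopping rule covering the three criteria above, assuming a running list of best-so-far scores per iteration; the patience value is illustrative, not from the source:

```python
def should_stop(best_so_far: list[float], target: float,
                budget: int, patience: int = 20) -> bool:
    # best_so_far[i] is the best score seen through iteration i (nondecreasing).
    if best_so_far and best_so_far[-1] >= target:          # target hit
        return True
    if len(best_so_far) >= budget:                         # budget exhausted
        return True
    if (len(best_so_far) > patience
            and best_so_far[-1] <= best_so_far[-patience - 1]):  # progress stalled
        return True
    return False
```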

Seen in

  • Meta KernelEvolve (canonical instance above)