Agentic Kernel Synthesis

Definition

Agentic kernel synthesis is the system-level framing that couples four primitives into a production-grade pipeline for authoring GPU/accelerator kernels without hand-tuning (a minimal code sketch of the loop follows the list):

  1. An LLM synthesizer that emits candidate kernel source code across multiple DSLs + low-level languages (Triton, CUDA, HIP, MTIA C++, etc.).
  2. A structured search engine (tree search over LLM candidates) that explores the space of candidates with MCTS + evolutionary strategies.
  3. A retrieval-augmented knowledge base (RAG over hardware documentation + dynamic skill library with in-context RL) that injects proprietary + standard hardware context into generation prompts.
  4. A structured evaluation harness (evaluation harness in agent loop) that feeds back why a candidate is slow (memory-bound vs compute-bound vs occupancy-limited), not just scalar wall-clock time.
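
To make the coupling concrete, here is a minimal sketch of the loop in Python. Every name in it (Candidate, retrieve_context, synthesize, profile, optimize) is a hypothetical stand-in for the primitive noted in the comments, not KernelEvolve's actual API, and a greedy loop stands in for the real MCTS + evolutionary search:

```python
# Minimal sketch of the four-primitive loop; all helpers are hypothetical stubs.

from dataclasses import dataclass, field

@dataclass
class Candidate:
    source: str                                     # emitted kernel source
    diagnosis: dict = field(default_factory=dict)   # structured feedback

def retrieve_context(op_spec: str) -> str:
    # Primitive 3: RAG over hardware docs + the skill library (stubbed).
    return f"hardware notes and prior skills for {op_spec}"

def synthesize(op_spec: str, context: str) -> Candidate:
    # Primitive 1: the LLM synthesizer emitting candidate source (stubbed).
    return Candidate(source=f"// candidate kernel for {op_spec}")

def profile(cand: Candidate) -> dict:
    # Primitive 4: the harness reports *why* a candidate is slow, not just time.
    return {"correct": True, "latency_ms": 1.0, "bottleneck": "memory-bound"}

def optimize(op_spec: str, budget: int = 8) -> Candidate:
    # Primitive 2: structured search; a greedy loop stands in for
    # MCTS + evolutionary search to keep the sketch short.
    context = retrieve_context(op_spec)
    best = None
    for _ in range(budget):
        cand = synthesize(op_spec, context)
        cand.diagnosis = profile(cand)
        if cand.diagnosis["correct"] and (
            best is None
            or cand.diagnosis["latency_ms"] < best.diagnosis["latency_ms"]
        ):
            best = cand
        # The diagnosis itself (bottleneck class, not a bare scalar)
        # conditions the next generation prompt.
        context += f"\nprevious attempt was {cand.diagnosis['bottleneck']}"
    return best
```

The point of the sketch is the data flow: the structured diagnosis, not a bare latency number, feeds the next round of generation.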

The combination is what distinguishes "agentic" kernel synthesis from two adjacent approaches: (a) one-shot LLM code generation, which emits one kernel and stops, and (b) compiler autotuning (TVM, Halide, AutoTVM), which searches a predefined parameter space over a fixed kernel template but doesn't synthesize new source code. Agentic kernel synthesis produces new source code and iterates on it with structured feedback (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure).

Canonical instance — Meta KernelEvolve (2026-04-02)

KernelEvolve is the canonical wiki instance. Six components (the exact framing from Meta's post):

  1. LLM Synthesizer — emits Triton / TLX / CuTe DSL / FlyDSL / CUDA / HIP / MTIA C++; dynamic context-aware prompts.
  2. Tree Search Engine — MCTS + evolutionary; configurable memory operator per node (inherit / compare-siblings / combine / clean-slate; sketched in code after this list).
  3. Retrieval-Augmented Knowledge Base — three categories (correctness / platform-agnostic / hardware-specific) plus self-evolving skill library.
  4. Automated Evaluation Framework — TritonBench + PyTorch Profiler + NCU + Proton + MTIA Insight; structured diagnostic output via compiler-centric job graphs with MLIR-level instrumentation.
  5. Shared Data Foundation — every session contributes to a compounding store; early adopters do the hard exploration, subsequent users inherit.
  6. Agentic Reinforcement Learning — optimization trajectories post-train smaller specialized models with kernel-performance reward (agentic RL from production signal).
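
Component 2's memory operator determines what context a newly expanded tree node gets to see. A minimal sketch, assuming a hypothetical Node shape: the four mode names come from Meta's framing, but the context-assembly logic is an illustrative guess, not the real implementation:

```python
# Sketch of the per-node memory operator; Node and memory_context are
# hypothetical illustrations of the four modes named in component 2.

from dataclasses import dataclass, field

@dataclass
class Node:
    source: str
    diagnosis: str
    children: list["Node"] = field(default_factory=list)

def memory_context(parent: Node | None, siblings: list[Node], operator: str) -> str:
    # Decides what a new child node sees when it is expanded.
    if operator == "inherit":
        # Child sees the parent's code plus its diagnosis.
        return f"{parent.source}\n# diagnosis: {parent.diagnosis}"
    if operator == "compare-siblings":
        # Child sees what its siblings tried and how each one fared.
        return "\n".join(
            f"{s.source}\n# diagnosis: {s.diagnosis}" for s in siblings
        )
    if operator == "combine":
        # Child merges the parent's code with the siblings' lessons.
        return (memory_context(parent, [], "inherit") + "\n"
                + memory_context(None, siblings, "compare-siblings"))
    if operator == "clean-slate":
        # Child starts fresh from the task spec alone.
        return ""
    raise ValueError(f"unknown memory operator: {operator}")
```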

Output quality

Meta's production numbers from the 2026-04-02 post:

  • >60% inference throughput improvement on Andromeda (NVIDIA GPUs) over a torch.compile + vendor libraries baseline — the baseline matters: KernelEvolve is not beating naive code, it's beating aggressively optimized code.
  • >25% training throughput improvement on an ads model on MTIA silicon.
  • 100% pass rate on KernelBench (Stanford's 250-problem suite).
  • 480 configurations validated (160 PyTorch ATen operators × 3 hardware platforms, 100% correctness).
  • "Trillions of daily inference requests" worth of production kernels generated.

Why agentic beats one-shot

One-shot LLM code generation fails at the production bar because:

  • No search — the synthesizer has no mechanism to explore alternatives when the first candidate is wrong or slow.
  • No structured feedback — the synthesizer doesn't know why a candidate is bad; only that it failed some binary test.
  • No long-run memory — each session starts from scratch.

Agentic kernel synthesis addresses all three: tree search → structured feedback → persistent skill library (a toy sketch of such a library follows). The key qualitative claim Meta makes is that weeks of expert effort collapse to hours of automated search, because the search + feedback + memory loop does what the human expert would do, but in parallel and without cognitive fatigue.
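
A toy version of the persistence piece, assuming a flat JSON file as the store (KernelEvolve's shared data foundation is certainly more elaborate): the best known kernel per (operator, hardware) pair survives across sessions, so later sessions start from prior winners instead of from scratch.

```python
# Toy persistent skill library; the JSON layout is an illustrative
# assumption, not KernelEvolve's shared data foundation.

import json
from pathlib import Path

class SkillLibrary:
    def __init__(self, path: str = "skills.json"):
        self.path = Path(path)
        self.skills = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def record(self, op: str, hw: str, source: str, latency_ms: float) -> None:
        # Keep only the fastest kernel seen so far for this (op, hw) pair.
        key = f"{op}@{hw}"
        best = self.skills.get(key)
        if best is None or latency_ms < best["latency_ms"]:
            self.skills[key] = {"source": source, "latency_ms": latency_ms}
            self.path.write_text(json.dumps(self.skills, indent=2))

    def lookup(self, op: str, hw: str) -> dict | None:
        # Later sessions retrieve prior winners as generation context.
        return self.skills.get(f"{op}@{hw}")
```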

Contrast with compiler autotuning

Compiler autotuners (TVM AutoScheduler, Triton's auto-tuner, PyTorch max-autotune) search a predefined parameter space over a fixed kernel template. They cannot:

  • Author new kernels outside the template space.
  • Use hardware documentation to synthesize kernels for silicon they've never been trained on.
  • Learn across sessions via a skill library.

Agentic kernel synthesis does all three. The two paradigms are complementary; compiler autotuners will remain useful for parameter-picking within KernelEvolve-generated kernels. The toy sketch below shows why a fixed-template search cannot escape its parameter space.
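
A toy contrast, with a hypothetical template and timing stub: the autotuner below can only enumerate a finite parameter grid over one fixed source skeleton, which is exactly the limitation the list above describes.

```python
# Fixed-template autotuning in miniature; the template and the timing stub
# are hypothetical placeholders, not a real autotuner.

import itertools

TEMPLATE = """\
// fixed matmul template; only the tile constants vary
#define BLOCK_M {bm}
#define BLOCK_N {bn}
#define NUM_WARPS {warps}
"""

def measure_latency_ms(source: str) -> float:
    # Stub standing in for compile-and-benchmark on real hardware.
    return float(abs(hash(source)) % 100) / 10.0

def autotune() -> str:
    # Exhaustive search over 2 x 3 x 2 = 12 fixed-template configurations;
    # no new source structure can emerge from this loop.
    grid = itertools.product([64, 128], [64, 128, 256], [4, 8])
    candidates = [TEMPLATE.format(bm=bm, bn=bn, warps=w) for bm, bn, w in grid]
    return min(candidates, key=measure_latency_ms)
```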
