Agentic Kernel Synthesis

Definition

Agentic kernel synthesis is the system-level framing that couples four primitives into a production-grade pipeline for authoring GPU/accelerator kernels without hand-tuning (a minimal code sketch of the loop follows the list):

  1. An LLM synthesizer that emits candidate kernel source code across multiple DSLs + low-level languages (Triton, CUDA, HIP, MTIA C++, etc.).
  2. A structured search engine (tree search over LLM candidates) that explores the space of candidates with MCTS + evolutionary strategies.
  3. A retrieval-augmented knowledge base (RAG over hardware documentation + dynamic skill library with in-context RL) that injects proprietary + standard hardware context into generation prompts.
  4. A structured evaluation harness (evaluation harness in agent loop) that feeds back why a candidate is slow (memory-bound vs compute-bound vs occupancy-limited), not just scalar wall-clock time.
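
To make the coupling concrete, here is a minimal sketch of the loop in Python. Every name in it (Candidate, retrieve_context, synthesize, profile, optimize) is a hypothetical stand-in for the primitive noted in the comments, not KernelEvolve's actual API, and a greedy loop stands in for the real MCTS + evolutionary search:

```python
# Minimal sketch of the four-primitive loop; all helpers are hypothetical stubs.

from dataclasses import dataclass, field

@dataclass
class Candidate:
    source: str                                     # emitted kernel source
    diagnosis: dict = field(default_factory=dict)   # structured feedback

def retrieve_context(op_spec: str) -> str:
    # Primitive 3: RAG over hardware docs + the skill library (stubbed).
    return f"hardware notes and prior skills for {op_spec}"

def synthesize(op_spec: str, context: str) -> Candidate:
    # Primitive 1: the LLM synthesizer emitting candidate source (stubbed).
    return Candidate(source=f"// candidate kernel for {op_spec}")

def profile(cand: Candidate) -> dict:
    # Primitive 4: the harness reports *why* a candidate is slow, not just time.
    return {"correct": True, "latency_ms": 1.0, "bottleneck": "memory-bound"}

def optimize(op_spec: str, budget: int = 8) -> Candidate:
    # Primitive 2: structured search; a greedy loop stands in for
    # MCTS + evolutionary search to keep the sketch short.
    context = retrieve_context(op_spec)
    best = None
    for _ in range(budget):
        cand = synthesize(op_spec, context)
        cand.diagnosis = profile(cand)
        if cand.diagnosis["correct"] and (
            best is None
            or cand.diagnosis["latency_ms"] < best.diagnosis["latency_ms"]
        ):
            best = cand
        # The diagnosis itself (bottleneck class, not a bare scalar)
        # conditions the next generation prompt.
        context += f"\nprevious attempt was {cand.diagnosis['bottleneck']}"
    return best
```

The point of the sketch is the data flow: the structured diagnosis, not a bare latency number, feeds the next round of generation.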

The combination is what distinguishes "agentic" kernel synthesis from two adjacent approaches: (a) one-shot LLM code generation, which emits one kernel and stops, and (b) compiler autotuning (TVM, Halide, AutoTVM), which searches a predefined parameter space over a fixed kernel template but doesn't synthesize new source code. Agentic kernel synthesis produces new source code and iterates on it with structured feedback (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure).

Canonical instance — Meta KernelEvolve (2026-04-02)

KernelEvolve is the canonical wiki instance. Six components (the exact framing from Meta's post):

  1. LLM Synthesizer — emits Triton / TLX / CuTe DSL / FlyDSL / CUDA / HIP / MTIA C++; dynamic context-aware prompts.
  2. Tree Search Engine — MCTS + evolutionary; configurable memory operator per node (inherit / compare-siblings / combine / clean-slate; sketched in code after this list).
  3. Retrieval-Augmented Knowledge Base — three categories (correctness / platform-agnostic / hardware-specific) plus self-evolving skill library.
  4. Automated Evaluation Framework — TritonBench + PyTorch Profiler + NCU + Proton + MTIA Insight; structured diagnostic output via compiler-centric job graphs with MLIR-level instrumentation.
  5. Shared Data Foundation — every session contributes to a compounding store; early adopters do the hard exploration, subsequent users inherit.
  6. Agentic Reinforcement Learning — optimization trajectories post-train smaller specialized models with kernel-performance reward (agentic RL from production signal).
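
Component 2's memory operator determines what context a newly expanded tree node gets to see. A minimal sketch, assuming a hypothetical Node shape: the four mode names come from Meta's framing, but the context-assembly logic is an illustrative guess, not the real implementation:

```python
# Sketch of the per-node memory operator; Node and memory_context are
# hypothetical illustrations of the four modes named in component 2.

from dataclasses import dataclass, field

@dataclass
class Node:
    source: str
    diagnosis: str
    children: list["Node"] = field(default_factory=list)

def memory_context(parent: Node | None, siblings: list[Node], operator: str) -> str:
    # Decides what a new child node sees when it is expanded.
    if operator == "inherit":
        # Child sees the parent's code plus its diagnosis.
        return f"{parent.source}\n# diagnosis: {parent.diagnosis}"
    if operator == "compare-siblings":
        # Child sees what its siblings tried and how each one fared.
        return "\n".join(
            f"{s.source}\n# diagnosis: {s.diagnosis}" for s in siblings
        )
    if operator == "combine":
        # Child merges the parent's code with the siblings' lessons.
        return (memory_context(parent, [], "inherit") + "\n"
                + memory_context(None, siblings, "compare-siblings"))
    if operator == "clean-slate":
        # Child starts fresh from the task spec alone.
        return ""
    raise ValueError(f"unknown memory operator: {operator}")
```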

Output quality

Meta's production numbers from the 2026-04-02 post:

  • >60% inference throughput improvement on Andromeda (NVIDIA GPUs) over a torch.compile + vendor libraries baseline — the baseline matters: KernelEvolve is not beating naive code, it's beating aggressively optimized code.
  • >25% training throughput improvement on an ads model on MTIA silicon.
  • 100% pass rate on KernelBench (Stanford's 250-problem suite).
  • 480 configurations validated (160 PyTorch ATen operators × 3 hardware platforms, 100% correctness).
  • "Trillions of daily inference requests" worth of production kernels generated.

Why agentic beats one-shot

One-shot LLM code generation fails at the production bar because:

  • No search — the synthesizer has no mechanism to explore alternatives when the first candidate is wrong or slow.
  • No structured feedback — the synthesizer doesn't know why a candidate is bad; only that it failed some binary test.
  • No long-run memory — each session starts from scratch.

Agentic kernel synthesis addresses all three: tree search → structured feedback → persistent skill library (a toy sketch of such a library follows). The key qualitative claim Meta makes is that weeks of expert effort collapse to hours of automated search, because the search + feedback + memory loop does what the human expert would do, but in parallel and without cognitive fatigue.
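
A toy version of the persistence piece, assuming a flat JSON file as the store (KernelEvolve's shared data foundation is certainly more elaborate): the best known kernel per (operator, hardware) pair survives across sessions, so later sessions start from prior winners instead of from scratch.

```python
# Toy persistent skill library; the JSON layout is an illustrative
# assumption, not KernelEvolve's shared data foundation.

import json
from pathlib import Path

class SkillLibrary:
    def __init__(self, path: str = "skills.json"):
        self.path = Path(path)
        self.skills = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def record(self, op: str, hw: str, source: str, latency_ms: float) -> None:
        # Keep only the fastest kernel seen so far for this (op, hw) pair.
        key = f"{op}@{hw}"
        best = self.skills.get(key)
        if best is None or latency_ms < best["latency_ms"]:
            self.skills[key] = {"source": source, "latency_ms": latency_ms}
            self.path.write_text(json.dumps(self.skills, indent=2))

    def lookup(self, op: str, hw: str) -> dict | None:
        # Later sessions retrieve prior winners as generation context.
        return self.skills.get(f"{op}@{hw}")
```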

Contrast with compiler autotuning

Compiler autotuners (TVM AutoScheduler, Triton's auto-tuner, PyTorch max-autotune) search a predefined parameter space over a fixed kernel template. They cannot:

  • Author new kernels outside the template space.
  • Use hardware documentation to synthesize kernels for silicon they've never been trained on.
  • Learn across sessions via a skill library.

Agentic kernel synthesis does all three. The two paradigms are complementary; compiler autotuners will remain useful for parameter-picking within KernelEvolve-generated kernels. The toy sketch below shows why a fixed-template search cannot escape its parameter space.
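
A toy contrast, with a hypothetical template and timing stub: the autotuner below can only enumerate a finite parameter grid over one fixed source skeleton, which is exactly the limitation the list above describes.

```python
# Fixed-template autotuning in miniature; the template and the timing stub
# are hypothetical placeholders, not a real autotuner.

import itertools

TEMPLATE = """\
// fixed matmul template; only the tile constants vary
#define BLOCK_M {bm}
#define BLOCK_N {bn}
#define NUM_WARPS {warps}
"""

def measure_latency_ms(source: str) -> float:
    # Stub standing in for compile-and-benchmark on real hardware.
    return float(abs(hash(source)) % 100) / 10.0

def autotune() -> str:
    # Exhaustive search over 2 x 3 x 2 = 12 fixed-template configurations;
    # no new source structure can emerge from this loop.
    grid = itertools.product([64, 128], [64, 128, 256], [4, 8])
    candidates = [TEMPLATE.format(bm=bm, bn=bn, warps=w) for bm, bn, w in grid]
    return min(candidates, key=measure_latency_ms)
```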
