Agentic Kernel Synthesis¶
Definition¶
Agentic kernel synthesis is the system-level framing that couples four primitives into a production-grade pipeline for authoring GPU/accelerator kernels without hand-tuning:
- An LLM synthesizer that emits candidate kernel source code across multiple DSLs + low-level languages (Triton, CUDA, HIP, MTIA C++, etc.).
- A structured search engine (tree search over LLM candidates) that explores the candidate space with MCTS + evolutionary strategies.
- A retrieval-augmented knowledge base (RAG over hardware documentation + dynamic skill library with in-context RL) that injects proprietary + standard hardware context into generation prompts.
- A structured evaluation harness (evaluation harness in agent loop) that feeds back why a candidate is slow (memory-bound vs compute-bound vs occupancy-limited), not just scalar wall-clock time.
The combination is what distinguishes "agentic" kernel synthesis from two adjacent approaches: (a) one-shot LLM code generation, which emits one kernel and stops, and (b) compiler autotuning (TVM, Halide, AutoTVM), which searches a predefined parameter space over a fixed kernel template but doesn't synthesize new source code. Agentic kernel synthesis produces new source code and iterates on it with structured feedback (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure).
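The coupling of the four primitives can be sketched as a single loop. This is a hypothetical, heavily stubbed sketch: `synthesize`, `retrieve_context`, `evaluate`, and `agentic_synthesis` are illustrative names, and each stub stands in for a real LLM call, RAG lookup, or profiler run.

```python
import random

# Hypothetical stand-ins for the four primitives; each is a stub so the loop
# structure is runnable. A real system would call an LLM, a profiler, etc.

def synthesize(prompt):
    """LLM synthesizer: emit one candidate kernel source string (stubbed)."""
    return f"// candidate kernel for: {prompt[:48]}"

def retrieve_context(op_name, kb):
    """Retrieval step: pull hardware docs + prior skills for this operator."""
    return "; ".join(kb.get(op_name, ["no prior context"]))

def evaluate(candidate):
    """Evaluation harness: return (correct, latency_ms, diagnosis), not a bare scalar."""
    latency = random.uniform(0.5, 2.0)
    diagnosis = random.choice(["memory-bound", "compute-bound", "occupancy-limited"])
    return True, latency, diagnosis

def agentic_synthesis(op_name, kb, budget=8):
    """Couple synthesis, retrieval, evaluation, and memory into one search loop."""
    frontier = []  # (latency_ms, candidate, diagnosis), best first
    for _ in range(budget):
        context = retrieve_context(op_name, kb)
        # Structured feedback from the best candidate so far steers the next prompt.
        hint = f"fix {frontier[0][2]} bottleneck" if frontier else "first attempt"
        candidate = synthesize(f"{op_name} [{context}] [{hint}]")
        ok, latency, diagnosis = evaluate(candidate)
        if ok:
            frontier.append((latency, candidate, diagnosis))
            frontier.sort(key=lambda t: t[0])
    # Skill-library write-back: the next session starts from this knowledge.
    kb.setdefault(op_name, []).append(f"best was {frontier[0][2]} at {frontier[0][0]:.2f} ms")
    return frontier[0]
```

The point of the sketch is the dataflow, not the stubs: the diagnosis flows back into the next prompt, and the result flows forward into a persistent store.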
Canonical instance — Meta KernelEvolve (2026-04-02)¶
KernelEvolve is the canonical wiki instance. Six components (the exact framing from Meta's post):
- LLM Synthesizer — emits Triton / TLX / CuTe DSL / FlyDSL / CUDA / HIP / MTIA C++; dynamic context-aware prompts.
- Tree Search Engine — MCTS + evolutionary; configurable memory operator per node (inherit / compare-siblings / combine / clean-slate).
- Retrieval-Augmented Knowledge Base — three categories (correctness / platform-agnostic / hardware-specific) plus self-evolving skill library.
- Automated Evaluation Framework — TritonBench + PyTorch Profiler + NCU + Proton + MTIA Insight; structured diagnostic output via compiler-centric job graphs with MLIR-level instrumentation.
- Shared Data Foundation — every session contributes to a compounding store; early adopters do the hard exploration, subsequent users inherit.
- Agentic Reinforcement Learning — optimization trajectories post-train smaller specialized models with kernel-performance reward (agentic RL from production signal).
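The per-node memory operator in the tree search (inherit / compare-siblings / combine / clean-slate) can be read as a prompt-construction policy. The operator names come from Meta's post; the code below is a hypothetical sketch, with `build_prompt` and the `node` dict shape invented for illustration.

```python
from enum import Enum

class MemoryOp(Enum):
    INHERIT = "inherit"            # child sees the parent's best candidate
    COMPARE_SIBLINGS = "compare"   # child sees sibling candidates side by side
    COMBINE = "combine"            # child is asked to merge two candidates
    CLEAN_SLATE = "clean-slate"    # child starts from the task spec only

def build_prompt(op, node):
    """Assemble the LLM prompt for a search node according to its memory operator.

    `node` is a dict with hypothetical keys: 'spec', 'parent', 'siblings'.
    """
    if op is MemoryOp.CLEAN_SLATE:
        return node["spec"]
    if op is MemoryOp.INHERIT:
        return f"{node['spec']}\nImprove on this candidate:\n{node['parent']}"
    if op is MemoryOp.COMPARE_SIBLINGS:
        listing = "\n---\n".join(node["siblings"])
        return f"{node['spec']}\nThese variants were already tried:\n{listing}"
    if op is MemoryOp.COMBINE:
        a, b = node["siblings"][:2]
        return f"{node['spec']}\nMerge the strengths of:\n{a}\n---\n{b}"
    raise ValueError(f"unknown operator: {op}")
```

Making the operator configurable per node is what lets the search mix exploitation (inherit, combine) with exploration (clean-slate) inside one tree.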
Output quality¶
Meta's production numbers from the 2026-04-02 post:
- >60% inference throughput improvement on Andromeda (NVIDIA GPUs) over a torch.compile + vendor libraries baseline — the baseline matters: KernelEvolve is not beating naive code, it's beating aggressively-optimized code.
- >25% training throughput improvement on an ads model on MTIA silicon.
- 100% pass rate on KernelBench (Stanford's 250-problem suite).
- 480 configurations validated (160 PyTorch ATen operators × 3 hardware platforms, 100% correctness).
- "Trillions of daily inference requests" worth of production kernels generated.
Why agentic beats one-shot¶
One-shot LLM code generation fails at the production bar because:
- No search — the synthesizer has no mechanism to explore alternatives when the first candidate is wrong or slow.
- No structured feedback — the synthesizer doesn't know why a candidate is bad; only that it failed some binary test.
- No long-run memory — each session starts from scratch.
Agentic kernel synthesis addresses all three: tree search → structured feedback → persistent skill library. The key qualitative claim Meta makes is that weeks of expert effort collapse to hours of automated search, because the search + feedback + memory loop does what a human expert would do, but in parallel and without cognitive fatigue.
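The "structured feedback" half of that loop can be illustrated with a roofline-style triage. This is a deliberately simplified, hypothetical heuristic (`diagnose` is an invented function): production harnesses such as NCU or Proton expose far richer counters, but the principle is the same, classify the bottleneck rather than report a bare wall-clock number.

```python
def diagnose(achieved_tflops, peak_tflops, achieved_gbps, peak_gbps, occupancy):
    """Roofline-style triage: classify *why* a kernel is slow.

    Simplified heuristic over three coarse signals: compute utilization,
    memory-bandwidth utilization, and achieved occupancy (all as fractions
    of the hardware peak).
    """
    compute_util = achieved_tflops / peak_tflops
    mem_util = achieved_gbps / peak_gbps
    if occupancy < 0.3:
        # Too few resident warps/threads to hide latency at all.
        return "occupancy-limited"
    if mem_util > compute_util:
        # Bandwidth is the dominant ceiling: fuse, tile, or cache.
        return "memory-bound"
    return "compute-bound"
```

A diagnosis like `"memory-bound"` is actionable in the next synthesis prompt ("fuse these loads", "increase tiling") in a way a raw latency number is not.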
Contrast with compiler autotuning¶
Compiler autotuners (TVM AutoScheduler, Triton's auto-tuner, PyTorch max-autotune) pick parameters from a predefined parameter space over a fixed kernel template. They cannot:
- Author new kernels outside the template space.
- Use hardware documentation to synthesize kernels for silicon they've never been trained on.
- Learn across sessions via a skill library.
Agentic kernel synthesis does all three. The two paradigms are complementary; compiler autotuners will remain useful for parameter-picking within KernelEvolve-generated kernels.
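The contrast is easiest to see in code. A classic autotuner is exhaustive search over a fixed parameter space; the kernel source never changes, only its configuration. The sketch below uses a toy cost function in place of a real benchmark run (`autotune`, `cost`, and the parameter names are illustrative, though `BLOCK_M`/`num_warps`-style knobs mirror what Triton's auto-tuner sweeps).

```python
from itertools import product

def autotune(template_runner, space):
    """Classic autotuning: pick the best point in a fixed parameter space.

    `template_runner(config) -> latency_ms` benchmarks one configuration of
    a fixed kernel template; the source code itself is never rewritten.
    """
    best_cfg, best_ms = None, float("inf")
    for values in product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        ms = template_runner(cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms

# Toy cost model standing in for a real benchmark of one template config.
space = {"BLOCK_M": [64, 128], "BLOCK_N": [64, 128], "num_warps": [4, 8]}
cost = lambda c: abs(c["BLOCK_M"] - 128) + abs(c["BLOCK_N"] - 64) + c["num_warps"]
best_cfg, best_ms = autotune(cost, space)
```

Everything an autotuner can reach lies inside `space`; agentic synthesis, by contrast, rewrites the template itself, which is why the two compose rather than compete.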
Seen in¶
- Meta KernelEvolve (2026-04-02, canonical). First wiki canonicalisation of the system-level framing. (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure)
Related¶
- concepts/kernel-optimization-as-search — the algorithmic primitive; this page is the system-level framing.
- concepts/hardware-proprietary-knowledge-injection — the mechanism enabling proprietary-silicon targets.
- concepts/in-context-reinforcement-learning — the session-to-session learning mechanism.
- systems/kernelevolve — the production instance.
- patterns/tree-search-over-llm-candidates — the search-structure pattern.
- patterns/evaluation-harness-in-agent-loop — the structured-feedback pattern.
- patterns/rag-over-hardware-documentation — the knowledge-injection pattern.
- patterns/agentic-rl-from-production-signal — the data-flywheel pattern.
- companies/meta — canonicalising source.