KernelEvolve¶
KernelEvolve is Meta's agentic kernel-authoring system — a production agent that autonomously generates and optimizes GPU/accelerator kernels across Meta's heterogeneous AI hardware fleet (NVIDIA GPUs, AMD GPUs, Meta's custom MTIA silicon, CPUs). It is used by Meta's Ranking Engineer Agent (REA) and applies generally beyond ads ranking. Unlike one-shot LLM code generation, KernelEvolve frames kernel optimization as a structured search problem over the space of possible implementations, explores hundreds of candidates per session, and routes structured profiling feedback back into the LLM synthesizer until a kernel meets or exceeds human-expert performance.
Headline production results (2026-04-02 post):
- >60% inference throughput improvement on Meta's Andromeda ads retrieval model on NVIDIA GPUs, over a baseline already optimized with torch.compile + vendor libraries.
- >25% training throughput improvement on an (unnamed) ads model on MTIA silicon.
- Serves "trillions of daily inference requests" worth of production kernel code at Meta.
- 100% pass rate on Stanford's 250-problem KernelBench benchmark (all generated kernels correct and faster than PyTorch reference).
- 480 configurations validated — 160 PyTorch ATen operators × 3 hardware platforms, 100% correctness.
Paper at ISCA 2026: KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta.
Six components¶
1. LLM Synthesizer¶
Generates candidate kernels across the full DSL + language stack Meta uses internally:
- High-level DSLs: Triton, TLX (github.com/facebookexperimental/triton), CuTe DSL (NVIDIA), FlyDSL (Meta).
- Low-level backends: CUDA (NVIDIA), HIP (AMD), MTIA C++.
Uses dynamic context-aware prompts enriched with runtime diagnostics, hardware constraints, and prior-candidate history — "a single adaptive interface" that unifies debugging, performance tuning, and correctness verification workflows and "drives a continuous, feedback-driven optimization loop." Replaces the traditional approach of maintaining separate prompt templates per task type and per hardware platform.
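A minimal sketch of what such adaptive prompt assembly could look like. All names (`PromptContext`, the field names, the section headers) are illustrative assumptions, not Meta's API; the point is that one template adapts by dropping empty sections rather than maintaining per-task templates.

```python
from dataclasses import dataclass

# Hypothetical sketch: one adaptive prompt covers debugging, tuning, and
# verification by including only the context sections that are populated.
@dataclass
class PromptContext:
    task: str                        # e.g. "optimize softmax for MTIA"
    hardware_constraints: list[str]  # retrieved per-target constraints
    runtime_diagnostics: list[str]   # compiler errors, profiler findings
    prior_candidates: list[str]      # summaries of earlier attempts

    def render(self) -> str:
        sections = [
            ("Task", [self.task]),
            ("Hardware constraints", self.hardware_constraints),
            ("Runtime diagnostics", self.runtime_diagnostics),
            ("Prior candidates", self.prior_candidates),
        ]
        # Empty sections simply drop out of the rendered prompt.
        return "\n\n".join(
            f"## {title}\n" + "\n".join(f"- {item}" for item in items)
            for title, items in sections if items
        )

ctx = PromptContext(
    task="Fuse LayerNorm + GEMM for NVIDIA H100 (Triton)",
    hardware_constraints=["shared memory <= 228 KB per SM"],
    runtime_diagnostics=["NCU: occupancy limited by register pressure"],
    prior_candidates=["v3: 1.18x over baseline, register spills"],
)
prompt = ctx.render()
```

A debugging round and a tuning round differ only in which sections carry content, which is the "single adaptive interface" idea in miniature.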
2. Tree Search Engine¶
Drives the exploration. Combines Monte Carlo tree search + evolutionary strategies. Each kernel candidate is a node in a search tree; the engine selects promising candidates, applies transformations, evaluates, and decides whether to explore further or backtrack — balancing exploitation of known-good strategies against exploration of novel approaches.
Critical design detail — configurable memory operator per node:
"Each node carries a configurable memory operator that determines how it draws context from the search tree when generating the next round of candidates. A node may inherit its parent's optimization trajectory to refine a promising direction, compare against siblings to learn what differentiates high-performing variants, combine insights from both parent and sibling histories, or start with a clean slate to escape local optima."
Four modes enumerated: inherit-parent / compare-siblings / combine-both / clean-slate. This selective-memory mechanism is the structural primitive that takes the search beyond "simple independent sampling" — sibling nodes collaborate by surfacing complementary strategies, parent-child chains preserve successful optimization paths, and memory-free restarts inject diversity when the search stagnates.
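The four memory modes can be sketched concretely. The `Node` shape and function names below are assumptions, not the paper's API; only the semantics of the four modes come from the quoted text.

```python
from dataclasses import dataclass, field

# Illustrative search-tree node: each candidate kernel is a node carrying a
# distilled summary of its strategy and result.
@dataclass
class Node:
    summary: str                       # distilled description of this candidate
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)

def memory_context(node: Node, mode: str) -> list[str]:
    trajectory = []                    # ancestor summaries, root first
    n = node.parent
    while n is not None:
        trajectory.insert(0, n.summary)
        n = n.parent
    siblings = ([c.summary for c in node.parent.children if c is not node]
                if node.parent else [])
    if mode == "inherit-parent":       # refine a promising direction
        return trajectory
    if mode == "compare-siblings":     # learn what differentiates variants
        return siblings
    if mode == "combine-both":         # merge parent + sibling histories
        return trajectory + siblings
    if mode == "clean-slate":          # restart to escape local optima
        return []
    raise ValueError(f"unknown memory mode: {mode}")

root = Node("baseline Triton kernel")
a = Node("vectorized loads, 1.3x", parent=root)
b = Node("swapped tile order, 1.1x", parent=root)
root.children = [a, b]
```

The returned summaries would be spliced into the synthesizer prompt for the next expansion; clean-slate returns nothing, which is exactly the "memory-free restart" that injects diversity.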
3. Retrieval-Augmented Knowledge Base¶
Three-category hierarchical knowledge base retrieved dynamically based on runtime signals:
- Correctness constraints — valid kernel shape rules.
- Platform-agnostic optimization guidance — debugging + tuning strategies that transfer across platforms.
- Hardware-specific documentation — architecture manuals, instruction set references, memory hierarchy specifications, optimization patterns for each accelerator platform.
Triggered by runtime signals: "a memory bandwidth bottleneck triggers retrieval of memory hierarchy documentation; a compilation error activates debugging guidance."
Self-evolving skill library — successful optimization strategies are distilled into reusable skills (compact optimization patterns + debugging heuristics) and continuously written back into the knowledge base. Meta calls this "a form of in-context reinforcement learning" — each successful exploration enriches the context available to future sessions, enabling faster convergence on similar problems without model retraining.
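A hedged sketch of signal-triggered retrieval plus the skill write-back. The routing rules and three categories mirror the text; the dictionary layout and function names are assumptions.

```python
# Toy knowledge base over the three categories described above.
KNOWLEDGE_BASE: dict[str, list[str]] = {
    "correctness":   ["valid kernel shape rules"],
    "platform":      ["general debugging + tuning strategies"],
    "hardware/mtia": ["MTIA memory hierarchy specification"],
    "hardware/h100": ["H100 memory hierarchy specification"],
}

def retrieve(signal: str, target: str) -> list[str]:
    docs = list(KNOWLEDGE_BASE["correctness"])         # always in scope
    if "compilation error" in signal:
        docs += KNOWLEDGE_BASE["platform"]             # debugging guidance
    if "memory bandwidth" in signal:
        docs += KNOWLEDGE_BASE[f"hardware/{target}"]   # memory hierarchy docs
    return docs

def distill_skill(strategy: str, target: str) -> None:
    # Self-evolving skill library: a successful strategy is written back so
    # future sessions retrieve it ("in-context reinforcement learning").
    KNOWLEDGE_BASE[f"hardware/{target}"].append(strategy)
```

After `distill_skill("double-buffer DMA loads", "mtia")`, the next MTIA session hitting a bandwidth bottleneck retrieves that skill alongside the architecture docs — no model retraining involved.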
4. Automated Evaluation Framework¶
Validates every kernel on two axes — correctness (bitwise accuracy against PyTorch references) and performance — but goes far beyond a single runtime number. Stack of profiling tools composed via a compiler-centric abstraction using job graphs (compiler transforms insert MLIR-level instrumentation, profiling passes collect metrics, trace synthesis produces structured output):
- TritonBench — numerical correctness against PyTorch baselines + end-to-end speedup across production input shapes.
- PyTorch Profiler — system-level execution timelines including kernel-launch overhead and host-device synchronization.
- NCU (Nsight Compute) (GPU) — kernel-level hardware metrics: occupancy, memory throughput, instruction mix.
- Proton (GPU) — intra-kernel instruction-level latency + pipeline behavior.
- MTIA Insight (MTIA) — PE utilization, fixed-function engine metrics (DPE / SFU / MLU utilization + stall cycles), cache behavior, per-PE memory bandwidth counters.
The search engine doesn't just receive "kernel A is 1.2x faster than kernel B" — it receives why (memory-bound vs compute-bound vs occupancy-limited) and feeds that diagnostic signal back to the LLM synthesizer for the next round. Canonical wiki instance of evaluation harness in agent loop.
The harness "handles the multi-minute build cycles and infrastructure failures that make naive approaches impractical" and evaluates hundreds of candidates in parallel on Meta's distributed infrastructure.
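A minimal sketch of the two-axis evaluation contract. In the real system correctness is bitwise against PyTorch and the diagnosis is derived from NCU/Proton/MTIA Insight counters; here both are simplified stand-ins, and the function shape is an assumption.

```python
import time

def evaluate(candidate, reference, inputs, iters: int = 200) -> dict:
    """Return correctness, speedup, AND a diagnosis — not just a number."""
    correct = candidate(*inputs) == reference(*inputs)  # stand-in for bitwise check

    def mean_latency(fn) -> float:
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        return (time.perf_counter() - t0) / iters

    speedup = mean_latency(reference) / mean_latency(candidate)
    # Placeholder for the profiler-derived "why" (memory-bound vs
    # compute-bound vs occupancy-limited) fed back to the synthesizer.
    diagnosis = "memory-bound"
    return {"correct": correct, "speedup": speedup, "diagnosis": diagnosis}

# Toy "kernels": the built-in sum vs a naive accumulation loop.
def reference_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

report = evaluate(sum, reference_sum, ([float(i) for i in range(5000)],))
```

The `diagnosis` field is the structural point: the search engine receives why a candidate is slow, not just how slow, so the next synthesis round can target the actual bottleneck.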
5. Shared Data Foundation¶
Every optimization session contributes to a shared data store. "When one engineer's exploration discovers an effective tiling strategy for a class of operators, that insight becomes available to every future session targeting similar workloads — creating a compounding effect where the system grows more capable with each use. Early adopters perform the hardest exploration; subsequent users inherit much closer to optimal starting points and refine from there."
The compounding effect is the primary mechanism that pushes wall-clock-per-optimization-session down over time without model retraining.
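The warm-start mechanics implied above could look like the following sketch; the store schema and keying (operator family × platform) are assumptions.

```python
# Toy shared data store: later sessions inherit the best strategy that any
# earlier session discovered for a similar workload.
SHARED_STORE: dict[str, dict] = {}

def warm_start(operator_family: str, platform: str):
    """Return the best-known starting strategy, or None for a cold start."""
    entry = SHARED_STORE.get(f"{operator_family}/{platform}")
    return entry["best_strategy"] if entry else None

def record(operator_family: str, platform: str,
           strategy: str, speedup: float) -> None:
    """Write a session's result back; keep only the best per key."""
    key = f"{operator_family}/{platform}"
    prev = SHARED_STORE.get(key)
    if prev is None or speedup > prev["speedup"]:
        SHARED_STORE[key] = {"best_strategy": strategy, "speedup": speedup}
```

Early adopters pay for the cold-start exploration; every later session calling `warm_start` begins from the accumulated best, which is the compounding effect in miniature.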
6. Agentic Reinforcement Learning¶
Every optimization session generates structured training data as a natural byproduct: agentic trajectories capturing the reasoning + code transformations + evaluation feedback behind high-performing kernels. Meta calls this data "rare and valuable — it encodes optimization intuition that no public dataset contains."
This data is used to post-train smaller, specialized models via agentic RL with the reward signal coming directly from measured kernel performance. The virtuous cycle: better models → better kernels in fewer reasoning tokens + fewer search steps → higher-quality training data → still-better models. Over successive iterations Meta "self-hosts increasingly efficient models that are compact enough to run cost-effectively at scale while retaining the optimization capability of much larger frontier models." Canonical wiki instance of agentic RL from production signal.
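A sketch of turning a session into an RL training example. The record shape and reward formula (speedup minus one, from measured latency) are assumptions; the source says only that the reward comes directly from measured kernel performance.

```python
def trajectory_to_example(steps: list[str],
                          baseline_latency_us: float,
                          final_latency_us: float) -> dict:
    # Reward derived directly from measured kernel performance:
    # a 1.6x speedup over baseline yields reward 0.6.
    reward = baseline_latency_us / final_latency_us - 1.0
    return {"trajectory": steps, "reward": reward}

ex = trajectory_to_example(
    steps=[
        "diagnose: occupancy limited by register pressure",
        "transform: reduce per-thread register usage",
        "eval: 1.6x over baseline, bitwise correct",
    ],
    baseline_latency_us=160.0,
    final_latency_us=100.0,
)
```

The trajectory (reasoning + transforms + evaluation feedback) is exactly the byproduct the text describes; post-training a smaller model on many such examples is what closes the flywheel.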
Hardware coverage¶
One unified framework across:
- NVIDIA GPUs (primary language: CUDA + Triton + CuTe DSL).
- AMD GPUs (HIP).
- Meta's custom MTIA silicon — four chip generations in two years (MTIA 300 → 500), each with "new compute capabilities, memory bandwidth characteristics, and numeric data types."
- CPUs.
Hardware-specific constraints + optimization patterns are retrieved from the knowledge base per-target — not maintained as separate prompt templates. This is the mechanism that makes proprietary-silicon kernel generation (MTIA) tractable on LLMs that have never seen MTIA code.
The MTIA result¶
The most consequential capability, architecturally:
"Meta's custom MTIA chips present a unique programming challenge. Because these chips are proprietary, no public LLM has been trained on MTIA code. A standard coding assistant lacks the context to write optimized MTIA kernels because it has never seen MTIA documentation, instruction set details, or programming idioms."
KernelEvolve solves this through systematic knowledge injection: MTIA-specific documentation (architecture manuals, ISA references, memory hierarchy specifications, optimization patterns) is encoded directly into the retrieval-augmented knowledge base. When the system targets MTIA, it retrieves and incorporates this proprietary knowledge into its reasoning, "effectively 'learning' the hardware in real time."
The engineering-cost inversion:
"When a new chip arrives, the engineering cost shifts from writing thousands of kernels by hand to curating a set of hardware documents and injecting them into the knowledge base."
Canonical wiki instance of hardware proprietary knowledge injection + RAG-over-hardware-documentation.
End-to-end flow¶
An engineer specifies target operator + hardware platform + performance goals. The system then:
1. Retrieves relevant hardware documentation + optimization knowledge from the knowledge base.
2. Generates an initial set of kernel candidates via the LLM synthesizer with context-aware prompting.
3. Evaluates each candidate for correctness + performance via distributed benchmarking infrastructure.
4. Feeds results back into the search engine, which selects the most promising candidates and applies further optimizations (new tree-node expansions).
5. Iterates steps 1–4 until a termination criterion is met: performance target achieved, search budget exhausted, or progress stalled.
6. Outputs the best-performing, fully validated kernel, ready for production deployment.
Persistent storage of search trees + implementations lets the system build on prior results when targeting new model variants or hardware generations.
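The flow above can be sketched as a single loop. Every helper below is a trivial stub standing in for the real subsystem (retrieval, synthesizer, distributed evaluator, tree expansion), and all names are illustrative assumptions.

```python
import random

# Stubs standing in for the real subsystems.
def retrieve_docs(op, platform):
    return [f"{platform} docs for {op}"]                    # step 1
def synthesize(op, docs, history):
    return [f"{op}-v{i}" for i in range(4)]                 # step 2
def evaluate(kernel):
    return {"kernel": kernel, "correct": True,              # step 3
            "speedup": 1.0 + random.random()}
def expand(best, results):
    # Step 4: mutate the best candidate into new tree-node expansions.
    return [f"{best['kernel']}+opt{i}" for i in range(4)]

def optimize(operator, platform, target_speedup, budget=10):
    docs = retrieve_docs(operator, platform)
    frontier = synthesize(operator, docs, history=[])
    best = None
    for _ in range(budget):                                 # step 5: iterate
        results = [evaluate(k) for k in frontier]
        for r in results:                                   # track best valid kernel
            if r["correct"] and (best is None or r["speedup"] > best["speedup"]):
                best = r
        if best and best["speedup"] >= target_speedup:
            break                                           # target achieved
        frontier = expand(best, results)
    return best                                             # step 6: best validated kernel

best = optimize("softmax", "mtia", target_speedup=1.5)
```

The termination logic shows two of the three criteria (target achieved, budget exhausted); a stall detector would compare `best` across iterations and break when it stops improving.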
Positioning within REA¶
Ranking Engineer Agent (REA) is Meta's broader autonomous-AI system accelerating Ads Ranking innovation. Two capabilities disclosed so far:
- ML Exploration (2026-03-17 post) — "autonomously designs, executes, and analyzes ranking model experiments." Discovers better models.
- KernelEvolve (this 2026-04-02 post) — makes discovered models production-ready by generating optimized kernels for the heterogeneous hardware the models run on.
"Within REA, ML Exploration discovers better models. KernelEvolve makes them production-ready. Together, they accelerate how quickly ranking improvements reach advertisers."
Meta names forward applications beyond kernel optimization: "hybrid model search, compiler optimization, memory management, and system configuration" — the underlying techniques (structured reasoning + retrieval-augmented knowledge + closed-loop evaluation) are framed as general-purpose.
Seen in (wiki)¶
- Meta KernelEvolve (2026-04-02, canonical). This page. First wiki instance of agentic-kernel-synthesis at hyperscale. (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure)
Caveats¶
Architecture-overview voice. Production numbers disclosed (Andromeda +60% inference, MTIA ads model +25% training, 100% KernelBench, 480 validated configurations, trillions of daily inference requests). But several implementation details deferred to the ISCA 2026 paper — not ingested here — including: per-session compute budget, search-tree depth + branching, agentic-RL-model architecture + parameter count, concrete profiling-signal schema, distributed evaluation infrastructure design, human-in-the-loop review gate before production deployment.
Related¶
- companies/meta — parent company; tier-1 source.
- systems/meta-mtia — Meta's custom AI silicon; the proprietary-hardware target that motivates the RAG-over-hardware-docs architecture.
- systems/meta-adaptive-ranking-model — the LLM-scale ads-ranking serving model whose inference infrastructure KernelEvolve-generated kernels contribute to.
- systems/meta-andromeda-ads — Andromeda retrieval model; the headline beneficiary (>60% inference throughput).
- concepts/kernel-optimization-as-search — the framing primitive.
- concepts/heterogeneous-ai-accelerator-fleet — the forcing function.
- concepts/hardware-proprietary-knowledge-injection — the mechanism for MTIA support.
- concepts/in-context-reinforcement-learning — the self-evolving skill library.
- patterns/tree-search-over-llm-candidates — the search-structure pattern.
- patterns/rag-over-hardware-documentation — the hardware-knowledge-injection pattern.
- patterns/evaluation-harness-in-agent-loop — the structured-feedback pattern.
- patterns/agentic-rl-from-production-signal — the data-flywheel pattern.