PATTERN
Evaluation Harness in Agent Loop¶
Problem¶
An LLM-based code-generation agent iterates on its output by consuming feedback. Naive feedback shapes — scalar wall-clock time, binary pass/fail on a test, a single profiler trace — are insufficient for the agent to understand why a candidate is slow or broken, which means subsequent rounds vary randomly rather than converging. Worse, at production scale the evaluation itself must absorb multi-minute build cycles + infrastructure failures, or the agent loop never makes progress.
Shape¶
Build a multi-layer evaluation harness that composes multiple profiling + validation tools into a single structured diagnostic output:
- Correctness layer — bitwise-compare against a reference implementation (PyTorch for GPU kernels).
- Performance layer — end-to-end speedup on realistic production input shapes.
- System-level profiling layer — kernel-launch overhead, host-device synchronization, stream behavior (PyTorch Profiler scope).
- Per-kernel hardware metrics layer — occupancy, memory throughput, instruction mix (NCU scope).
- Intra-kernel instruction-level layer — pipeline behavior, warp stall reasons, latency hot spots (Proton scope).
- Accelerator-specific layer — proprietary-silicon counters (MTIA Insight: PE utilization, DPE/SFU/MLU stall cycles, per-PE memory bandwidth).
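The layered output can be sketched as one structured record per candidate. This is a hypothetical Python dataclass; the field names and the `KernelDiagnostics` type are illustrative, not from the source:

```python
from dataclasses import dataclass, field

@dataclass
class KernelDiagnostics:
    """Hypothetical structured result of one multi-layer harness evaluation."""
    correct: bool            # correctness layer: bitwise match against reference
    speedup: float           # performance layer: vs. baseline on production shapes
    bottleneck: str          # "memory-bound" | "compute-bound" | "occupancy-limited"
    occupancy: float         # per-kernel hardware metrics layer (0.0-1.0)
    warp_stall_reasons: dict[str, float] = field(default_factory=dict)   # intra-kernel layer
    accelerator_counters: dict[str, float] = field(default_factory=dict) # e.g. PE utilization

# One evaluated candidate: slower in aggregate, but the layers explain why.
result = KernelDiagnostics(
    correct=True, speedup=0.83, bottleneck="compute-bound",
    occupancy=0.90, warp_stall_reasons={"long_scoreboard": 0.4},
)
```

The point of the record is that every layer's signal survives into the feedback, rather than being collapsed into one scalar.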
Compose these through a compiler-centric abstraction — job graphs:
- Compiler transforms insert MLIR-level instrumentation.
- Profiling passes collect metrics.
- Trace synthesis produces structured output.
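A minimal sketch of the job-graph idea, assuming each pass is a callable that augments a shared context (all names hypothetical; the real system composes MLIR-level transforms, not Python dicts):

```python
def compile_candidate(ctx):
    # Stand-in for a compiler transform that inserts instrumentation.
    ctx["binary"] = f"compiled({ctx['source']})"
    return ctx

def profile_pass(ctx):
    # Stand-in for a profiling pass that collects metrics.
    ctx.setdefault("metrics", {})["occupancy"] = 0.4
    return ctx

def synthesize_trace(ctx):
    # Trace synthesis: fold collected metrics into the structured output.
    ctx["report"] = {"bottleneck": "compute-bound", **ctx.get("metrics", {})}
    return ctx

JOB_GRAPH = [compile_candidate, profile_pass, synthesize_trace]

def run(source):
    ctx = {"source": source}
    for job in JOB_GRAPH:   # linear graph for simplicity; real job graphs branch
        ctx = job(ctx)
    return ctx["report"]
```

The abstraction matters because new platforms slot in as new passes, not as rewrites of the loop.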
The structured diagnostic output — not scalar wall-clock — is what feeds back into the agent's next round. The search engine doesn't just see "kernel A is 1.2× faster than kernel B"; it sees "kernel A is memory-bound, kernel B is compute-bound" and directs the LLM synthesizer accordingly.
Canonical instance — Meta KernelEvolve (2026-04-02)¶
Meta's KernelEvolve Automated Evaluation Framework is the canonical wiki instance. Composition:
- TritonBench — correctness + end-to-end speedup against PyTorch baselines.
- PyTorch Profiler — system-level timeline.
- NCU — per-kernel GPU metrics (NVIDIA).
- Proton — intra-kernel instruction-level latency (NVIDIA).
- MTIA Insight — accelerator-specific counters (MTIA).
Meta's canonical statement:
"The search engine doesn't just see 'kernel A is 1.2x faster than kernel B' — it sees why: whether the bottleneck is memory-bound, compute-bound, or limited by occupancy — and feeds that diagnostic signal back into the LLM synthesizer to guide the next round of candidates."
And on the harness's operational role:
"Under the hood, a purpose-built long-running job harness drives each iteration – compiling candidates, evaluating correctness and performance, profiling hardware utilization, and generating analysis reports – all while handling the multi-minute build cycles and infrastructure failures that make naive approaches impractical."
Why structured > scalar feedback¶
Three reasons:
- The LLM needs the why to course-correct. "This kernel is memory-bound" points the synthesizer at tiling / prefetching / memory-layout transformations. "This kernel is 1.2× slower" doesn't.
- The search engine needs structured gradings to choose node expansions. "Compute-bound + occupancy 40%" tells MCTS to try expanding toward tile-size + scheduling transformations; a scalar loss provides no such directional signal.
- Candidates that are slower in aggregate may be architecturally promising. A kernel that's slower overall but hits 90% occupancy is often one fix away from a big win; the grading surfaces that.
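The three points above can be sketched as a simple expansion policy that consumes the structured grading. The bottleneck-to-transformation mapping below is illustrative, not Meta's actual policy:

```python
# Hypothetical mapping from diagnosed bottleneck to candidate transformations.
EXPANSIONS = {
    "memory-bound":      ["tiling", "prefetching", "memory-layout change"],
    "compute-bound":     ["tile-size tuning", "instruction scheduling"],
    "occupancy-limited": ["register-pressure reduction", "block-size tuning"],
}

def expand(diagnostics):
    """Pick next transformations from the structured grading, not a scalar score."""
    moves = list(EXPANSIONS.get(diagnostics["bottleneck"], []))
    # A kernel that is slower overall but near-saturated on occupancy may be
    # one fix away from a win: keep it in the search frontier.
    if diagnostics.get("occupancy", 0.0) >= 0.9 and diagnostics.get("speedup", 1.0) < 1.0:
        moves.append("keep-in-frontier")
    return moves
```

With only a scalar score, both branches of this function would be unreachable: there would be nothing to branch on.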
Consequences¶
Positive:
- Iteration converges — each round applies transformations targeted at the actual bottleneck, not random variations.
- Headline benchmark numbers depend on catching rare wins — KernelEvolve's 100% KernelBench pass rate depends on every candidate being evaluated deeply enough to surface both correctness and performance failures at the right layer.
- Multi-platform — the same harness architecture spans NVIDIA + AMD + MTIA + CPU; each platform's profilers are plug-ins into the job-graph framework.
Negative / care required:
- Harness engineering is expensive — Meta built this harness specifically; it's "purpose-built" to absorb multi-minute build cycles + infra failures. Naive implementations pay an iteration-budget tax on every failed or stalled evaluation.
- Profiling signal must be reliable — if the harness mis-classifies memory-bound as compute-bound, the agent's subsequent rounds chase the wrong fix.
- Parallel evaluation is required for practical iteration speed — Meta runs "hundreds of candidates in parallel" on distributed infrastructure.
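A sketch of the parallel-evaluation requirement, assuming a hypothetical `evaluate` job that may fail mid-flight (the harness must absorb the failure rather than stall the loop):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(candidate):
    """Stand-in for a multi-minute compile + profile job; may raise on infra failure."""
    if candidate == "bad":
        raise RuntimeError("infra failure")
    return {"candidate": candidate, "bottleneck": "memory-bound"}

def evaluate_all(candidates, max_workers=8):
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate, c): c for c in candidates}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                failures.append(futures[fut])  # absorb the failure; retry or drop
    return results, failures
```

A thread pool stands in here for the distributed infrastructure on which Meta runs "hundreds of candidates in parallel"; the structural point is that failures are collected, not propagated into the agent loop.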
Seen in¶
- Meta KernelEvolve (2026-04-02, canonical). First wiki canonicalisation of multi-layer structured-profiling-in-agent-loop at hyperscale. (Source: sources/2026-04-02-meta-kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure)
Related¶
- concepts/agentic-kernel-synthesis — the system-level pattern this is an essential component of.
- concepts/kernel-optimization-as-search — the algorithmic framing whose feedback loop this pattern closes.
- systems/kernelevolve — production instance.
- systems/tritonbench / systems/nvidia-ncu / systems/proton-profiler — constituent tools.
- patterns/tree-search-over-llm-candidates — the pattern whose node-expansion policy consumes this harness's output.
- patterns/agentic-rl-from-production-signal — the evaluation-harness output also becomes training data.
- companies/meta — canonicalising source.