
PATTERN

Evaluation Harness in Agent Loop

Problem

An LLM-based code-generation agent iterates on its output by consuming feedback. Naive feedback shapes — scalar wall-clock time, binary pass/fail on a test, a single profiler trace — are insufficient for the agent to understand why a candidate is slow or broken, which means subsequent rounds vary randomly rather than converging. Worse, at production scale the evaluation itself must absorb multi-minute build cycles + infrastructure failures, or the agent loop never makes progress.

Shape

Build a multi-layer evaluation harness that composes multiple profiling + validation tools into a single structured diagnostic output (a composition sketch in code follows the list):

  • Correctness layer — bitwise-compare against a reference implementation (PyTorch for GPU kernels).
  • Performance layer — end-to-end speedup on realistic production input shapes.
  • System-level profiling layer — kernel-launch overhead, host-device synchronization, stream behavior (PyTorch Profiler scope).
  • Per-kernel hardware metrics layer — occupancy, memory throughput, instruction mix (NCU scope).
  • Intra-kernel instruction-level layer — pipeline behavior, warp stall reasons, latency hot spots (Proton scope).
  • Accelerator-specific layer — proprietary-silicon counters (MTIA Insight: PE utilization, DPE/SFU/MLU stall cycles, per-PE memory bandwidth).
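
A minimal sketch of how such layers might compose into one structured report, in Python. The `Diagnostic` schema, the `candidate.run()` interface, and the layer names are hypothetical; this is an illustration of the shape, not Meta's API:

```python
from dataclasses import dataclass, field
import torch

@dataclass
class Diagnostic:
    """Structured evaluation result fed back to the agent (hypothetical schema)."""
    correct: bool = False
    speedup: float = 0.0                  # end-to-end vs. the reference
    bottleneck: str = "unknown"           # "memory-bound" | "compute-bound" | ...
    metrics: dict = field(default_factory=dict)   # raw per-layer metrics

def correctness_layer(candidate, diag: Diagnostic) -> None:
    """Correctness layer: compare candidate output to a PyTorch reference."""
    x = torch.randn(4096, 4096, device="cuda")
    ref = torch.softmax(x, dim=-1)                # reference implementation
    out = candidate.run(x)                        # hypothetical candidate interface
    diag.correct = bool(torch.equal(out, ref))    # bitwise match, as the pattern asks
    diag.metrics["max_abs_err"] = (out - ref).abs().max().item()

def evaluate(candidate, layers) -> Diagnostic:
    """Run each layer in order; every layer adds to the same structured report."""
    diag = Diagnostic()
    for layer in layers:          # e.g. [correctness_layer, perf_layer, ncu_layer, ...]
        layer(candidate, diag)
        if not diag.correct:
            break                 # no point profiling an incorrect kernel
    return diag
```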

Compose these through a compiler-centric abstraction, job graphs (a toy pipeline sketch follows the list):

  • Compiler transforms insert MLIR-level instrumentation.
  • Profiling passes collect metrics.
  • Trace synthesis produces structured output.
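
One way to picture that three-stage job graph is as a toy pass pipeline over a shared context. The pass names, the placeholder counters, and the roofline-style threshold below are invented for illustration and do not reflect Meta's MLIR infrastructure:

```python
from typing import Callable

Pass = Callable[[dict], None]   # each pass reads and extends a shared context

def instrument(ctx: dict) -> None:
    # Compiler transform: pretend to insert instrumentation into the IR.
    ctx["ir"] += "\n// timers inserted around each loop nest"

def collect_metrics(ctx: dict) -> None:
    # Profiling pass: run the instrumented artifact and gather counters
    # (placeholder numbers stand in for real profiler output).
    ctx["counters"] = {"dram_bytes": 1.2e9, "flops": 3.4e11, "occupancy": 0.41}

def synthesize_trace(ctx: dict) -> None:
    # Trace synthesis: reduce raw counters to the structured diagnostic.
    c = ctx["counters"]
    intensity = c["flops"] / c["dram_bytes"]       # rough arithmetic intensity
    ctx["diagnostic"] = {
        "bottleneck": "memory-bound" if intensity < 10 else "compute-bound",  # toy threshold
        "occupancy": c["occupancy"],
    }

def run_job_graph(ctx: dict, passes: list[Pass]) -> dict:
    for p in passes:      # a real job graph is a DAG; a flat list shows the shape
        p(ctx)
    return ctx

report = run_job_graph({"ir": "func @kernel(%arg0: memref<...>) { ... }"},
                       [instrument, collect_metrics, synthesize_trace])
print(report["diagnostic"])
```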

The structured diagnostic output — not a scalar wall-clock number — is what feeds back into the agent's next round. The search engine doesn't just see "kernel A is 1.2× faster than kernel B"; it sees "kernel A is memory-bound, kernel B is compute-bound" and directs the LLM synthesizer accordingly.
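
The hand-off can be as simple as rendering that diagnostic into the synthesizer's next prompt. A minimal sketch, assuming a dict-shaped diagnostic with the fields shown (the field names and prompt wording are illustrative):

```python
def feedback_prompt(diag: dict) -> str:
    """Render the structured diagnostic as feedback for the LLM synthesizer."""
    status = "correct" if diag["correct"] else "INCORRECT"
    return (
        f"Previous kernel: {status}, {diag['speedup']:.2f}x vs. reference, "
        f"classified {diag['bottleneck']} at {diag['occupancy']:.0%} occupancy. "
        "Propose a revised kernel that addresses this bottleneck."
    )
```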

Canonical instance — Meta KernelEvolve (2026-04-02)

Meta's KernelEvolve Automated Evaluation Framework is the canonical wiki instance. Composition:

  • TritonBench — correctness + end-to-end speedup against PyTorch baselines.
  • PyTorch Profiler — system-level timeline.
  • NCU — per-kernel GPU metrics (NVIDIA).
  • Proton — intra-kernel instruction-level latency (NVIDIA).
  • MTIA Insight — accelerator-specific counters (MTIA).

Meta's canonical statement:

"The search engine doesn't just see 'kernel A is 1.2x faster than kernel B' — it sees why: whether the bottleneck is memory-bound, compute-bound, or limited by occupancy — and feeds that diagnostic signal back into the LLM synthesizer to guide the next round of candidates."

And on the harness's operational role:

"Under the hood, a purpose-built long-running job harness drives each iteration – compiling candidates, evaluating correctness and performance, profiling hardware utilization, and generating analysis reports – all while handling the multi-minute build cycles and infrastructure failures that make naive approaches impractical."

Why structured > scalar feedback

Three reasons:

  1. The LLM needs the why to course-correct. "This kernel is memory-bound" points the synthesizer at tiling / prefetching / memory-layout transformations. "This kernel is 1.2× slower" doesn't.
  2. The search engine needs structured gradings to choose node expansions. "Compute-bound + occupancy 40%" tells MCTS to try expanding toward tile-size + scheduling transformations (see the sketch after this list); a scalar loss provides no such directional signal.
  3. Candidates that are slower in aggregate may be architecturally promising. A kernel that's slower overall but hits 90% occupancy is often one fix away from a big win; the grading surfaces that.
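
To make point 2 concrete, here is a hypothetical routing table from bottleneck class to transformation families; the real search engine's expansion policy is not public, this only shows why a scalar score cannot carry the same directional signal:

```python
# Hypothetical mapping from diagnostic class to transformation families.
TRANSFORMS_BY_BOTTLENECK = {
    "memory-bound":      ["tiling", "prefetching", "memory-layout change"],
    "compute-bound":     ["tile-size tuning", "instruction scheduling"],
    "occupancy-limited": ["register-pressure reduction", "block-size tuning"],
}

def expansion_hints(diag: dict) -> list[str]:
    """Turn a structured diagnostic into directions for the next candidates."""
    hints = list(TRANSFORMS_BY_BOTTLENECK.get(diag["bottleneck"], []))
    if diag.get("occupancy", 1.0) < 0.5:
        hints += TRANSFORMS_BY_BOTTLENECK["occupancy-limited"]
    return hints

# "Compute-bound + occupancy 40%" yields scheduling and occupancy transforms;
# a bare "1.2x slower" scalar would leave the search to guess.
print(expansion_hints({"bottleneck": "compute-bound", "occupancy": 0.40}))
```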

Consequences

Positive:

  • Iteration converges — each round applies transformations targeted at the actual bottleneck, not random variations.
  • Benchmark headline numbers come from rare wins — KernelEvolve's 100% KernelBench pass rate depends on every candidate being evaluated deeply enough to catch both correctness and performance failures at the right layer.
  • Multi-platform — the same harness architecture spans NVIDIA + AMD + MTIA + CPU; each platform's profilers plug into the job-graph framework.

Negative / care required:

  • Harness engineering is expensive — Meta's harness is "purpose-built" to absorb multi-minute build cycles + infra failures; naive implementations pay an iteration-budget tax.
  • Profiling signal must be reliable — if the harness mis-classifies memory-bound as compute-bound, the agent's subsequent rounds chase the wrong fix.
  • Parallel evaluation is required for practical iteration speed — Meta runs "hundreds of candidates in parallel" on distributed infrastructure.
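
A rough sketch of that parallel loop, assuming a picklable `evaluate()` like the one sketched earlier; Meta's actual long-running job harness is not public:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def evaluate_generation(candidates, evaluate, max_workers=64):
    """Evaluate one generation of candidates in parallel.

    Each evaluate() call may take minutes (compile + run + profile), so
    candidates run in separate worker processes; a failure in one candidate
    is recorded rather than allowed to stall the whole agent loop.
    Candidates and results must be picklable for ProcessPoolExecutor.
    """
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate, c): i for i, c in enumerate(candidates)}
        for fut in as_completed(futures):
            idx = futures[fut]
            try:
                results[idx] = fut.result()
            except Exception as err:        # build timeout, node loss, OOM, ...
                results[idx] = {"error": repr(err)}
    return results
```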

