
PATTERN

Evaluation Harness in Agent Loop

Problem

An LLM-based code-generation agent iterates on its output by consuming feedback. Naive feedback shapes — scalar wall-clock time, binary pass/fail on a test, a single profiler trace — are insufficient for the agent to understand why a candidate is slow or broken, which means subsequent rounds vary randomly rather than converging. Worse, at production scale the evaluation itself must absorb multi-minute build cycles + infrastructure failures, or the agent loop never makes progress.

Shape

Build a multi-layer evaluation harness that composes multiple profiling + validation tools into a single structured diagnostic output (a composition sketch in code follows the list):

  • Correctness layer — bitwise-compare against a reference implementation (PyTorch for GPU kernels).
  • Performance layer — end-to-end speedup on realistic production input shapes.
  • System-level profiling layer — kernel-launch overhead, host-device synchronization, stream behavior (PyTorch Profiler scope).
  • Per-kernel hardware metrics layer — occupancy, memory throughput, instruction mix (NCU scope).
  • Intra-kernel instruction-level layer — pipeline behavior, warp stall reasons, latency hot spots (Proton scope).
  • Accelerator-specific layer — proprietary-silicon counters (MTIA Insight: PE utilization, DPE/SFU/MLU stall cycles, per-PE memory bandwidth).
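
A minimal sketch of how such layers might compose into one structured report, in Python. The `Diagnostic` schema, the `candidate.run()` interface, and the layer names are hypothetical; this is an illustration of the shape, not Meta's API:

```python
from dataclasses import dataclass, field
import torch

@dataclass
class Diagnostic:
    """Structured evaluation result fed back to the agent (hypothetical schema)."""
    correct: bool = False
    speedup: float = 0.0                  # end-to-end vs. the reference
    bottleneck: str = "unknown"           # "memory-bound" | "compute-bound" | ...
    metrics: dict = field(default_factory=dict)   # raw per-layer metrics

def correctness_layer(candidate, diag: Diagnostic) -> None:
    """Correctness layer: compare candidate output to a PyTorch reference."""
    x = torch.randn(4096, 4096, device="cuda")
    ref = torch.softmax(x, dim=-1)                # reference implementation
    out = candidate.run(x)                        # hypothetical candidate interface
    diag.correct = bool(torch.equal(out, ref))    # bitwise match, as the pattern asks
    diag.metrics["max_abs_err"] = (out - ref).abs().max().item()

def evaluate(candidate, layers) -> Diagnostic:
    """Run each layer in order; every layer adds to the same structured report."""
    diag = Diagnostic()
    for layer in layers:          # e.g. [correctness_layer, perf_layer, ncu_layer, ...]
        layer(candidate, diag)
        if not diag.correct:
            break                 # no point profiling an incorrect kernel
    return diag
```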

Compose these through a compiler-centric abstraction, job graphs (a toy pipeline sketch follows the list):

  • Compiler transforms insert MLIR-level instrumentation.
  • Profiling passes collect metrics.
  • Trace synthesis produces structured output.
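
One way to picture that three-stage job graph is as a toy pass pipeline over a shared context. The pass names, the placeholder counters, and the roofline-style threshold below are invented for illustration and do not reflect Meta's MLIR infrastructure:

```python
from typing import Callable

Pass = Callable[[dict], None]   # each pass reads and extends a shared context

def instrument(ctx: dict) -> None:
    # Compiler transform: pretend to insert instrumentation into the IR.
    ctx["ir"] += "\n// timers inserted around each loop nest"

def collect_metrics(ctx: dict) -> None:
    # Profiling pass: run the instrumented artifact and gather counters
    # (placeholder numbers stand in for real profiler output).
    ctx["counters"] = {"dram_bytes": 1.2e9, "flops": 3.4e11, "occupancy": 0.41}

def synthesize_trace(ctx: dict) -> None:
    # Trace synthesis: reduce raw counters to the structured diagnostic.
    c = ctx["counters"]
    intensity = c["flops"] / c["dram_bytes"]       # rough arithmetic intensity
    ctx["diagnostic"] = {
        "bottleneck": "memory-bound" if intensity < 10 else "compute-bound",  # toy threshold
        "occupancy": c["occupancy"],
    }

def run_job_graph(ctx: dict, passes: list[Pass]) -> dict:
    for p in passes:      # a real job graph is a DAG; a flat list shows the shape
        p(ctx)
    return ctx

report = run_job_graph({"ir": "func @kernel(%arg0: memref<...>) { ... }"},
                       [instrument, collect_metrics, synthesize_trace])
print(report["diagnostic"])
```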

The structured diagnostic output — not a scalar wall-clock number — is what feeds back into the agent's next round. The search engine doesn't just see "kernel A is 1.2× faster than kernel B"; it sees "kernel A is memory-bound, kernel B is compute-bound" and directs the LLM synthesizer accordingly.
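
The hand-off can be as simple as rendering that diagnostic into the synthesizer's next prompt. A minimal sketch, assuming a dict-shaped diagnostic with the fields shown (the field names and prompt wording are illustrative):

```python
def feedback_prompt(diag: dict) -> str:
    """Render the structured diagnostic as feedback for the LLM synthesizer."""
    status = "correct" if diag["correct"] else "INCORRECT"
    return (
        f"Previous kernel: {status}, {diag['speedup']:.2f}x vs. reference, "
        f"classified {diag['bottleneck']} at {diag['occupancy']:.0%} occupancy. "
        "Propose a revised kernel that addresses this bottleneck."
    )
```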

Canonical instance — Meta KernelEvolve (2026-04-02)

Meta's KernelEvolve Automated Evaluation Framework is the canonical wiki instance. Composition:

  • TritonBench — correctness + end-to-end speedup against PyTorch baselines.
  • PyTorch Profiler — system-level timeline.
  • NCU — per-kernel GPU metrics (NVIDIA).
  • Proton — intra-kernel instruction-level latency (NVIDIA).
  • MTIA Insight — accelerator-specific counters (MTIA).

Meta's canonical statement:

"The search engine doesn't just see 'kernel A is 1.2x faster than kernel B' — it sees why: whether the bottleneck is memory-bound, compute-bound, or limited by occupancy — and feeds that diagnostic signal back into the LLM synthesizer to guide the next round of candidates."

And on the harness's operational role:

"Under the hood, a purpose-built long-running job harness drives each iteration – compiling candidates, evaluating correctness and performance, profiling hardware utilization, and generating analysis reports – all while handling the multi-minute build cycles and infrastructure failures that make naive approaches impractical."

Why structured > scalar feedback

Three reasons:

  1. The LLM needs the why to course-correct. "This kernel is memory-bound" points the synthesizer at tiling / prefetching / memory-layout transformations. "This kernel is 1.2× slower" doesn't.
  2. The search engine needs structured gradings to choose node expansions. "Compute-bound + occupancy 40%" tells MCTS to try expanding toward tile-size + scheduling transformations (see the sketch after this list); a scalar loss provides no such directional signal.
  3. Candidates that are slower in aggregate may be architecturally promising. A kernel that's slower overall but hits 90% occupancy is often one fix away from a big win; the grading surfaces that.
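
To make point 2 concrete, here is a hypothetical routing table from bottleneck class to transformation families; the real search engine's expansion policy is not public, this only shows why a scalar score cannot carry the same directional signal:

```python
# Hypothetical mapping from diagnostic class to transformation families.
TRANSFORMS_BY_BOTTLENECK = {
    "memory-bound":      ["tiling", "prefetching", "memory-layout change"],
    "compute-bound":     ["tile-size tuning", "instruction scheduling"],
    "occupancy-limited": ["register-pressure reduction", "block-size tuning"],
}

def expansion_hints(diag: dict) -> list[str]:
    """Turn a structured diagnostic into directions for the next candidates."""
    hints = list(TRANSFORMS_BY_BOTTLENECK.get(diag["bottleneck"], []))
    if diag.get("occupancy", 1.0) < 0.5:
        hints += TRANSFORMS_BY_BOTTLENECK["occupancy-limited"]
    return hints

# "Compute-bound + occupancy 40%" yields scheduling and occupancy transforms;
# a bare "1.2x slower" scalar would leave the search to guess.
print(expansion_hints({"bottleneck": "compute-bound", "occupancy": 0.40}))
```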

Consequences

Positive:

  • Iteration converges — each round applies transformations targeted at the actual bottleneck, not random variations.
  • Benchmark headline numbers come from rare wins — KernelEvolve's 100% KernelBench pass rate depends on every candidate being evaluated deeply enough to catch both correctness and performance failures at the right layer.
  • Multi-platform — the same harness architecture spans NVIDIA + AMD + MTIA + CPU; each platform's profilers plug into the job-graph framework.

Negative / care required:

  • Harness engineering is expensive — Meta's harness is "purpose-built" to absorb multi-minute build cycles + infra failures; naive implementations pay an iteration-budget tax.
  • Profiling signal must be reliable — if the harness mis-classifies memory-bound as compute-bound, the agent's subsequent rounds chase the wrong fix.
  • Parallel evaluation is required for practical iteration speed — Meta runs "hundreds of candidates in parallel" on distributed infrastructure.
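
A rough sketch of that parallel loop, assuming a picklable `evaluate()` like the one sketched earlier; Meta's actual long-running job harness is not public:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def evaluate_generation(candidates, evaluate, max_workers=64):
    """Evaluate one generation of candidates in parallel.

    Each evaluate() call may take minutes (compile + run + profile), so
    candidates run in separate worker processes; a failure in one candidate
    is recorded rather than allowed to stall the whole agent loop.
    Candidates and results must be picklable for ProcessPoolExecutor.
    """
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate, c): i for i, c in enumerate(candidates)}
        for fut in as_completed(futures):
            idx = futures[fut]
            try:
                results[idx] = fut.result()
            except Exception as err:        # build timeout, node loss, OOM, ...
                results[idx] = {"error": repr(err)}
    return results
```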

