PATTERN
Agent-driven benchmark loop¶
Problem¶
Shipping a quality-sensitive ML-adjacent system (retrieval, extraction, classification, synthesis) requires fast iteration on architecture and prompts against a measurable ground-truth benchmark. The naive workflow — engineer reads benchmark output, manually proposes fix, edits code, re-runs benchmark — is slow in two ways:
- Wall-clock slow. Large benchmark suites take time; stochasticity (even at temperature=0, LLM outputs vary run-to-run) means every proposed change needs multiple runs to distinguish real improvement from noise.
- Idea-throughput slow. Engineers can only propose a handful of hypotheses per day; the solution space of prompt tweaks / pipeline reorderings / channel additions / weight retunings is enormous.
An agent can propose hypotheses fast, but left alone it will overfit the benchmark: tuning to the specific question shapes in the test set rather than the generalising properties of the design.
The pattern¶
Run a closed loop that mixes fast agent proposal with slow human gate:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. run benchmarks │
│ │ │
│ ▼ │
│ 2. analyse where we had gaps │
│ │ │
│ ▼ │
│ 3. AGENT proposes fixes │
│ │ │
│ ▼ │
│ 4. HUMAN reviews proposals, selects strategies that │
│ GENERALISE rather than overfit │
│ │ │
│ ▼ │
│ 5. AGENT makes the selected changes to the code / prompts │
│ │ │
│ ▼ │
└─────────────────────────────────────────────────────────────┘ (back to 1)
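The five steps above can be sketched as a loop skeleton. This is a minimal illustration, not the team's implementation: every callable here (`run_benchmarks`, `analyse_gaps`, the agent and human steps) is a hypothetical placeholder you would wire to your own harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Proposal:
    description: str
    overfits: bool  # in reality the human judges this by reading the diff


def benchmark_loop(
    run_benchmarks: Callable[[], dict],          # 1. run the suite (placeholder)
    analyse_gaps: Callable[[dict], list],        # 2. find where scores fell short
    propose_fixes: Callable[[list], list],       # 3. AGENT: high-throughput hypotheses
    human_approves: Callable[[Proposal], bool],  # 4. HUMAN gate: generalise vs overfit
    apply_fix: Callable[[Proposal], None],       # 5. AGENT edits code / prompts
    iterations: int = 3,
) -> list:
    applied = []
    for _ in range(iterations):
        scores = run_benchmarks()
        gaps = analyse_gaps(scores)
        for proposal in propose_fixes(gaps):
            if human_approves(proposal):  # the only human step in the loop
                apply_fix(proposal)
                applied.append(proposal)
    return applied
```

Note that the human appears exactly once, at step 4: everything high-throughput is delegated, and the single slow step is the one that needs judgment.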
Load-bearing properties:
- Agent handles high-throughput work. Proposing fixes + making code changes both become agent tasks — the cycle can loop much faster than it could with only human hands on the code.
- Human gates the proposal selection. The human's one job is filtering generalising fixes from benchmark-overfitting fixes. The agent doesn't have a strong internal prior on what overfits; the human does.
- Multiple benchmarks, not one. Running against a suite of independent benchmarks — each testing different things — is the structural defense against benchmark-specific overfitting. If a change improves one benchmark and regresses another, the human spots it.
- Stochasticity handled explicitly. Multiple runs + trend analysis instead of trusting single-run scores. Any proposed gain needs distributional evidence.
- Trend analysis alongside raw scores. Raw benchmark scores are noisy; "is the trend going up consistently across iterations?" is a more robust signal than "is today's score above yesterday's?"
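The "multiple runs plus trend analysis" discipline can be made concrete. The sketch below is an assumption-laden illustration (the scores and the halves-comparison heuristic are ours, not from the source): average repeated runs of one configuration to damp stochasticity, and judge the trend across iterations rather than comparing today's score to yesterday's.

```python
from statistics import mean

def mean_score(run_scores: list[float]) -> float:
    """Average repeated runs of the same configuration to damp run-to-run noise."""
    return mean(run_scores)

def trend_is_up(iteration_means: list[float]) -> bool:
    """Crude trend check: compare the mean of the later half of iterations
    against the earlier half, instead of trusting any single-run delta."""
    half = len(iteration_means) // 2
    return mean(iteration_means[half:]) > mean(iteration_means[:half])
```

A gain only counts when it survives both filters: the averaged score moves, and the movement persists across iterations.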
Canonical wiki instance: Cloudflare Agent Memory¶
The Agent Memory team described this loop explicitly:
"So we put it into an agent-driven loop and iterated. The cycle looked like this: run benchmarks, analyze where we had gaps, propose solutions, have a human review the proposals to select strategies that generalize rather than overfit, let the agent make the changes, repeat."
"LLMs are stochastic, even with temperature set to zero. This caused results to vary across runs, which meant we had to average multiple runs (time-consuming for large benchmarks) and rely on trend analysis alongside raw scores to understand what was actually working."
"Along the way we had to guard carefully against overfitting the benchmarks in ways that didn't genuinely make the product better for the general case."
"We intentionally tested against multiple benchmarks (including LoCoMo, LongMemEval, and BEAM) to push the system in different ways."
The benchmark stack (public benchmarks chosen deliberately for independence):
- LongMemEval (arxiv:2410.10813) — long-conversation memory evaluation.
- LoCoMo (arxiv:2402.17753) — long-form conversational memory.
- BEAM (arxiv:2510.27246) — broader memory evaluation suite.
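The "improves one benchmark, regresses another" check is mechanical enough to sketch. The noise-floor threshold below is an illustrative assumption; in practice you would derive it from observed run-to-run variance.

```python
def cross_benchmark_verdict(
    before: dict[str, float],
    after: dict[str, float],
    min_delta: float = 0.01,  # assumed noise floor, not a recommended value
) -> str:
    """Flag changes that win on one benchmark while losing on another."""
    deltas = {name: after[name] - before[name] for name in before}
    improved = [n for n, d in deltas.items() if d > min_delta]
    regressed = [n for n, d in deltas.items() if d < -min_delta]
    if improved and regressed:
        return "suspect: possible benchmark-specific overfit"
    if improved:
        return "improvement across the suite"
    return "no clear gain"
```

The "suspect" verdict is exactly the signal the human gate needs: a mixed result across independent benchmarks is the structural fingerprint of an overfit-shaped fix.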
Why human-in-the-loop on proposal selection specifically¶
Overfitting is recognisable from the fix, not from the score. A fix that hardcodes handling of a LoCoMo-specific question pattern will still produce a benchmark win — but a human reading the diff can see "this isn't a design change, it's a lookup table for the benchmark" and reject it.
Leaving the agent unsupervised on this step degrades the benchmark from a measurement instrument into a gradient signal the agent is directly optimising, which is a classic failure mode in ML-model training and translates 1:1 to prompt / pipeline iteration.
Anti-patterns¶
- Single-benchmark tuning. If only one benchmark is in the loop, every generalisation failure mode goes undetected. Multi-benchmark is a structural requirement.
- Agent approves its own proposals. Removes the only filter for overfit-shaped fixes.
- No distributional scoring. Treating a single run's score as evidence of improvement lets noise drive architecture changes.
- Treating benchmarks as a leaderboard, not a diagnostic. The loop works when benchmarks are used to find gaps; if they become the product goal, the whole discipline collapses back into overfitting.
Relation to LLM-as-judge evaluation¶
This pattern composes with LLM-as-judge. The benchmark can itself use LLM-as-judge for scoring (particularly on open-ended questions where exact-match is wrong); the generalising-vs-overfitting filter is independent of how the benchmark produces its scores.
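One way this composition can look, sketched under assumptions (the `judge` callable stands in for a real LLM grading call; the exact-match-first shape is our illustration, not the source's scorer):

```python
def score_answer(question: str, expected: str, actual: str, judge) -> float:
    """Score one benchmark item. Use exact match where it applies; fall back
    to an LLM judge for open-ended answers. `judge` is a placeholder callable
    returning a grade in [0, 1], standing in for an actual model call."""
    if actual.strip().lower() == expected.strip().lower():
        return 1.0
    return judge(question, expected, actual)
```

Whatever produces the per-item score, the human gate at step 4 operates one level up, on the shape of the proposed fix.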
Seen in¶
- sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory — canonical wiki instance; explicitly described loop, named benchmarks (LongMemEval, LoCoMo, BEAM), overfitting-vs-generalising discipline made load-bearing.
Related¶
- concepts/benchmark-methodology-bias — the failure mode this pattern's human-gate step filters out.
- concepts/llm-as-judge — complementary evaluation technique that can be used inside the benchmark.
- patterns/groundtruth-upper-bound-benchmark — sibling practice of establishing a benchmark ceiling before optimisation.
- systems/cloudflare-agent-memory — canonical realisation.