
PATTERN

Plan-Mode-then-implement agent loop

Problem

Unattended coding agents working on production-critical code — performance optimisation, refactors in widely-used infrastructure, schema migrations — exhibit five characteristic failure modes:

  1. Hyperfixation on first hypothesis.
  2. Microbenchmark-vs-end-to-end gap (a 97 % microbenchmark win that translates to a 0.02 % real-world win).
  3. No dogfood-loop awareness (doesn't test the system end-to-end even when the system supports it).
  4. No regression tests written.
  5. Doesn't use the --profile / diagnostic tools available to it.

(All five from Anthony Shew's 8-agent-phone-spawn review in the 2026-04-21 Turborepo post.)

A fully-autonomous "Ralph-Wiggum loop" (see ghuntley.com/ralph) — the agent loops indefinitely against its own output — is attractive in principle, but with current model-and-harness dependability it is too unreliable for production-critical work. Canonical Shew rejection verbatim: "The combination of the model, the harness, and the loop simply weren't dependable enough, and could move so much code out from underneath me too quickly. Maybe if I were working on a sideproject, I would have accepted it, but Turborepo powers some of the largest repositories in the world. I have to be fast and responsible."

The pattern

Separate the agent loop into three distinct stages with explicit human gates:

                 ┌──────────────────────────────────┐
                 │                                  │
                 ▼                                  │
   1. Agent (Plan Mode) profiles + analyses         │
                 │                                  │
                 ▼                                  │
   2. HUMAN reviews proposals ← rejects bad ones    │
                 │                                  │
                 ▼                                  │
   3. Agent implements the approved change          │
                 │                                  │
                 ▼                                  │
   4. hyperfine + sandbox end-to-end A/B validation │
                 │                                  │
                 ▼                                  │
   5. PR → code review → merge                      │
                 │                                  │
                 └──────────────(repeat)────────────┘

The three agent stages (analyse, implement, validate) are each single-task agent invocations, not a continuous loop; between them, the human decides what goes forward.
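The staged loop above can be written down as a driver. This is a minimal sketch, not the Turborepo harness: all four callables (`plan_agent`, `human_review`, `implement_agent`, `validate_e2e`) are hypothetical stand-ins for whatever your agent tooling and review process expose.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    hypothesis: str   # what the agent believes is slow and why
    hotspot: str      # where the profile says the time goes

def run_iteration(plan_agent, human_review, implement_agent, validate_e2e):
    """One gated iteration: plan -> human gate -> implement -> validate.

    Each agent call is a fresh, single-task invocation (no shared state
    across calls); the human gate decides which proposals go forward.
    """
    proposals = plan_agent()                              # 1. Plan Mode: profile + analyse
    approved = [p for p in proposals if human_review(p)]  # 2. human rejects bad hypotheses
    shipped = []
    for p in approved:
        patch = implement_agent(p)                        # 3. agent implements one proposal
        if validate_e2e(patch):                           # 4. end-to-end A/B gate
            shipped.append(patch)                         # 5. one PR per approved proposal
    return shipped
```

With stub callables, only the human-approved proposal that also passes validation comes out the other end.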

Full canonical workflow (Turborepo campaign)

Anthony Shew's exact workflow verbatim:

"1. Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output 2. Review the proposed optimizations and decide which ones were worth pursuing 3. Have the agent implement the good proposal(s) 4. Validate with end-to-end hyperfine benchmarks 5. Make a PR 6. Repeat"

Produced 20+ performance PRs in four days on the Turborepo Rust codebase — substantially more throughput than unaided engineering, substantially less risk than unattended autonomy.

Properties

  • Agent handles mechanical work — generating Markdown profile analysis, writing optimisation code, running benchmarks. Each is a narrowly-scoped task the agent is good at.
  • Human handles judgement calls — picking which hypothesis to pursue, rejecting hyperfixation, catching microbenchmark-vs-end-to-end pathologies.
  • End-to-end validation gate — sandbox hyperfine A/B is the non-negotiable step; only real wall-clock wins proceed.
  • One PR per approved proposal — no multi-change squashed commits, which preserves reviewability.
  • Stateless across iterations — each cycle is a fresh agent conversation; the source code itself carries the cumulative improvements into subsequent iterations.
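The wall-clock gate can be made mechanical. A minimal sketch under assumed conventions (the 1 % threshold is illustrative, not from the post): compare hyperfine-style timing samples for baseline and candidate, and pass only a real end-to-end improvement.

```python
from statistics import mean

def passes_e2e_gate(baseline_s, candidate_s, min_speedup=1.01):
    """Return True only if the candidate shows a real end-to-end
    wall-clock win over the baseline (lists of run times in seconds,
    as hyperfine would produce for an A/B comparison).

    A microbenchmark win that doesn't move these numbers fails here,
    which is exactly the microbench-vs-end-to-end pathology this
    step exists to catch.
    """
    return mean(baseline_s) / mean(candidate_s) >= min_speedup
```

For example, `passes_e2e_gate([10.2, 10.1, 10.3], [9.0, 9.1, 8.9])` passes (a ~13 % wall-clock win), while a 0.1 % improvement does not clear the threshold.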

Why Plan Mode specifically

Modern agent harnesses (Cursor, Claude Code, some Codex configurations) expose a Plan Mode that produces a written hypothesis / plan before writing code. The structural value:

  • The agent is forced to verbalise its reasoning, which makes hyperfixation visible (the agent's plan shows which hypothesis it's committed to, before code is written).
  • Reviewing a plan is faster than reviewing code — the human gate is cheap.
  • Bad plans can be rejected with a single sentence instead of reverted via PR.

Distinction from agent-driven-benchmark-loop

Agent-driven benchmark loop (Cloudflare Agent Memory, 2026-04-17) and this pattern share the human-gate-on-proposal-selection discipline but diverge on the validation axis:

                     agent-driven-benchmark-loop               plan-mode-then-implement
Domain               Retrieval / extraction / ranking quality  Performance engineering
Validation metric    Benchmark score (LoCoMo, LongMemEval)     Wall-clock end-to-end latency (hyperfine)
Failure mode gated   Overfitting the benchmark                 Microbench-vs-end-to-end gap
Agent task           Propose architectural changes             Propose + implement specific optimisations
Loop frequency       Multi-run per iteration (stochasticity)   Single run per iteration (deterministic wall-clock)

The two patterns are siblings operating at different levels of discipline; performance-engineering Plan Mode is the wall-clock-validated variant of the quality-benchmark-validated loop.

Distinction from unsupervised fan-out

Agent-spawn parallel exploration — Shew's overnight 8-agent experiment that preceded the supervised loop — is the unsupervised counterpart:

                     agent-spawn-parallel                          plan-mode-then-implement
Supervision          None                                          Human gates each iteration
Hypothesis source    N parallel agents generate independently      One agent proposes, human selects
Yield rate           ~37 % (3 of 8 became shippable)               High (human filters before implementation)
Risk                 Low (sleep through it; prune in the morning)  Moderate (human must be present)
Good for             Exploration at low prompt quality             Execution at a known hot path

They compose — use spawn-parallel to explore the hypothesis space, then use Plan-Mode-then-implement to execute the chosen hypotheses.
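The composition above can be sketched as a two-phase driver. All names here are hypothetical illustrations: `spawn_agents` stands in for the overnight fan-out, `prune` for the morning triage, and `gated_loop` for the supervised Plan-Mode-then-implement cycle.

```python
def explore_then_execute(spawn_agents, prune, gated_loop, n=8):
    """Phase 1: unsupervised fan-out generates hypotheses in parallel.
    Phase 2: survivors of the human triage go through the supervised,
    human-gated Plan-Mode loop one at a time.
    """
    candidates = [spawn() for spawn in spawn_agents(n)]  # overnight exploration
    survivors = [c for c in candidates if prune(c)]      # prune in the morning (~3 of 8)
    return [gated_loop(s) for s in survivors]            # gated execution per survivor
```

The key design point is that the two phases never interleave: exploration is allowed to be wasteful because nothing it produces ships without passing through the gated loop.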

Anti-patterns

  • Skip the Plan Mode step. Agent implements without a verbalised plan → human has to read the generated code to figure out what the agent thought it was doing → bad proposals waste more time than good plans save.
  • Skip the end-to-end validation gate. Agents optimise the microbenchmark, not the system; without hyperfine A/B the loop degrades to microbench optimisation.
  • Multi-change proposals. Agent proposes "3 changes in one PR" — reviewability drops, and one bad change contaminates the good ones.
  • No human gate. Becomes a Ralph-Wiggum loop with all its failure modes. Works for throwaway code, not production-critical code at current model dependability.
  • Too-coarse validation workload. If the hyperfine 'turbo run build --dry' workload isn't representative of the target use case, end-to-end validation gives false positives — the optimisation wins on the benchmark workload but not on the production workload.

Composition

This pattern is the orchestration layer; it composes with agent-spawn-parallel (upstream hypothesis exploration) and agent-driven-benchmark-loop (the quality-benchmark-validated sibling) described above.

Seen in

  • Making Turborepo 96 % faster (Vercel, 2026-04-21) — canonical wiki instance; definitional source; 20+ PRs in 4 days; explicit five-step loop canonicalised verbatim. The Ralph-Wiggum rejection framing is here too.