
PATTERN

Plan-Mode-then-implement agent loop

Problem

Unattended coding agents working on production-critical code — performance optimisation, refactors in widely-used infrastructure, schema migrations — exhibit five characteristic failure modes:

  1. Hyperfixation on first hypothesis.
  2. Microbenchmark-vs-end-to-end gap (a 97 % microbenchmark win that translates to a 0.02 % real-world win).
  3. No dogfood-loop awareness (doesn't test the system end-to-end even when the system supports it).
  4. No regression tests written.
  5. Doesn't use the --profile / diagnostic tools available to it.

(All five from Anthony Shew's 8-agent-phone-spawn review in the 2026-04-21 Turborepo post.)

A fully-autonomous "Ralph-Wiggum loop" (see ghuntley.com/ralph) — the agent loops indefinitely against its own output — is attractive in principle, but with current model-and-harness dependability it is too unreliable for production-critical work. Canonical Shew rejection verbatim: "The combination of the model, the harness, and the loop simply weren't dependable enough, and could move so much code out from underneath me too quickly. Maybe if I were working on a sideproject, I would have accepted it, but Turborepo powers some of the largest repositories in the world. I have to be fast and responsible."

The pattern

Separate the agent loop into three distinct stages with explicit human gates:

                 ┌──────────────────────────────────┐
                 │                                  │
                 ▼                                  │
   1. Agent (Plan Mode) profiles + analyses         │
                 │                                  │
                 ▼                                  │
   2. HUMAN reviews proposals ← rejects bad ones    │
                 │                                  │
                 ▼                                  │
   3. Agent implements the approved change          │
                 │                                  │
                 ▼                                  │
   4. hyperfine + sandbox end-to-end A/B validation │
                 │                                  │
                 ▼                                  │
   5. PR → code review → merge                      │
                 │                                  │
                 └──────────────(repeat)────────────┘

The three agent stages (analyse, implement, validate) are each single-task agent invocations, not a continuous loop; between them, the human decides what goes forward.
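The staged loop above can be written down as a driver. This is a minimal sketch, not the Turborepo harness: all four callables (`plan_agent`, `human_review`, `implement_agent`, `validate_e2e`) are hypothetical stand-ins for whatever your agent tooling and review process expose.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    hypothesis: str   # what the agent believes is slow and why
    hotspot: str      # where the profile says the time goes

def run_iteration(plan_agent, human_review, implement_agent, validate_e2e):
    """One gated iteration: plan -> human gate -> implement -> validate.

    Each agent call is a fresh, single-task invocation (no shared state
    across calls); the human gate decides which proposals go forward.
    """
    proposals = plan_agent()                              # 1. Plan Mode: profile + analyse
    approved = [p for p in proposals if human_review(p)]  # 2. human rejects bad hypotheses
    shipped = []
    for p in approved:
        patch = implement_agent(p)                        # 3. agent implements one proposal
        if validate_e2e(patch):                           # 4. end-to-end A/B gate
            shipped.append(patch)                         # 5. one PR per approved proposal
    return shipped
```

With stub callables, only the human-approved proposal that also passes validation comes out the other end.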

Full canonical workflow (Turborepo campaign)

Anthony Shew's exact workflow verbatim:

"1. Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output 2. Review the proposed optimizations and decide which ones were worth pursuing 3. Have the agent implement the good proposal(s) 4. Validate with end-to-end hyperfine benchmarks 5. Make a PR 6. Repeat"

Produced 20+ performance PRs in four days on the Turborepo Rust codebase — substantially more throughput than unaided engineering, substantially less risk than unattended autonomy.

Properties

  • Agent handles mechanical work — generating Markdown profile analysis, writing optimisation code, running benchmarks. Each is a narrowly-scoped task the agent is good at.
  • Human handles judgement calls — picking which hypothesis to pursue, rejecting hyperfixation, catching microbenchmark-vs-end-to-end pathologies.
  • End-to-end validation gate — sandbox hyperfine A/B is the non-negotiable step; only real wall-clock wins proceed.
  • One PR per approved proposal — no multi-change squashed commits, which preserves reviewability.
  • Stateless across iterations — each cycle is a fresh agent conversation; the source code itself carries the cumulative improvements into subsequent iterations.
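The wall-clock gate can be made mechanical. A minimal sketch under assumed conventions (the 1 % threshold is illustrative, not from the post): compare hyperfine-style timing samples for baseline and candidate, and pass only a real end-to-end improvement.

```python
from statistics import mean

def passes_e2e_gate(baseline_s, candidate_s, min_speedup=1.01):
    """Return True only if the candidate shows a real end-to-end
    wall-clock win over the baseline (lists of run times in seconds,
    as hyperfine would produce for an A/B comparison).

    A microbenchmark win that doesn't move these numbers fails here,
    which is exactly the microbench-vs-end-to-end pathology this
    step exists to catch.
    """
    return mean(baseline_s) / mean(candidate_s) >= min_speedup
```

For example, `passes_e2e_gate([10.2, 10.1, 10.3], [9.0, 9.1, 8.9])` passes (a ~13 % wall-clock win), while a 0.1 % improvement does not clear the threshold.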

Why Plan Mode specifically

Modern agent harnesses (Cursor, Claude Code, some Codex configurations) expose a Plan Mode that produces a written hypothesis / plan before writing code. The structural value:

  • The agent is forced to verbalise its reasoning, which makes hyperfixation visible (the agent's plan shows which hypothesis it's committed to, before code is written).
  • Reviewing a plan is faster than reviewing code — the human gate is cheap.
  • Bad plans can be rejected with a single sentence instead of reverted via PR.

Distinction from agent-driven-benchmark-loop

Agent-driven benchmark loop (Cloudflare Agent Memory, 2026-04-17) and this pattern share the human-gate-on-proposal-selection discipline but diverge on the validation axis:

                     agent-driven-benchmark-loop               plan-mode-then-implement
Domain               Retrieval / extraction / ranking quality  Performance engineering
Validation metric    Benchmark score (LoCoMo, LongMemEval)     Wall-clock end-to-end latency (hyperfine)
Failure mode gated   Overfitting the benchmark                 Microbench-vs-end-to-end gap
Agent task           Propose architectural changes             Propose + implement specific optimisations
Loop frequency       Multi-run per iteration (stochasticity)   Single run per iteration (deterministic wall-clock)

The two patterns are siblings operating at different levels of discipline; performance-engineering Plan Mode is the wall-clock-validated variant of the quality-benchmark-validated loop.

Distinction from unsupervised fan-out

Agent-spawn parallel exploration — Shew's overnight 8-agent experiment that preceded the supervised loop — is the unsupervised counterpart:

                     agent-spawn-parallel                          plan-mode-then-implement
Supervision          None                                          Human gates each iteration
Hypothesis source    N parallel agents generate independently      One agent proposes, human selects
Yield rate           ~37 % (3 of 8 became shippable)               High (human filters before implementation)
Risk                 Low (sleep through it; prune in the morning)  Moderate (human must be present)
Good for             Exploration at low prompt quality             Execution at a known hot path

They compose — use spawn-parallel to explore the hypothesis space, then use Plan-Mode-then-implement to execute the chosen hypotheses.
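The composition above can be sketched as a two-phase driver. All names here are hypothetical illustrations: `spawn_agents` stands in for the overnight fan-out, `prune` for the morning triage, and `gated_loop` for the supervised Plan-Mode-then-implement cycle.

```python
def explore_then_execute(spawn_agents, prune, gated_loop, n=8):
    """Phase 1: unsupervised fan-out generates hypotheses in parallel.
    Phase 2: survivors of the human triage go through the supervised,
    human-gated Plan-Mode loop one at a time.
    """
    candidates = [spawn() for spawn in spawn_agents(n)]  # overnight exploration
    survivors = [c for c in candidates if prune(c)]      # prune in the morning (~3 of 8)
    return [gated_loop(s) for s in survivors]            # gated execution per survivor
```

The key design point is that the two phases never interleave: exploration is allowed to be wasteful because nothing it produces ships without passing through the gated loop.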

Anti-patterns

  • Skip the Plan Mode step. Agent implements without a verbalised plan → human has to read the generated code to figure out what the agent thought it was doing → bad proposals waste more time than good plans save.
  • Skip the end-to-end validation gate. Agents optimise the microbenchmark, not the system; without hyperfine A/B the loop degrades to microbench optimisation.
  • Multi-change proposals. Agent proposes "3 changes in one PR" — reviewability drops, and one bad change contaminates the good ones.
  • No human gate. Becomes a Ralph-Wiggum loop with all its failure modes. Works for throwaway code, not production-critical code at current model dependability.
  • Too-coarse validation workload. If the hyperfine 'turbo run build --dry' workload isn't representative of the target use case, end-to-end validation gives false positives — the optimisation wins on the benchmark workload but not on the production workload.

Composition

This pattern is the orchestration layer; it composes with agent-spawn-parallel (upstream hypothesis exploration) and agent-driven-benchmark-loop (the quality-benchmark-validated sibling) described above.

Seen in

  • Making Turborepo 96 % faster (Vercel, 2026-04-21) — canonical wiki instance; definitional source; 20+ PRs in 4 days; explicit five-step loop canonicalised verbatim. The Ralph-Wiggum rejection framing is here too.