
Iterative plan refinement

Definition

Iterative plan refinement is the agent-loop discipline where a Planner LLM produces a plan, a Coder LLM executes it, and a Verifier LLM judges plan sufficiency; on a failed judgment the plan is either extended or repaired before the loop repeats, bounded by a refinement-round budget.

The shape is named against two failure modes it replaces:

  • One-shot plan-and-generate. Commit to the whole plan up front, execute, return whatever comes out. Breaks on open-ended problems without ground-truth labels, where the plan may be architecturally wrong and there is no automated way to detect that from the output.
  • Extend-only refinement. Run a step, check whether the output suffices; if not, add a new step and repeat. Accumulates flawed steps — DS-STAR's ablation Variant 2 removed the fix-vs-add decision and measured worse performance on both easy and hard tasks: "it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Canonical shape

  1. Plan — Planner agent emits a high-level plan.
  2. Implement — Coder agent turns the plan into executable code and runs it, capturing intermediate results.
  3. Judge — Verifier agent (LLM judge) inspects the plan and intermediate results; returns sufficient or insufficient.
  4. Branch — if insufficient, a Router agent chooses between:
       • Extend — append a new step.
       • Fix — repair or replace an existing step.
  5. Repeat — go to step 2 with the updated plan.
  6. Terminate — when the Verifier passes, or the budget is exhausted.
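
The loop above can be sketched in Python. This is a minimal sketch, not DS-STAR's actual implementation: the agent interfaces (`plan_fn`, `code_fn`, `judge_fn`, `route_fn`) and the `Plan` container are illustrative stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    steps: list = field(default_factory=list)

def refine(task, plan_fn, code_fn, judge_fn, route_fn, max_rounds=10):
    # Plan: the Planner emits an initial high-level plan.
    plan = plan_fn(task)
    results = None
    for round_no in range(1, max_rounds + 1):
        # Implement: the Coder executes the plan, capturing intermediate results.
        results = code_fn(plan)
        # Judge: the Verifier returns sufficient / insufficient.
        if judge_fn(plan, results):
            return plan, results, round_no   # terminate on verifier pass
        # Branch: the Router picks extend vs fix.
        action, index, step = route_fn(plan, results)
        if action == "extend":
            plan.steps.append(step)          # Extend: append a new step
        else:
            plan.steps[index] = step         # Fix: repair an existing step
    return plan, results, max_rounds         # budget exhausted: best-so-far
```

With stub agents, a toy run converges once the Router has repaired the flawed step and the plan reaches whatever the stub Verifier demands; the round count falls out of the loop for free, which is what makes the difficulty-conditioned round statistics below cheap to collect.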

Why it matters

  • Open-ended problems lack ground truth. Data-science and analytics tasks frequently have no single "correct" answer; a judge operating on plan structure is the only feasible proxy for "would an expert agree this is the right approach?"
  • Intermediate inspection mimics expert workflow. DS-STAR's post names Google Colab explicitly: "mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
  • Hard problems take more rounds; easy problems finish fast. The round-count distribution is empirically difficulty-conditioned — 3.0 avg rounds on easy DABStep tasks vs 5.6 on hard, with 50 % of easy tasks completing in a single round. See concepts/refinement-round-budget.

Distinguishing primitive: add-or-fix branch

The Router's add-vs-fix decision is what makes iterative plan refinement different from extend-only agent loops. Removing the Router — forcing the system to only append steps — is DS-STAR's ablation Variant 2, and it degrades both easy and hard task performance (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Operationally: extend-only loops accumulate their mistakes; add-or-fix loops revise them. The latter is robust to bad steps early in the plan; the former is not.
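
The operational difference can be made concrete with a toy plan whose first step is wrong. The step names here are invented for illustration; only the append-vs-replace mechanics matter.

```python
def extend_only(plan, new_step):
    # Extend-only loops can only append; a flawed early step is never revisited.
    return plan + [new_step]

def add_or_fix(plan, action, step, index=None):
    # Add-or-fix loops can also replace the offending step in place.
    if action == "extend":
        return plan + [step]
    repaired = list(plan)
    repaired[index] = step
    return repaired

broken = ["join_on_wrong_key", "aggregate"]
still_broken = extend_only(broken, "filter_outliers")  # bad step persists
repaired = add_or_fix(broken, "fix", "join_on_id", 0)  # bad step repaired
```

However many steps `extend_only` appends, `join_on_wrong_key` stays in the plan and poisons everything downstream; `add_or_fix` removes it in one round.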

Tradeoffs / gotchas

  • The judge is expensive. Every round pays a Planner + Coder + Verifier (+ Router on fail) inference. A 5.6-round hard-task run is at least 5 Verifier evaluations on top of 5 Planner + 5 Coder calls — cost model matters.
  • The judge can be wrong. A Verifier that over-approves finishes fast and ships bad answers; one that under-approves drives the loop to the round-budget ceiling. Calibration against human graders is a prerequisite (not disclosed in the DS-STAR post).
  • Budget ceiling as escape valve. The 10-round ceiling is a bounded-iteration safety net against non-converging loops; the handling on budget-exhaustion (return best-so-far? return insufficient?) shapes end-user failure modes and is often under-specified.
  • Rich pre-loop context matters. DS-STAR's ablation shows the Data File Analyzer — a pre-loop context extractor — is load-bearing. Without it, hard-task accuracy collapses from 45.2 % to 26.98 %. The Planner's first plan is only as good as the context it starts from; iterative refinement can recover some context gaps but not all. See concepts/data-file-analysis.
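
A back-of-envelope cost model under the per-round pattern named above (one Planner + one Coder + one Verifier call per round, plus a Router call on each failed round). The function is a sketch of that arithmetic, not a measured cost:

```python
def call_count(rounds, converged=True):
    # Each round: one Planner + one Coder + one Verifier call.
    # Each *failed* round additionally pays one Router call; when the
    # loop converges, the final round passes and skips the Router.
    failed = rounds - 1 if converged else rounds
    return 3 * rounds + failed

easy = call_count(3)                       # ~3.0 avg rounds on easy DABStep tasks
hard = call_count(6)                       # hard tasks average 5.6 rounds
ceiling = call_count(10, converged=False)  # budget exhausted at the 10-round cap
```

Even on an average easy task the loop pays roughly a dozen LLM calls, and a budget-exhausted hard task pays several times that — which is why the Verifier's per-call cost dominates the deployment economics.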
