
Iterative plan refinement

Definition

Iterative plan refinement is the agent-loop discipline where a Planner LLM produces a plan, a Coder LLM executes it, and a Verifier LLM judges plan sufficiency; on a failed judgment the plan is either extended or repaired before the loop repeats, bounded by a refinement-round budget.

The shape is named against two failure modes it replaces:

  • One-shot plan-and-generate. Commit to the whole plan up front, execute, return whatever comes out. Breaks on open-ended problems without ground-truth labels, where the plan may be architecturally wrong and there is no automated way to detect that from the output.
  • Extend-only refinement. Run a step, check whether the output suffices; if not, add a new step and repeat. Accumulates flawed steps — DS-STAR's ablation Variant 2 removed the fix-vs-add decision and measured worse performance on both easy and hard tasks: "it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Canonical shape

  1. Plan — Planner agent emits a high-level plan.
  2. Implement — Coder agent turns the plan into executable code and runs it, capturing intermediate results.
  3. Judge — Verifier agent (LLM judge) inspects the plan and intermediate results; returns sufficient or insufficient.
  4. Branch — if insufficient, a Router agent chooses between:
       • Extend — append a new step.
       • Fix — repair or replace an existing step.
  5. Repeat — go to step 2 with the updated plan.
  6. Terminate — when the Verifier passes, or the budget is exhausted.
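
The loop above can be sketched in Python. This is a minimal sketch, not DS-STAR's actual implementation: the agent interfaces (`plan_fn`, `code_fn`, `judge_fn`, `route_fn`) and the `Plan` container are illustrative stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    steps: list = field(default_factory=list)

def refine(task, plan_fn, code_fn, judge_fn, route_fn, max_rounds=10):
    # Plan: the Planner emits an initial high-level plan.
    plan = plan_fn(task)
    results = None
    for round_no in range(1, max_rounds + 1):
        # Implement: the Coder executes the plan, capturing intermediate results.
        results = code_fn(plan)
        # Judge: the Verifier returns sufficient / insufficient.
        if judge_fn(plan, results):
            return plan, results, round_no   # terminate on verifier pass
        # Branch: the Router picks extend vs fix.
        action, index, step = route_fn(plan, results)
        if action == "extend":
            plan.steps.append(step)          # Extend: append a new step
        else:
            plan.steps[index] = step         # Fix: repair an existing step
    return plan, results, max_rounds         # budget exhausted: best-so-far
```

With stub agents, a toy run converges once the Router has repaired the flawed step and the plan reaches whatever the stub Verifier demands; the round count falls out of the loop for free, which is what makes the difficulty-conditioned round statistics below cheap to collect.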

Why it matters

  • Open-ended problems lack ground truth. Data-science and analytics tasks frequently have no single "correct" answer; a judge operating on plan structure is the only feasible proxy for "would an expert agree this is the right approach?"
  • Intermediate inspection mimics expert workflow. DS-STAR's post names Google Colab explicitly: "mimics how an expert analyst uses tools like Google colab to build a plan sequentially, reviewing intermediate results before proceeding" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).
  • Hard problems take more rounds; easy problems finish fast. The round-count distribution is empirically difficulty-conditioned — 3.0 avg rounds on easy DABStep tasks vs 5.6 on hard, with 50 % of easy tasks completing in a single round. See concepts/refinement-round-budget.

Distinguishing primitive: add-or-fix branch

The Router's add-vs-fix decision is what makes iterative plan refinement different from extend-only agent loops. Removing the Router — forcing the system to only append steps — is DS-STAR's ablation Variant 2, and it degrades both easy and hard task performance (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Operationally: extend-only loops accumulate their mistakes; add-or-fix loops revise them. The latter is robust to bad steps early in the plan; the former is not.
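
The operational difference can be made concrete with a toy plan whose first step is wrong. The step names here are invented for illustration; only the append-vs-replace mechanics matter.

```python
def extend_only(plan, new_step):
    # Extend-only loops can only append; a flawed early step is never revisited.
    return plan + [new_step]

def add_or_fix(plan, action, step, index=None):
    # Add-or-fix loops can also replace the offending step in place.
    if action == "extend":
        return plan + [step]
    repaired = list(plan)
    repaired[index] = step
    return repaired

broken = ["join_on_wrong_key", "aggregate"]
still_broken = extend_only(broken, "filter_outliers")  # bad step persists
repaired = add_or_fix(broken, "fix", "join_on_id", 0)  # bad step repaired
```

However many steps `extend_only` appends, `join_on_wrong_key` stays in the plan and poisons everything downstream; `add_or_fix` removes it in one round.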

Tradeoffs / gotchas

  • The judge is expensive. Every round pays a Planner + Coder + Verifier (+ Router on fail) inference. A 5.6-round hard-task run is at least 5 Verifier evaluations on top of 5 Planner + 5 Coder calls — cost model matters.
  • The judge can be wrong. A Verifier that over-approves finishes fast and ships bad answers; one that under-approves drives the loop to the round-budget ceiling. Calibration against human graders is a prerequisite (not disclosed in the DS-STAR post).
  • Budget ceiling as escape valve. The 10-round ceiling is a bounded-iteration safety net against non-converging loops; the handling on budget-exhaustion (return best-so-far? return insufficient?) shapes end-user failure modes and is often under-specified.
  • Rich pre-loop context matters. DS-STAR's ablation shows the Data File Analyzer — a pre-loop context extractor — is load-bearing. Without it, hard-task accuracy collapses from 45.2 % to 26.98 %. The Planner's first plan is only as good as the context it starts from; iterative refinement can recover some context gaps but not all. See concepts/data-file-analysis.
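
A back-of-envelope cost model under the per-round pattern named above (one Planner + one Coder + one Verifier call per round, plus a Router call on each failed round). The function is a sketch of that arithmetic, not a measured cost:

```python
def call_count(rounds, converged=True):
    # Each round: one Planner + one Coder + one Verifier call.
    # Each *failed* round additionally pays one Router call; when the
    # loop converges, the final round passes and skips the Router.
    failed = rounds - 1 if converged else rounds
    return 3 * rounds + failed

easy = call_count(3)                       # ~3.0 avg rounds on easy DABStep tasks
hard = call_count(6)                       # hard tasks average 5.6 rounds
ceiling = call_count(10, converged=False)  # budget exhausted at the 10-round cap
```

Even on an average easy task the loop pays roughly a dozen LLM calls, and a budget-exhausted hard task pays several times that — which is why the Verifier's per-call cost dominates the deployment economics.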
