Refinement-round budget

Definition

Refinement-round budget is the bounded-iteration discipline of a judge-gated agent loop: every loop has a hard ceiling on the number of plan → implement → verify → refine cycles it may run, and the loop terminates at either judge satisfaction or budget exhaustion — whichever comes first.

The concept is a safety-net primitive for iterative plan refinement. Without a ceiling, non-converging loops (the Verifier keeps rejecting but Router-driven fixes don't address the underlying issue) run without bound; with one, cost is bounded but some inputs may return unfinished work.
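
A minimal sketch of that loop shape, assuming hypothetical planner/coder/verifier/router callables (the names are illustrative stand-ins, not DS-STAR's API):

```python
from dataclasses import dataclass
from typing import Any, Callable

MAX_ROUNDS = 10  # hard ceiling (DS-STAR's published value)

@dataclass
class Result:
    artifact: Any       # best plan/code produced so far
    approved: bool      # did the judge accept it?
    rounds_used: int

def refine_with_budget(task: Any,
                       planner: Callable, coder: Callable,
                       verifier: Callable, router: Callable,
                       max_rounds: int = MAX_ROUNDS) -> Result:
    """Plan -> implement -> verify -> refine until the judge approves
    or the budget is exhausted, whichever comes first."""
    plan = planner(task)
    artifact = None
    for round_no in range(1, max_rounds + 1):
        artifact = coder(task, plan)             # implement
        if verifier(task, artifact):             # judge satisfaction: early exit
            return Result(artifact, True, round_no)
        plan = router(task, plan, artifact)      # judge rejected: refine the plan
    return Result(artifact, False, max_rounds)   # budget exhaustion: best-effort
```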

Canonical numeric anchor

DS-STAR publishes the most detailed round-budget numbers on the wiki:

Parameter                                    Value   Source
Maximum rounds                               10      DS-STAR loop spec
Avg rounds, easy DABStep tasks               3.0     empirical
Avg rounds, hard DABStep tasks               5.6     empirical
Share of easy tasks completing in 1 round    >50%    empirical

"over half of the easy tasks were completed in just a single round" (Source: sources/2025-11-06-google-ds-star-versatile-data-science-agent).

Shape of the distribution

Round count is difficulty-conditioned:

  • Easy tasks (single file, answer locally extractable): distribution bunched at 1 round, tailing off.
  • Hard tasks (multiple files, cross-file reasoning): distribution centred further out, averaging nearly double the easy case.

The ceiling (10) is well above the hard-task average (5.6), so the loop rarely exhausts its budget on hard DABStep tasks, but the presence of the ceiling still matters for pathological inputs or judge failures.

Why it matters

  • Cost bounding. Each round = Planner + Coder + Verifier (+ Router on reject) inference. A runaway loop is expensive.
  • Latency bounding. For user-facing agents, the budget ceiling translates to a worst-case response time (see the arithmetic below).
  • Failure-mode articulation. Budget exhaustion is a distinct failure mode from judge rejection; end-user UX must distinguish "the agent tried 10 times and couldn't confirm a plan" from "the agent refused the task."
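
Both bounds fall out of one multiplication. The per-round figures below are invented for illustration; only MAX_ROUNDS and the round averages come from the table above:

```python
# Illustrative per-round figures (assumed, not from DS-STAR).
MAX_ROUNDS = 10
PER_ROUND_SECONDS = 8.0   # one Planner + Coder + Verifier (+ Router) pass
PER_ROUND_DOLLARS = 0.04

worst_case_latency = MAX_ROUNDS * PER_ROUND_SECONDS  # 80.0 s, hard upper bound
worst_case_cost    = MAX_ROUNDS * PER_ROUND_DOLLARS  # $0.40, hard upper bound

# Expected cost tracks the empirical round averages instead:
expected_cost_easy = 3.0 * PER_ROUND_DOLLARS  # $0.12
expected_cost_hard = 5.6 * PER_ROUND_DOLLARS  # ~$0.22
```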

Tradeoffs / gotchas

  • Ceiling calibration is empirical. DS-STAR's 10 is framed as a safety ceiling, well above the 5.6 hard-task average; a lower ceiling would cap cost more aggressively at the price of timing out on harder tail inputs.
  • On-budget-exhaustion behaviour is under-specified. The DS-STAR post says "the final code is delivered as the solution" on reaching the max rounds, i.e. deliver best-effort output even without Verifier approval. Other designs might error, retry, or escalate; the choice is a product UX decision (see the sketch after this list).
  • Doesn't catch judge-calibration drift. If the Verifier silently over-approves, rounds used drop and the budget looks healthy, but answers are worse. The budget is a cost control, not a correctness check.
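
A sketch of how a design might separate these outcomes and make the exhaustion policy explicit. The Outcome names and policy strings are hypothetical; only the "best_effort" branch mirrors DS-STAR's stated behaviour:

```python
from enum import Enum, auto

class Outcome(Enum):
    JUDGE_APPROVED   = auto()  # Verifier accepted within budget
    BUDGET_EXHAUSTED = auto()  # ran max rounds without approval
    TASK_REFUSED     = auto()  # agent declined before entering the loop

def on_exhaustion(artifact, policy: str = "best_effort"):
    """Hypothetical exhaustion policies. 'best_effort' mirrors DS-STAR:
    the final code is delivered as the solution, unverified."""
    if policy == "best_effort":
        return Outcome.BUDGET_EXHAUSTED, artifact  # return unapproved work
    if policy == "error":
        raise RuntimeError("round budget exhausted without judge approval")
    if policy == "escalate":
        return Outcome.BUDGET_EXHAUSTED, None      # e.g. hand off to a human queue
    raise ValueError(f"unknown policy: {policy!r}")
```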
