CONCEPT Cited by 3 sources

Iterative prompt refinement

Definition

Iterative prompt refinement is the control-loop pattern in which a generative model's output is scored against a rubric (typically by a judge LLM or VLM); on a fail, the failing rubric dimensions are fed back into the prompt generator, which produces a revised prompt and re-runs the generator. The loop continues until the output passes the threshold or a round budget is exhausted.

Structurally it is a closed-loop control system: the judge's failure signal becomes part of the next prompt's input, rather than being discarded for a blind retry.

The four-step reference loop (from Instacart PIXEL)

  1. Generate first-pass output with a starting prompt.
  2. Score the output against curated evaluation questions (typically generated by an LLM for the project's needs).
  3. Decide: pass threshold met → ship; otherwise → continue.
  4. Refine: feed the failed questions' text back into the prompt-generator LLM so the revised prompt addresses the gap. Return to step 1.
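The four steps above can be sketched as a bounded control loop. Everything here (function names, threshold, budget, and the stub generator/judge pair) is illustrative, not PIXEL's actual implementation:

```python
def refine_until_pass(prompt, generate, judge, refine_prompt,
                      pass_threshold=0.8, max_rounds=3):
    """Generate -> score -> decide -> refine, until pass or budget exhausted."""
    for round_no in range(1, max_rounds + 1):
        output = generate(prompt)               # step 1: first/next pass
        score, failed = judge(output)           # step 2: rubric questions
        if score >= pass_threshold:             # step 3: decide -> ship
            return output, round_no
        prompt = refine_prompt(prompt, failed)  # step 4: feed failures back
    return output, max_rounds                   # budget spent: best effort

# Stub generator/judge pair, just to exercise the loop shape.
def generate(prompt):
    return f"render[{prompt}]"

def judge(output):
    failed = [] if "lighting" in output else ["Is the lighting warm and neutral?"]
    return 1.0 - 0.5 * len(failed), failed

def refine_prompt(prompt, failed):
    return prompt + " | must address: " + "; ".join(failed)
```

With these stubs, `refine_until_pass("studio product shot", generate, judge, refine_prompt)` passes on round 2, because the round-2 prompt now names the failed lighting question.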

See also concepts/vlm-as-image-judge (image-scoring side of this loop) + concepts/llm-as-judge (text-scoring generalisation) + concepts/iterative-plan-refinement (plan-sufficiency sibling from DS-STAR).

Why feeding failure signal into the prompt beats discard-and-retry

Naive discard-and-retry samples the generator's output distribution again with the same prompt — progress depends on stochastic luck.

Feeding the failed questions' text into the prompt shifts the distribution itself toward outputs that address the missing dimension. A "warm neutral background" failure means the next round's prompt explicitly names lighting and background. The judge's signal is the gradient; the prompt generator is the optimiser.
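One way the "failed questions into prompt" step can look in practice, as a hypothetical meta-prompt builder (the real wording is not disclosed):

```python
def revision_request(base_prompt, failed_questions):
    """Build the instruction sent to the prompt-generator LLM.
    Wording is illustrative, not any vendor's actual meta-prompt."""
    gaps = "\n".join(f"- {q}" for q in failed_questions)
    return (
        "The prompt below produced an output that failed these rubric questions:\n"
        f"{gaps}\n\n"
        f"Original prompt:\n{base_prompt}\n\n"
        "Rewrite the prompt so the output explicitly addresses every gap."
    )
```

The point is that the failure text travels verbatim into the next prompt-generation call, so the missing dimension is named rather than re-sampled for.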

Tradeoffs / gotchas

  • Budget is load-bearing. Without a round budget, the loop can consume unbounded inference cost on edge cases the generator structurally can't hit. See concepts/refinement-round-budget for the analogous DS-STAR framing (10-round cap).
  • Prompt drift. Long refinement chains can introduce contradictions as each round adds constraints without pruning. Occasional reset-to-baseline is often necessary.
  • Judge agreement with humans. If the judge's rubric diverges from human preference, refinement optimises for the judge, not the user. PIXEL's 20% → 85% human-judge approval-rate figure characterises this alignment only for Instacart's specific domain.
  • Cost compounds. A 3-round median refinement loop is 3× the inference cost of single-shot generation.
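The cost-compounding point is easy to make concrete. Assuming each round pays one generation call and one judge call, and every round after the first also pays a prompt-refinement call (unit costs below are made up):

```python
def loop_cost(rounds, gen_cost, judge_cost, refine_cost):
    """Total inference cost of a refinement loop that runs `rounds` rounds."""
    return rounds * (gen_cost + judge_cost) + (rounds - 1) * refine_cost

single_shot = loop_cost(1, gen_cost=1.0, judge_cost=0.2, refine_cost=0.1)
three_round = loop_cost(3, gen_cost=1.0, judge_cost=0.2, refine_cost=0.1)
```

With these illustrative units, three rounds cost 3.8 versus 1.2 for a judged single shot: the 3× generation multiplier plus judge and refinement overhead.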

Seen in

  • sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform · canonical wiki instance at the image-generation layer. Instacart PIXEL ships the full four-step loop: LLM-prompt → generate → VLM-score → failed-questions-into-prompt → regenerate. Raises human-judge approval rate from 20% → 85%. Round budget and convergence iteration count not disclosed. "We generate a first pass of images with a prompt generated by LLM. We judge the image output using a curated set of evaluation questions that are generated by an LLM, based on the project needs. We then pass the questions and the image to a VLM for evaluation. We make a decision whether or not to use the image based on the number of questions which passed from the evaluation. If the image fails the evaluation, we incorporate the failed questions into the prompt generator LLM to generate a revised prompt for the image generation model and we repeat these steps until the image passes our threshold."

  • sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms · text-extraction application of the same loop-shape. Instacart PARSE makes prompt iteration a first-class operation: a "simple" attribute (organic claim) hits 95% accuracy on the first prompt (1 day vs. 1 week traditional); a "complex" attribute (low_sugar claim) requires multiple iterations (3 days via the PARSE UI). The iteration itself is human-authored rather than loop-driven — but PARSE's roadmap cites the same literature ([6] "LARGE LANGUAGE MODELS AS OPTIMIZERS", [7] "EVOPROMPT") as the direction for closing the loop automatically. Canonical wiki instance of the effort-per-attribute framing: prompt-tuning time and LLM-size choice are both attribute-dependent decisions.

  • sources/2026-02-19-lyft-scaling-localization-with-ai · canonical wiki instance at the text-translation layer. Lyft's AI localization pipeline ships the loop as a Drafter + Evaluator pair: the Drafter generates 3 candidates per source string (see patterns/multi-candidate-generation); the Evaluator grades each on a 4-dim rubric (accuracy/clarity, fluency/adaptation, brand alignment, technical correctness); on all-fail the per-candidate critique text feeds back to the Drafter for another attempt, up to 3 total. Canonical on-wiki defence of the "separate Drafter from Evaluator to break self-approval bias" argument. Before/after quality numbers not disclosed (contrast PIXEL's 20% → 85%).
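A sketch of the Drafter/Evaluator split. The candidate count and attempt cap come from the write-up; all names and the stub brand-alignment rule are hypothetical:

```python
RUBRIC = ("accuracy/clarity", "fluency/adaptation",
          "brand alignment", "technical correctness")

def localize(source, draft, evaluate, n_candidates=3, max_attempts=3):
    """Drafter proposes candidates; a separate Evaluator grades them.
    On all-fail, per-candidate critiques feed the next drafting attempt."""
    critiques = []
    for _ in range(max_attempts):
        for candidate in draft(source, critiques, n=n_candidates):
            passed, critique = evaluate(candidate)  # grade on the 4-dim rubric
            if passed:
                return candidate
            critiques.append(critique)
    return None  # budget exhausted: escalate to human review

# Stubs: the evaluator enforces a single brand-alignment rule.
def evaluate(candidate):
    if "ride" in candidate:
        return True, ""
    return False, "brand alignment: say 'ride', not 'trip'"

def draft(source, critiques, n=3):
    fixed = source.replace("trip", "ride") if critiques else source
    return [fixed] * n
```

Keeping `draft` and `evaluate` as separate callables is the structural point: the model that writes a candidate never grades its own work, which is the self-approval-bias defence the bullet describes.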
