
CONCEPT

Iterative prompt methodology

Definition

Iterative prompt methodology is the discipline of discovering a production prompt offline, through a sequence of discrete experiments: each experiment evaluates one prompt structure against a held-out sample and names the failure mode, and a human authors the next experiment to address that mode. The output is a frozen prompt shipped into production.

Distinct from concepts/iterative-prompt-refinement, where a judge LLM feeds failure signal into a generator LLM at inference time and every production call runs the loop.

The two timescales

|  | Iterative prompt methodology | Iterative prompt refinement |
|---|---|---|
| When it runs | Offline, during development | Online, per request |
| Author of the next iteration | Human | Judge LLM |
| Loop termination | Project delivery | Pass threshold or budget exhausted |
| Production artefact | Single frozen prompt | Closed-loop generator + judge |
| Example | Zalando UI migration (5 experiments) | Instacart PIXEL (runtime loop) |
| Cost implication | One-time dev cost | Per-request multiple |
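The contrast in timescales can be sketched as two loop shapes. This is an illustrative stub only: `call_llm`, `evaluate`, and `judge` are hypothetical stand-ins for a model call, a sample-set scorer, and a judge LLM, not any real API.

```python
# Illustrative stubs: call_llm, evaluate, and judge are hypothetical
# stand-ins, not a real API.

def call_llm(prompt: str, request: str) -> str:
    return f"output for {request!r} under prompt {prompt!r}"  # stub

# Offline methodology: the loop runs during development; production
# receives one frozen prompt and pays no per-request loop cost.
def develop_prompt(variants, samples, evaluate):
    for prompt in variants:                    # each variant human-authored
        if evaluate(prompt, samples) >= 0.95:  # "high", not perfect
            return prompt                      # frozen and shipped
    return variants[-1]

# Online refinement: the loop runs inside every production call and
# terminates on a pass verdict or when the iteration budget is exhausted.
def refine_at_runtime(request: str, judge, budget: int = 3) -> str:
    prompt = "base prompt"
    output = call_llm(prompt, request)
    for _ in range(budget):
        verdict = judge(output)
        if verdict == "pass":
            break
        prompt += f"\nFix: {verdict}"          # judge signal re-enters prompt
        output = call_llm(prompt, request)
    return output
```

The cost row in the table falls out directly: `develop_prompt` runs once, `refine_at_runtime` runs (up to `budget + 1` model calls) on every request.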

Zalando's canonical five-experiment arc

From sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries:

  1. Experiment 1 — source code only. Hand both source and target source code to the LLM, ask it to migrate. "This produced inconsistent results with numerous errors." Hypothesis: prompt needs the LLM to do too many intermediate steps in one pass.

  2. Experiment 2 — interface only. Pre-generate a typed interface for each component, hand the interface + file to the LLM. Still low accuracy. "Even though the interface was detailed, it lacked essential information present in the original source code that was necessary for complete component transformation." Hypothesis: interface isn't specific enough; need explicit mapping.

  3. Experiment 3 — interface + auto-mapping. Hand the interface plus an LLM-generated mapping (source attribute → target attribute). "The code was transformed with medium accuracy, but revealed flaws in the automated mapping instructions." Canonical failure: the size="medium" → size="medium" direct-name mapping when the visually-correct mapping is size="medium" → size="large". Hypothesis: mapping needs human verification.

  4. Experiment 4 — interface + manually-verified mapping. Pair programmers + designers verify every attribute mapping against rendered outputs. "This improved accuracy even further for transforming basic components, but for complex components requiring substantial code restructuring it still had issues." Hypothesis: abstract rules aren't concrete enough; need worked examples.

  5. Experiment 5 — interface + verified mapping + examples. Add worked input/output code samples with migration notes. "The code was transformed with a high degree of accuracy for all the components." Prompt structure frozen; productionised.
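The final prompt structure can be sketched as an assembly of the three layers the experiments converged on. All names, the template wording, and the mapping entries besides the `medium` → `large` case quoted above are illustrative, not Zalando's actual prompt.

```python
# Hypothetical sketch of the Experiment-5 prompt layers: typed interface,
# manually verified attribute mapping, worked examples. Template wording
# is illustrative, not Zalando's actual prompt.

# Manually verified mapping: note the non-obvious size correction from
# Experiment 3 (source "medium" visually matches target "large").
VERIFIED_MAPPING = {
    ("size", "medium"): ("size", "large"),
    ("size", "small"): ("size", "medium"),   # illustrative entry
}

def build_migration_prompt(interface: str,
                           examples: list[tuple[str, str]],
                           source_file: str) -> str:
    mapping_lines = "\n".join(
        f'{sa}="{sv}" -> {ta}="{tv}"'
        for (sa, sv), (ta, tv) in VERIFIED_MAPPING.items()
    )
    example_lines = "\n\n".join(
        f"Input:\n{src}\nOutput:\n{dst}" for src, dst in examples
    )
    return (
        "Migrate this component to the target library.\n"
        f"Target interface:\n{interface}\n"
        f"Verified attribute mapping:\n{mapping_lines}\n"
        f"Worked examples:\n{example_lines}\n"
        f"Source file:\n{source_file}\n"
    )
```

Because the structure is frozen, only the per-component inputs (interface, examples, source file) vary at production time; the loop that discovered the structure never runs again.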

"Through this series of iterative experiments, we were able to finalize our approach."

Heuristics for methodology iteration

  1. Stay small. Zalando's sample set was "a set of sample UI components of varying complexity from simple buttons and to more complex Select components" — tractable enough to evaluate by eye per round.
  2. Single-variable iteration where possible. Each of Zalando's five experiments changes exactly one prompt layer. Changing two at once makes it ambiguous which change moved accuracy.
  3. Name the failure mode per round. Zalando's retrospective includes "why it failed" for every experiment. A hypothesis-driven iteration discovers structural prompt requirements faster than blind tuning.
  4. Stop when accuracy is "high" on the sample set, not when it's perfect. Residual errors become the post-migration manual-review workload and the prompt-regression test fixtures; you don't need to squeeze them out in the methodology phase.
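The four heuristics together describe a small offline loop: one single-variable experiment per round, scored on a fixed sample set, with a human-named failure mode logged before the next variant is authored. A minimal sketch, where `run_prompt` and `grade` are stand-in stubs for the model call and the by-eye evaluation:

```python
# Stubs: run_prompt stands in for the LLM call, grade for the
# by-eye correctness check on a small sample set.

def run_prompt(prompt: str, component: str) -> str:
    return f"{component} migrated via {prompt}"  # stub LLM call

def grade(output: str) -> bool:
    return "verified" in output                  # stub eyeball check

def evaluate_round(prompt: str, samples: list[str]) -> float:
    """Accuracy of one single-variable experiment on the sample set."""
    hits = sum(grade(run_prompt(prompt, s)) for s in samples)
    return hits / len(samples)

log: list[dict] = []

def record_round(name: str, prompt: str, samples: list[str],
                 failure_mode: str) -> None:
    # Naming the failure mode per round (heuristic 3) is what makes the
    # next human-authored variant hypothesis-driven rather than blind.
    log.append({"experiment": name,
                "accuracy": evaluate_round(prompt, samples),
                "failure_mode": failure_mode})
```

Stopping is a threshold check on the latest `accuracy` entry (heuristic 4), not a search for 1.0.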

When methodology iteration beats runtime refinement

  • Migration-scale batch jobs with a small set of component shapes: offline iteration amortises over thousands of transformations without paying per-request judge-LLM cost.
  • Deterministic correctness criteria (does this compile? does this pass the test?): the runtime judge doesn't add signal beyond what the compile-error / test-failure provides.
  • Finite target domain (a specific pair of libraries, a specific transformation): the prompt can be tuned to the domain and shipped; no need for runtime adaptability.

When runtime refinement beats methodology iteration

  • Open-ended generation where each request has a different goal (image generation, translation of arbitrary text): no single frozen prompt can anticipate every input.
  • Quality judgement that requires a judge (visual appeal, cultural adaptation): compile-errors are insufficient; a VLM / LLM judge has to weigh in per request.
