
Prompt iteration as offline methodology discovery

Pattern

When preparing a frozen prompt for a production LLM pipeline, run a small, hypothesis-driven sequence of experiments offline — each testing one prompt structure against a held-out sample, each one human-authored based on the previous experiment's failure mode. Ship the converged prompt as a frozen artefact; do not run a judge-LLM refinement loop at inference time.

Forces

  • Deterministic tasks don't benefit from runtime refinement. Code migration, SQL translation, schema mapping — tasks where correctness is binary — a judge LLM at inference time adds no signal beyond what compile errors, test failures, or the deterministic validator already provide.
  • Runtime loops are expensive. Every production call paying for N judge-LLM passes (plus N regenerations) is untenable for bulk jobs. Offline discovery amortises the iteration cost once.
  • Hypothesis-driven iteration converges faster than blind tuning. Each experiment names its failure mode and the next round targets it; 5–6 rounds is typically enough to reach the accuracy knee.

Mechanism

  1. Choose a tractable sample set. Zalando: "a set of sample UI components of varying complexity from simple buttons to more complex Select components". Criteria: varied enough to expose structural failure modes, small enough to evaluate by hand each round.
  2. Start with the simplest prompt. Raw source/target code; ask the LLM to migrate. Measure accuracy.
  3. Name the failure mode. "Why it failed." Is the LLM doing too many intermediate steps at once? Missing a class of information? Hallucinating?
  4. Design the next experiment to target the failure mode. Change one variable so the next round's accuracy change is attributable. Zalando's sequence changes exactly one layer per round (interface → auto-mapping → + verified mapping → + examples).
  5. Stop when accuracy reaches the knee on the sample set. The residual 10% becomes the post-migration manual-review bucket plus prompt-regression fixtures.
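The mechanism above can be sketched as a small offline harness. This is a minimal illustration, not Zalando's toolkit: `migrate` stands in for the actual LLM call, the sample set and round names are hypothetical, and the 90% knee threshold is taken from the "residual 10%" figure.

```python
from dataclasses import dataclass

@dataclass
class ExperimentRound:
    """One offline experiment: a single prompt variant plus its outcome."""
    name: str                # which single variable this round changes
    prompt_template: str
    accuracy: float = 0.0
    failure_mode: str = ""   # the named "why it failed" — drives the next round's design

def evaluate(prompt_template, samples, migrate):
    """Run one prompt variant over the held-out sample set; return fraction correct."""
    correct = sum(
        1 for s in samples
        if migrate(prompt_template.format(**s)) == s["expected"]
    )
    return correct / len(samples)

def offline_discovery(rounds, samples, migrate, knee=0.9):
    """Run the human-authored rounds in order; stop at the accuracy knee.

    The returned round's prompt_template is the frozen artefact to ship."""
    for r in rounds:
        r.accuracy = evaluate(r.prompt_template, samples, migrate)
        if r.accuracy >= knee:
            return r
    return rounds[-1]
```

The human stays in the loop between calls to `offline_discovery`: each `ExperimentRound` is authored after reading the previous round's `failure_mode`, not generated by a judge LLM.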

Contrast with runtime refinement pattern

patterns/drafter-evaluator-refinement-loop (Lyft AI localization, Instacart PIXEL) runs the same "generate → evaluate → refine" shape at inference time. The production instance pays the cost of N iterations per request. This pattern pays it once, during development.

|                                | Offline methodology discovery    | Runtime refinement loop          |
|--------------------------------|----------------------------------|----------------------------------|
| Iteration author               | Human                            | LLM judge                        |
| Iterations per production call | 0 (frozen prompt)                | up to N per call                 |
| Cost per call                  | 1×                               | up to N×                         |
| Applicable when                | Target is deterministic, bounded | Target is open-ended, per-input  |
| Failure discovery              | Named during development         | Continues in production          |
| Example                        | Zalando UI migration             | Instacart PIXEL image gen        |
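The cost contrast can be made concrete with toy arithmetic. This sketch assumes a judge pass costs roughly the same as a generation (so each refinement iteration adds one judge pass plus one regeneration, per the Forces section); the dollar figures are illustrative, not from either exemplar.

```python
def per_call_cost(base, judge_iterations=0):
    """Production-call cost: one generation, plus one judge pass and one
    regeneration per runtime refinement iteration.

    Assumes judge-pass cost ~= generation cost (base)."""
    return base * (1 + 2 * judge_iterations)

# Illustrative: one generation costs $0.01; the runtime loop allows up to 3 iterations.
frozen = per_call_cost(0.01)                             # frozen prompt: 1× base
runtime_worst = per_call_cost(0.01, judge_iterations=3)  # worst case: 7× base
```

For a bulk job of a million calls, that gap is the difference between paying the iteration cost once offline and paying it on every request.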

Production exemplar

Zalando's five-experiment arc (source-code-only → interface → interface+auto-mapping → interface+verified-mapping → +examples) converged on the Interface+Mapping+Examples composition in one internal hackathon, produced the frozen prompt, and shipped the Component Migration Toolkit (September 2024).

Heuristics

  1. One variable per round. Multi-variable rounds make attribution impossible.
  2. Log the failure mode. Without a named hypothesis, iteration becomes blind tuning.
  3. Stop at the knee. Perfection on the sample set isn't required; residual failures have downstream safeguards (human review + regression tests).
  4. Keep the sample set small. Big enough to expose structural issues, small enough to evaluate by hand.
  5. Preserve every round's artefacts. The failed prompts become the rationale documentation for why the final prompt looks as it does.
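Heuristics 3 and 5 imply a concrete artefact shape: the frozen prompt, its mapping and examples, and the residual failures kept as regression fixtures, all in one versioned blob. The JSON layout, field names, and `migrate` callable below are an assumed convention for illustration, not the actual toolkit's format.

```python
import json

# Hypothetical frozen artefact as it would be checked into git: the converged
# prompt plus the mapping/examples it depends on, and the residual sample-set
# failures preserved as known-answer regression fixtures.
FROZEN = json.dumps({
    "version": "v5-interface+mapping+examples",
    "prompt": "Migrate {source} using the verified mapping:\n{mapping}\n{examples}",
    "regression_fixtures": [
        # illustrative fixture, not a real Zalando component
        {"input": "Button", "expected": "NewButton"},
    ],
})

def check_regressions(artefact_json, migrate):
    """Re-run every fixture through the frozen prompt; fail the build on drift."""
    artefact = json.loads(artefact_json)
    return all(
        migrate(artefact["prompt"], f["input"]) == f["expected"]
        for f in artefact["regression_fixtures"]
    )
```

Running `check_regressions` in CI turns heuristic 5 into an enforced invariant: any edit to the frozen prompt that breaks a previously solved case is caught before shipping.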

Consequences

Positive:

  • Zero per-request refinement overhead. Production calls run a single frozen prompt.
  • Cheap to develop. A handful of experiments on a small sample set is within hackathon scope.
  • Output is a single versionable artefact. The prompt + mapping + examples can be checked into git, diffed, and reviewed like code.

Negative:

  • Doesn't adapt to novel inputs. Anything the sample set didn't cover is handled at the same quality level as the average case — the pattern can't steer harder on edge cases like a runtime loop can.
  • Model drift requires re-running methodology discovery. When the provider ships a new model, the prompt may need revisiting. Pinning model versions defers this but doesn't eliminate it.
  • Requires deterministic task shape. Open-ended generation benefits from a runtime loop instead.

Seen in
