PATTERN Cited by 1 source
# Prompt iteration as offline methodology discovery
## Pattern
When preparing a frozen prompt for a production LLM pipeline, run a small, hypothesis-driven sequence of experiments offline: each experiment tests one prompt structure against a held-out sample, and each is human-authored in response to the previous experiment's failure mode. Ship the converged prompt as a frozen artefact; do not run a judge-LLM refinement loop at inference time.
## Forces
- Deterministic tasks don't benefit from runtime refinement. For code migration, SQL translation, or schema mapping, where correctness is binary, a judge LLM at inference time adds no signal beyond what compile errors, test failures, and the deterministic validator already provide.
- Runtime loops are expensive. Every production call paying for N judge-LLM passes (plus N regenerations) is untenable for bulk jobs. Offline discovery amortises the iteration cost once.
- Hypothesis-driven iteration converges faster than blind tuning. Each experiment names its failure mode and the next round targets it; 5–6 rounds is typically enough to reach the accuracy knee.
## Mechanism
- Choose a tractable sample set. Zalando: "a set of sample UI components of varying complexity from simple buttons to more complex Select components". Criteria: varied enough to expose structural failure modes, small enough to evaluate by hand per round.
- Start with the simplest prompt. Raw source/target code; ask the LLM to migrate. Measure accuracy.
- Name the failure mode. "Why it failed." Is the LLM doing too many intermediate steps at once? Missing a class of information? Hallucinating?
- Design the next experiment to target the failure mode. Change one variable so the next round's accuracy change is attributable. Zalando's sequence changes exactly one layer per round (interface → + auto-mapping → + verified mapping → + examples).
- Stop when accuracy reaches the knee on the sample set. The residual 10% becomes the post-migration manual-review bucket plus prompt-regression fixtures.
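The round loop above can be sketched as a small offline harness (a sketch with hypothetical names; `migrate` and `validate` stand in for the LLM call and the deterministic validator):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str        # the single variable this round changes
    hypothesis: str  # the named failure mode from the previous round
    build_prompt: Callable[[dict], str]

def run_round(experiment, samples, migrate, validate):
    """One offline round: migrate every sample, score with the deterministic
    validator, and return accuracy plus the failures to inspect by hand."""
    failures = []
    for sample in samples:
        output = migrate(experiment.build_prompt(sample))
        if not validate(sample, output):
            failures.append((sample, output))
    accuracy = 1 - len(failures) / len(samples)
    return accuracy, failures
```

Each round logs its hypothesis; the next round's `Experiment` is authored against the failure mode found in `failures`.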
## Contrast with runtime refinement pattern
patterns/drafter-evaluator-refinement-loop (Lyft AI localization, Instacart PIXEL) runs the same "generate → evaluate → refine" shape at inference time. The production instance pays the cost of N iterations per request. This pattern pays it once, during development.
| | Offline methodology discovery | Runtime refinement loop |
|---|---|---|
| Iteration author | Human | LLM judge |
| Iterations per production call | 0 (frozen prompt) | up to N per call |
| Cost per call | 1× | up to N× |
| Applicable when | Target is deterministic, bounded | Target is open-ended, per-input |
| Failure discovery | Named during development | Continues in production |
| Example | Zalando UI migration | Instacart PIXEL image gen |
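The cost row can be made concrete with a back-of-envelope model (hypothetical prices; this assumes a runtime loop pays one judge pass plus one regeneration per iteration):

```python
def total_llm_cost(requests: int, cost_per_call: float,
                   refinement_iterations: int = 0) -> float:
    """Each request pays one generation plus, for a runtime loop, up to
    N judge+regenerate pairs (two extra LLM calls per iteration)."""
    calls_per_request = 1 + 2 * refinement_iterations
    return requests * cost_per_call * calls_per_request

# Hypothetical bulk job: 10,000 migrations at $0.01 per LLM call.
offline = total_llm_cost(10_000, 0.01)                           # frozen prompt
runtime = total_llm_cost(10_000, 0.01, refinement_iterations=3)  # worst-case loop
```

With these illustrative numbers the runtime loop costs 7× the frozen prompt at inference time, while the offline pattern pays its iteration cost once during development.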
## Production exemplar
Zalando's five-experiment arc (source-code-only → interface → interface+auto-mapping → interface+verified-mapping → +examples) converged on the Interface+Mapping+Examples composition in one internal hackathon, produced the frozen prompt, and shipped the Component Migration Toolkit (September 2024).
## Heuristics
- One variable per round. Multi-variable rounds make attribution impossible.
- Log the failure mode. Without a named hypothesis, iteration becomes blind tuning.
- Stop at the knee. Perfection on the sample set isn't required; residual failures have downstream safeguards (human review + regression tests).
- Keep the sample set small. Big enough to expose structural issues, small enough to evaluate by hand.
- Preserve every round's artefacts. The failed prompts become the rationale documentation for why the final prompt looks as it does.
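One way to make the frozen artefact a versionable, reviewable file (a sketch; the field names are illustrative, not Zalando's actual schema):

```python
import json

frozen_prompt = {
    "version": "2024-09-r5",           # bumped by each discovery round
    "model": "provider-model@pinned",  # pin the model version to defer drift
    "interface": "<target component interface source>",
    "verified_mapping": {"oldProp": "newProp"},
    "examples": ["<before/after migration pair>"],
    "rationale": "round 5: examples fixed the prop-hallucination failure mode",
}

# Deterministic serialisation keeps git diffs minimal and reviewable.
artefact = json.dumps(frozen_prompt, indent=2, sort_keys=True)
```

The `rationale` field is where each round's named failure mode survives as documentation of why the final prompt looks the way it does.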
## Consequences
Positive:
- Zero per-request refinement overhead. Production calls run a single frozen prompt.
- Cheap to develop. A handful of experiments on a small sample set is within hackathon scope.
- Output is a single versionable artefact. The prompt + mapping + examples can be checked into git, diffed, and reviewed like code.
Negative:
- Doesn't adapt to novel inputs. Anything the sample set didn't cover is handled at the same quality level as the average case — the pattern can't steer harder on edge cases like a runtime loop can.
- Model drift requires re-running methodology discovery. When the provider ships a new model, the prompt may need revisiting. Pinning model versions defers this but doesn't eliminate it.
- Requires deterministic task shape. Open-ended generation benefits from a runtime loop instead.
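The prompt-regression fixtures mentioned above can be sketched as pinned verdicts re-checked after any model or prompt change (hypothetical names; `migrate` and `validate` again stand in for the pipeline and its deterministic validator):

```python
# Each fixture pins a sample input and the validator verdict the frozen
# prompt achieved at freeze time; the known residual case is expected to
# fail and land in the manual-review bucket.
FIXTURES = [
    {"input": "<simple button source>",  "must_pass": True},
    {"input": "<complex Select source>", "must_pass": True},
    {"input": "<known residual case>",   "must_pass": False},
]

def check_regressions(migrate, validate):
    """Return the fixtures whose verdict changed since the prompt was frozen."""
    regressions = []
    for fx in FIXTURES:
        passed = validate(fx["input"], migrate(fx["input"]))
        if passed != fx["must_pass"]:
            regressions.append(fx)
    return regressions
```

Running this after a provider ships a new model turns "may need revisiting" into a concrete signal: any non-empty result means methodology discovery has to be re-run.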
## Seen in
- sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries — canonical production instantiation. Five offline experiments, Interface+Mapping+Examples frozen, toolkit shipped.
## Related
- concepts/iterative-prompt-methodology — the concept this pattern anchors
- concepts/iterative-prompt-refinement — the runtime-loop sibling
- concepts/prompt-interface-mapping-examples-composition — the artefact Zalando's discovery produced
- patterns/llm-only-code-migration-pipeline — the production pattern this methodology prepares prompts for
- patterns/drafter-evaluator-refinement-loop — the runtime-loop pattern this one deliberately avoids
- companies/zalando