LLM-generated prompt regression test¶
Definition¶
An LLM-generated prompt regression test is a CI-level test that runs the production LLM pipeline against a checked-in library of LLM-generated, human-verified example inputs and diffs the output against the expected output. Any divergence from the golden output signals prompt drift — a change in the prompt (or the model, or the pipeline) that altered a previously-stable transformation.
Because examples are already curated for the prompt itself (see concepts/prompt-interface-mapping-examples-composition), the same artefact does double duty as prompt content and regression test fixture.
Canonical Zalando framing¶
From sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries:
"During development, we discovered that small adjustments to transformation instructions could lead to substantial changes in results. This highlighted a need to have prompt validation tests in place and led us to implement automated testing using LLM-generated examples. These examples served as validation tools and regression tests, helping us catch unexpected changes during the migration process."
This is explicit recognition that prompts are code and need tests, articulated in operational language ("prompt validation tests", "regression tests") rather than hand-waving.
How it works¶
- Generate candidate examples with the LLM, covering each component × scenario combination for the migration.
- Human-verify — pair programmers + designers check the output is correct and visually equivalent.
- Check in both the input (source-library syntax) and the expected output (target-library syntax) to the repository.
- CI runs the production toolkit over every example on every change to:
- the prompt template
- the interface/mapping/examples layers
- the toolkit code
- (optionally) on a schedule, to catch provider-side model drift
- Diff the output against the expected. Any difference fails CI.
Why LLM-generated examples are the right fixture¶
- They already exist. The example library rides in the prompt for the model's benefit; reusing it as the test fixture requires no extra curation.
- Representative. Because they were chosen to teach the model, they cover the transformation's characteristic patterns.
- Human-verified baseline. The expected output is already correct — the examples wouldn't be in the prompt otherwise.
Prerequisites¶
- Temperature=0 (concepts/temperature-zero-for-deterministic-codegen) is load-bearing. Without deterministic decoding, the diff flakes and the tests become noise.
- Pinned model version. Pin gpt-4o-2024-08-06 rather than gpt-4o; otherwise provider rollouts silently change golden outputs.
- Pinned prompt template. Version the prompt so one-off changes don't cascade into test failures across the corpus.
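Taken together, the prerequisites amount to freezing every knob that could change the output. A sketch of such a pinned configuration, with illustrative values (the seed, template version, and dict shape are assumptions, not the toolkit's real settings):

```python
# Pinned decoding configuration implied by the prerequisites above.
# All concrete values are illustrative.
PINNED_MODEL = "gpt-4o-2024-08-06"  # a dated snapshot, never the floating "gpt-4o" alias

REQUEST_PARAMS = {
    "model": PINNED_MODEL,
    "temperature": 0,  # deterministic decoding: without it the diff flakes
    "seed": 42,        # extra reproducibility where the provider supports it
}

# Version the prompt template so golden-output updates are deliberate,
# reviewed changes rather than silent cascades.
PROMPT_TEMPLATE_VERSION = "v3"
```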
Tradeoffs¶
- Cost. Every CI run re-runs the LLM on every example. Prompt caching amortises most of this, but the output tokens still cost money.
- False sensitivity to whitespace / formatting. A model-side decision to wrap at column 80 instead of 78 fails the diff; most teams add a normalisation pass before comparison.
- Selection bias. If the example generator samples the same subspace as the test coverage, blind spots persist — a missing edge case in the prompt is a missing edge case in the test. Supplementing with hand-authored adversarial cases is wise.
- Only catches drift, not ground-truth errors. If the golden output is itself wrong, the regression test happily confirms the wrong answer forever.
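The normalisation pass mentioned under false sensitivity might look like the following. This sketch only strips trailing whitespace and collapses blank-line runs; a genuine wrap-column difference (80 vs 78) would instead need both sides run through a real formatter such as Prettier before comparison.

```python
# Sketch of a pre-diff normalisation pass: remove cosmetic whitespace
# differences so only substantive changes fail the test.
import re

def normalise(code: str) -> str:
    # Drop trailing whitespace on each line and surrounding blank lines.
    lines = [line.rstrip() for line in code.strip().splitlines()]
    text = "\n".join(lines)
    # Collapse runs of 3+ newlines so blank-line counts never diff.
    return re.sub(r"\n{3,}", "\n\n", text)

def drifted(actual: str, expected: str) -> bool:
    """True only when the outputs differ after normalisation."""
    return normalise(actual) != normalise(expected)
```

Usage: `drifted(pipeline_output, golden_output)` replaces the raw string comparison in the CI diff step.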
Relation to sibling primitives¶
- vs "vibes-based" prompt iteration. Without these tests, prompt engineering becomes guess-and-check by eye. Small tweaks ship with no visibility into what else they broke. Zalando's "small adjustments … could lead to substantial changes" observation is the canonical articulation of this failure mode.
- vs LLM-as-judge evaluation. LLM-as-judge (concepts/llm-as-judge) is for open-ended quality measurement (how good is this output?); this concept is for byte-exact regression detection (did this output change?). Complementary, not alternatives.
Seen in¶
- sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries — canonical and only wiki instance.
Related¶
- concepts/temperature-zero-for-deterministic-codegen — the prerequisite reproducibility setting
- concepts/prompt-interface-mapping-examples-composition — the examples that double as test fixtures
- patterns/llm-only-code-migration-pipeline — the pattern this CI discipline protects
- systems/zalando-component-migration-toolkit — the production tool the tests run against
- companies/zalando