
LLM-generated prompt regression test

Definition

An LLM-generated prompt regression test is a CI-level test that runs the production LLM pipeline against a checked-in library of LLM-generated, human-verified example inputs and diffs the output against the expected output. Any divergence from the golden output signals prompt drift — a change in the prompt (or the model, or the pipeline) that altered a previously-stable transformation.

Because examples are already curated for the prompt itself (see concepts/prompt-interface-mapping-examples-composition), the same artefact does double duty as prompt content and regression test fixture.

Canonical Zalando framing

From sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries:

"During development, we discovered that small adjustments to transformation instructions could lead to substantial changes in results. This highlighted a need to have prompt validation tests in place and led us to implement automated testing using LLM-generated examples. These examples served as validation tools and regression tests, helping us catch unexpected changes during the migration process."

This is explicit recognition that prompts are code and need tests — stated in operational terms ("prompt validation tests", "regression tests") rather than hand-waved.

How it works

  1. Generate candidate examples with the LLM, covering each component × scenario combination for the migration.
  2. Human-verify — pair programmers + designers check the output is correct and visually equivalent.
  3. Check in both the input (source-library syntax) and the expected output (target-library syntax) to the repository.
  4. CI runs the production toolkit over every example on every change to:
       • the prompt template
       • the interface/mapping/examples layers
       • the toolkit code
       • (optionally) on a schedule as well, to catch provider-side model drift
  5. Diff the output against the expected output. Any difference fails CI.
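The loop in steps 4–5 can be sketched as follows. This is a minimal illustration, not Zalando's actual toolkit: `pipeline` stands in for the production LLM call, and the example library is shown in memory, where in practice the (input, expected) pairs would be loaded from checked-in files.

```python
# Minimal sketch of the regression loop: run the pipeline over every
# human-verified example and collect the cases whose output diverged
# from the golden output (i.e. prompt drift).

def run_regression(pipeline, examples):
    """Diff pipeline output against golden output for every example.

    `examples` maps a case name to an (input, expected_output) pair.
    Returns the names of cases that diverged; an empty list means
    no drift was detected.
    """
    failures = []
    for name, (source, expected) in sorted(examples.items()):
        actual = pipeline(source)
        if actual != expected:
            failures.append(name)
    return failures
```

In CI this reduces to "fail the build if `run_regression(...)` returns a non-empty list"; each failing case name points at the example whose transformation changed.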

Why LLM-generated examples are the right fixture

  • They already exist. The example library rides in the prompt for the model's benefit; reusing it as the test fixture requires no extra curation.
  • Representative. Because they were chosen to teach the model, they cover the transformation's characteristic patterns.
  • Human-verified baseline. The expected output is already correct — the examples wouldn't be in the prompt otherwise.

Prerequisites

  • Temperature=0 (concepts/temperature-zero-for-deterministic-codegen) is load-bearing. Without deterministic decoding, the diff flakes and the tests become noise.
  • Pinned model version. gpt-4o-2024-08-06 rather than gpt-4o; otherwise provider rollouts silently change golden outputs.
  • Pinned prompt template. Version the prompt so one-off changes don't cascade into test failures across the corpus.
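The three pins above can be made concrete in a request builder. This is a hedged sketch assuming an OpenAI-style chat request shape; `pinned_request` and the `PROMPT_VERSION` convention are hypothetical, not part of the cited source.

```python
# Illustrative only: every source of drift is pinned in one place —
# model snapshot, decoding temperature, and prompt-template version.

PROMPT_VERSION = "v12"  # assumed convention: bump when the template changes,
                        # and regenerate/re-verify the golden outputs with it

def pinned_request(prompt_template: str, source: str) -> dict:
    """Build a chat request dict with every source of drift pinned down."""
    return {
        "model": "gpt-4o-2024-08-06",  # dated snapshot, not the floating "gpt-4o" alias
        "temperature": 0,              # deterministic decoding keeps the diff stable
        "messages": [
            {"role": "system", "content": f"[{PROMPT_VERSION}] {prompt_template}"},
            {"role": "user", "content": source},
        ],
    }
```

The dict would be spread into the provider call; the point is that none of the three values is allowed to float implicitly.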

Tradeoffs

  • Cost. Every CI run re-runs the LLM on every example. Prompt caching amortises most of this, but the output tokens still cost money.
  • False sensitivity to whitespace / formatting. A model-side decision to wrap at column 80 instead of 78 fails the diff; most teams add a normalisation pass before comparison.
  • Selection bias. If the example generator samples the same subspace as the test coverage, blind spots persist — a missing edge case in the prompt is a missing edge case in the test. Supplementing with hand-authored adversarial cases is wise.
  • Only catches drift, not ground-truth errors. If the golden output is itself wrong, the regression test happily confirms the wrong answer forever.
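The normalisation pass mentioned under false sensitivity can be very small. A sketch, with the caveat that what counts as "formatting-only" is a per-target-language judgment call:

```python
# Strip formatting-only differences (trailing whitespace, runs of
# blank lines) before the byte-exact comparison, so a model-side
# reflow doesn't fail the diff.
import re

def normalise(code: str) -> str:
    lines = [line.rstrip() for line in code.strip().splitlines()]
    out = "\n".join(lines)
    out = re.sub(r"\n{3,}", "\n\n", out)  # collapse runs of blank lines
    return out

def outputs_match(actual: str, expected: str) -> bool:
    return normalise(actual) == normalise(expected)
```

Anything beyond this (e.g. running both sides through a code formatter such as Prettier before diffing) trades a little CI time for much less flakiness, at the cost of masking genuine formatting regressions.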

Relation to sibling primitives

  • vs "vibes-based" prompt iteration. Without these tests, prompt engineering becomes guess-and-check by eye. Small tweaks ship with no visibility into what else they broke. Zalando's "small adjustments … could lead to substantial changes" observation is the canonical articulation of this failure mode.
  • vs LLM-as-judge evaluation. LLM-as-judge (concepts/llm-as-judge) is for open-ended quality measurement (how good is this output?); this concept is for byte-exact regression detection (did this output change?). Complementary, not alternatives.
