
LLM-generated prompt regression test

Definition

An LLM-generated prompt regression test is a CI-level test that runs the production LLM pipeline against a checked-in library of LLM-generated, human-verified example inputs and diffs the output against the expected output. Any divergence from the golden output signals prompt drift — a change in the prompt (or the model, or the pipeline) that altered a previously-stable transformation.

Because examples are already curated for the prompt itself (see concepts/prompt-interface-mapping-examples-composition), the same artefact does double duty as prompt content and regression test fixture.

Canonical Zalando framing

From sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries:

"During development, we discovered that small adjustments to transformation instructions could lead to substantial changes in results. This highlighted a need to have prompt validation tests in place and led us to implement automated testing using LLM-generated examples. These examples served as validation tools and regression tests, helping us catch unexpected changes during the migration process."

This is explicit recognition that prompts are code and need tests — stated in operational terms ("prompt validation tests", "regression tests") rather than hand-waved.

How it works

  1. Generate candidate examples with the LLM, covering each component × scenario combination for the migration.
  2. Human-verify — pair programmers + designers check the output is correct and visually equivalent.
  3. Check in both the input (source-library syntax) and the expected output (target-library syntax) to the repository.
  4. CI runs the production toolkit over every example on every change to:
       • the prompt template
       • the interface/mapping/examples layers
       • the toolkit code
       • (optionally) on a schedule as well, to catch provider-side model drift
  5. Diff the output against the expected output. Any difference fails CI.
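The loop in steps 4–5 can be sketched as follows. This is a minimal illustration, not Zalando's actual toolkit: `pipeline` stands in for the production LLM call, and the example library is shown in memory, where in practice the (input, expected) pairs would be loaded from checked-in files.

```python
# Minimal sketch of the regression loop: run the pipeline over every
# human-verified example and collect the cases whose output diverged
# from the golden output (i.e. prompt drift).

def run_regression(pipeline, examples):
    """Diff pipeline output against golden output for every example.

    `examples` maps a case name to an (input, expected_output) pair.
    Returns the names of cases that diverged; an empty list means
    no drift was detected.
    """
    failures = []
    for name, (source, expected) in sorted(examples.items()):
        actual = pipeline(source)
        if actual != expected:
            failures.append(name)
    return failures
```

In CI this reduces to "fail the build if `run_regression(...)` returns a non-empty list"; each failing case name points at the example whose transformation changed.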

Why LLM-generated examples are the right fixture

  • They already exist. The example library rides in the prompt for the model's benefit; reusing it as the test fixture requires no extra curation.
  • Representative. Because they were chosen to teach the model, they cover the transformation's characteristic patterns.
  • Human-verified baseline. The expected output is already correct — the examples wouldn't be in the prompt otherwise.

Prerequisites

  • Temperature=0 (concepts/temperature-zero-for-deterministic-codegen) is load-bearing. Without deterministic decoding, the diff flakes and the tests become noise.
  • Pinned model version. gpt-4o-2024-08-06 rather than gpt-4o; otherwise provider rollouts silently change golden outputs.
  • Pinned prompt template. Version the prompt so one-off changes don't cascade into test failures across the corpus.
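The three pins above can be made concrete in a request builder. This is a hedged sketch assuming an OpenAI-style chat request shape; `pinned_request` and the `PROMPT_VERSION` convention are hypothetical, not part of the cited source.

```python
# Illustrative only: every source of drift is pinned in one place —
# model snapshot, decoding temperature, and prompt-template version.

PROMPT_VERSION = "v12"  # assumed convention: bump when the template changes,
                        # and regenerate/re-verify the golden outputs with it

def pinned_request(prompt_template: str, source: str) -> dict:
    """Build a chat request dict with every source of drift pinned down."""
    return {
        "model": "gpt-4o-2024-08-06",  # dated snapshot, not the floating "gpt-4o" alias
        "temperature": 0,              # deterministic decoding keeps the diff stable
        "messages": [
            {"role": "system", "content": f"[{PROMPT_VERSION}] {prompt_template}"},
            {"role": "user", "content": source},
        ],
    }
```

The dict would be spread into the provider call; the point is that none of the three values is allowed to float implicitly.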

Tradeoffs

  • Cost. Every CI run re-runs the LLM on every example. Prompt caching amortises most of this, but the output tokens still cost money.
  • False sensitivity to whitespace / formatting. A model-side decision to wrap at column 80 instead of 78 fails the diff; most teams add a normalisation pass before comparison.
  • Selection bias. If the example generator samples the same subspace as the test coverage, blind spots persist — a missing edge case in the prompt is a missing edge case in the test. Supplementing with hand-authored adversarial cases is wise.
  • Only catches drift, not ground-truth errors. If the golden output is itself wrong, the regression test happily confirms the wrong answer forever.
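The normalisation pass mentioned under false sensitivity can be very small. A sketch, with the caveat that what counts as "formatting-only" is a per-target-language judgment call:

```python
# Strip formatting-only differences (trailing whitespace, runs of
# blank lines) before the byte-exact comparison, so a model-side
# reflow doesn't fail the diff.
import re

def normalise(code: str) -> str:
    lines = [line.rstrip() for line in code.strip().splitlines()]
    out = "\n".join(lines)
    out = re.sub(r"\n{3,}", "\n\n", out)  # collapse runs of blank lines
    return out

def outputs_match(actual: str, expected: str) -> bool:
    return normalise(actual) == normalise(expected)
```

Anything beyond this (e.g. running both sides through a code formatter such as Prettier before diffing) trades a little CI time for much less flakiness, at the cost of masking genuine formatting regressions.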

Relation to sibling primitives

  • vs "vibes-based" prompt iteration. Without these tests, prompt engineering becomes guess-and-check by eye. Small tweaks ship with no visibility into what else they broke. Zalando's "small adjustments … could lead to substantial changes" observation is the canonical articulation of this failure mode.
  • vs LLM-as-judge evaluation. LLM-as-judge (concepts/llm-as-judge) is for open-ended quality measurement (how good is this output?); this concept is for byte-exact regression detection (did this output change?). Complementary, not alternatives.
