CONCEPT Cited by 1 source

Prompt optimisation feedback loop

A prompt optimisation feedback loop is a production pattern in which LLM prompts are continuously edited based on real production outputs, with domain-specific examples promoted into the prompt's few-shot / multi-shot context as the system observes which shots actually correlate with high-accuracy outputs. Accuracy compounds over time as the prompt accumulates well-chosen examples; the loop is not model fine-tuning — it operates entirely at the prompt layer.

Canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws):

"Through few-shot and multi-shot prompting, the platform continuously edits the prompt on domain-specific examples and based on real outputs, creating a feedback loop that compounds accuracy improvements over time."

Structural pieces

  1. Per-class prompt template. The pipeline detects each document's file class first; a prompt template is selected per class (rather than one global prompt for all documents).
  2. Few-shot / multi-shot examples in the prompt. The prompt carries domain-specific examples — labelled input/output pairs that demonstrate the desired extraction behaviour for that file class.
  3. Production output observation. Real LLM outputs are scored (against gold labels, against downstream-system feedback, against human review on sampled cases, or against LLM-judge evaluation).
  4. Continuous prompt editing. Examples that correlate with high-accuracy outputs are promoted into the prompt; examples that correlate with low-accuracy outputs are removed or replaced; new edge cases observed in production are added as new shots.
  5. Compounding accuracy. When every prompt edit is gated by an evaluation step, a new prompt version is promoted only if it scores at least as well as the current one, so accuracy is non-decreasing in expectation (any single iteration is still subject to evaluation noise).
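
The five pieces above can be sketched as a small data structure. This is a minimal illustration, not Doczy.ai's implementation, which is not disclosed: the names `Shot`, `ClassPrompt`, `record` and `edit`, the neutral prior of 0.5 for unobserved shots, and the rule of attributing a run's accuracy to every shot in the prompt are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Shot:
    """One labelled input/output pair for a file class (hypothetical structure)."""
    doc_text: str
    expected_output: str
    scores: list[float] = field(default_factory=list)  # accuracy of runs that used this shot

    def mean_score(self) -> float:
        # Assumption: a shot with no observations yet gets a neutral prior of 0.5.
        return mean(self.scores) if self.scores else 0.5


@dataclass
class ClassPrompt:
    """Prompt template plus example bank for one file class (pieces 1 and 2)."""
    file_class: str
    instructions: str
    shots: list[Shot] = field(default_factory=list)

    def render(self, doc_text: str) -> str:
        # Instructions first, then each shot, then the document to extract from.
        parts = [self.instructions]
        for shot in self.shots:
            parts.append(f"Input:\n{shot.doc_text}\nOutput:\n{shot.expected_output}")
        parts.append(f"Input:\n{doc_text}\nOutput:")
        return "\n\n".join(parts)

    def record(self, accuracy: float) -> None:
        # Piece 3: attribute the scored run to every shot that was in the prompt.
        for shot in self.shots:
            shot.scores.append(accuracy)

    def edit(self, candidates: list[Shot], max_shots: int, min_score: float) -> None:
        # Piece 4: drop low scorers, keep the best, fill free slots with new edge cases.
        kept = [s for s in self.shots if s.mean_score() >= min_score]
        kept.sort(key=Shot.mean_score, reverse=True)
        self.shots = (kept + candidates)[:max_shots]
```

In use, a pipeline would hold one `ClassPrompt` per detected file class (for example in a dict keyed by class name), call `render` for each incoming document, call `record` once the output has been scored, and call `edit` between runs. Piece 5 depends on an evaluation gate before an edited prompt is promoted, which this sketch leaves out.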

Sibling concepts on the wiki

| Concept | Source | What's optimised | How |
|---|---|---|---|
| Prompt optimisation feedback loop | Doczy.ai (2026-06-02) | Per-class prompt templates | Continuous edit on production outputs |
| concepts/agent-self-correction-loop | Databricks Genie (2026-05-08) | Per-trajectory agent decisions | Self-correction within a single run |
| systems/gepa-prompt-optimizer | Databricks Genie (2026-05-08) | Prompt at training time | Genetic / evolutionary prompt search |
| concepts/few-shot-prompt-template | Multiple sources | Static few-shot prompt | Set once, not edited |
| patterns/llm-judge-as-inline-pipeline-stage | Databricks Unlocking Archives (2026-05-11) | Output quality control | LLM evaluates LLM at inference time |

This concept is distinct from agent self-correction — Doczy.ai's loop happens between extraction runs (the prompt observed by run N+1 differs from run N), whereas an agent self-correction loop happens within a single run (the agent revises its own answer mid-trajectory).

It's distinct from GEPA in that GEPA is a training-time prompt-search algorithm; Doczy.ai's loop runs continuously in production and edits prompts based on production observations, not on a held-out training set.

It's distinct from fine-tuning in that the model weights never change; only the prompt context does.

What "compounds over time" means

Accuracy can compound, in expectation, when each iteration is gated by an evaluation step and good examples persist across iterations:

  • A new candidate prompt is only promoted into production if it scores at least as well as the current prompt on the most recent eval window.
  • The prompt's example set is append-and-replace, not replace-only — well-performing examples are retained across iterations.
  • Production traffic continually surfaces new edge cases that the current example set does not cover; the loop captures them as new shots.
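
The promotion gate in the first bullet can be written as a short function. This is a sketch under stated assumptions: `run` stands in for whatever calls the LLM with a prompt and a document, exact-match scoring is used only to keep the example small, and `EvalCase`, `window_score` and `gate` are illustrative names rather than anything from the Doczy.ai disclosure.

```python
from typing import Callable, Sequence

EvalCase = tuple[str, str]  # (document text, gold output)


def window_score(prompt: str, window: Sequence[EvalCase],
                 run: Callable[[str, str], str]) -> float:
    """Fraction of eval-window cases where the model output equals the gold output."""
    hits = [1.0 if run(prompt, doc) == gold else 0.0 for doc, gold in window]
    return sum(hits) / len(hits)


def gate(current: str, candidate: str, window: Sequence[EvalCase],
         run: Callable[[str, str], str]) -> str:
    """Return the prompt that should serve production next."""
    if not window:
        return current  # no evidence, so keep what is already in production
    if window_score(candidate, window, run) >= window_score(current, window, run):
        return candidate  # at least as good on the most recent window: promote
    return current
```

Because the candidate is only ever swapped in when it ties or beats the incumbent on the same window, the served prompt never gets worse on that window. Whether it gets worse on older traffic is a separate question that this gate does not answer.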

The verbatim disclosure ("compounds accuracy improvements over time") suggests a long time horizon — Doczy.ai's 22-month production envelope had time for many such iterations.

Required substrate

  • File-class detection — without per-class routing, a single global prompt averages over all classes and the loop can't make per-class progress.
  • Production output capture — the loop needs to see real outputs. In Doczy.ai's case the Snowflake structured-data sink and the dashboards built over it provide the observation surface.
  • Evaluation signal — gold-label, downstream-feedback, human-review, or LLM-judge.
  • Domain expertise: "AArete's team of experts will configure this solution". The examples themselves come from domain-knowledgeable humans curating the prompt's example bank.
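
As an illustration of the evaluation-signal requirement, the sketch below computes a gold-label accuracy per file class from captured outputs. The record shape `(file_class, predicted_fields, gold_fields)` and the exact-match field comparison are assumptions for the example; the disclosure does not say what signal Doczy.ai uses per file class.

```python
from collections import defaultdict
from typing import Iterable


def field_accuracy(predicted: dict, gold: dict) -> float:
    """Share of gold fields whose predicted value matches exactly."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return hits / len(gold)


def accuracy_by_class(records: Iterable[tuple[str, dict, dict]]) -> dict[str, float]:
    """Average field accuracy per file class over captured production outputs."""
    per_class = defaultdict(list)
    for file_class, predicted, gold in records:
        per_class[file_class].append(field_accuracy(predicted, gold))
    return {cls: sum(vals) / len(vals) for cls, vals in per_class.items()}
```

Keeping the score segmented by file class is what lets the loop edit one class's prompt without averaging its signal against the others.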

When to apply

  • Document-extraction pipelines where the document class space is finite and well-bounded (contracts, claims, regulatory filings).
  • Production volume high enough that each file class yields a usable eval window per iteration, so the loop can run enough iterations to reach a high-accuracy regime.
  • Domains with available evaluation signal (gold labels, feedback from downstream systems, expert review).

When not to apply

  • Open-ended generation tasks where there's no objective accuracy signal.
  • Pipelines without clear file classes (the loop has nothing to segment by).
  • Cases where model fine-tuning is more cost-effective than prompt-layer iteration.

Risks

  • Prompt drift. As the prompt accumulates examples it can over-fit to recent traffic; each candidate prompt should be regression-tested against historical eval sets, not only the most recent window.
  • Reward hacking. If the eval signal is biased (e.g. optimising for whatever the LLM judge prefers), the loop will converge to that bias.
  • Example library size limits. Prompts have token-budget ceilings; the loop must select among examples, not just accumulate.
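
Two of these risks lend themselves to simple guards, sketched below under stated assumptions. `estimate_tokens` is a crude whitespace count standing in for the model's real tokenizer, the greedy highest-score-first selection is one possible policy among many, and all function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for token count.
    return len(text.split())


def select_shots(bank: list[tuple[str, float]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring examples that still fit the token budget."""
    chosen, used = [], 0
    for text, _score in sorted(bank, key=lambda item: item[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen


def passes_regression(candidate: dict[str, float], baseline: dict[str, float],
                      tolerance: float = 0.0) -> bool:
    """True if the candidate is within tolerance of the baseline on every eval set."""
    return all(candidate.get(name, 0.0) >= score - tolerance
               for name, score in baseline.items())
```

`select_shots` addresses the token-budget ceiling, and `passes_regression` addresses prompt drift by checking a candidate against every historical eval set rather than only the latest one. Neither guard helps with a biased eval signal, which has to be handled by how the signal itself is built.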

Caveats

The Doczy.ai disclosure does not describe how the prompt-edit mechanism works in detail (manual curation by experts vs automated selection vs hybrid; how the system decides which shots to swap in/out; what the eval signal is per file class). The wiki captures the disclosed shape of the pattern; mechanism details are AArete IP.

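Seen in

  • sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws: AArete's Doczy.ai disclosure, the canonical instance of this concept on the wiki.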
