PATTERN Cited by 3 sources
Prompt optimizer flywheel¶
Prompt optimizer flywheel is the pattern of closing a feedback loop between an LLM judge, a structured representation of judge-vs-human disagreements, and a prompt optimizer (e.g. DSPy) — so that each iteration both reduces disagreements and produces improved prompts automatically.
Intent¶
Hand-tuning prompts doesn't scale. At Dash:
- ~30 distinct prompts across ingest / LLM-as-judge / offline evals / the online agentic platform.
- 5–15 engineers tweaking these prompts at any time.
- Changes feel like "whack-a-mole": fix edge case → break something else.
- Adding or switching models requires re-tuning every prompt (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash).
The flywheel replaces hand-scripted prompt tuning with an optimization loop, driven by a prompt optimizer, whose objective is a measurable quality signal (judge disagreements, NDCG, etc.).
Mechanism (Dash realization)¶
- Score outputs with an LLM judge. Each judge score carries a rubric rationale. Disagreements against a human-labeled gold set are the measurable quality gap (see concepts/llm-as-judge).
- Structure disagreements as bullet points. Rather than handing DSPy raw prompts, Dash gives DSPy a bulleted list of specific disagreements between judge and human. This is the emergent insight — "we noticed we could create bullet points with the different disagreements and then have DSPy try to optimize the bullets themselves."
- DSPy optimizes to reduce the bullet set. The objective is minimizing the number of disagreements; DSPy searches the prompt space (wording, few-shot examples, chain-of-thought structure) to shrink the disagreement list.
- Re-run the judge. New prompt produces new outputs → new judge scores → new disagreement bullets → new DSPy input → new prompt. The loop runs offline, not per-query.
- Emergent behavior. "We started to create this really nice flywheel and ended up getting some nice results." The loop improves faster than either hand tuning or DSPy on raw prompts.
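The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not Dash's code: `run_judge` and `optimize` stand in for the LLM judge and the DSPy optimizer, and every name here is a hypothetical.

```python
def disagreement_bullets(judge_scores, gold_scores, rationales):
    """Turn judge-vs-human disagreements into the bullet list the optimizer targets."""
    bullets = []
    for ex_id, judged in judge_scores.items():
        gold = gold_scores[ex_id]
        if judged != gold:
            bullets.append(
                f"- example {ex_id}: judge said {judged}, human said {gold}; "
                f"judge rationale: {rationales[ex_id]}"
            )
    return bullets


def flywheel_step(prompt, gold_scores, run_judge, optimize):
    """One offline iteration: score outputs -> disagreement bullets -> optimized prompt."""
    judge_scores, rationales = run_judge(prompt)
    bullets = disagreement_bullets(judge_scores, gold_scores, rationales)
    if not bullets:  # converged, or plateaued at the human-disagreement floor
        return prompt, bullets
    return optimize(prompt, bullets), bullets
```

In the real loop, `run_judge` is an LLM call per gold-set example and `optimize` is a DSPy compile run; iterating `flywheel_step` offline is the flywheel.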
Why bullet-point disagreements specifically¶
Raw prompt → DSPy tries to improve the prompt wording in general — the optimizer has no anchor for what specifically is wrong.
Bullet-point disagreements → DSPy has a target: each bullet is a failure mode the optimizer can explicitly address. The output prompt is shaped by concrete evidence of where the previous prompt failed.
Why it works especially well for LLM-as-judge prompts¶
Dash highlights this as the primary place to apply the flywheel:
"LLMs as a judge have very crystal clear rubrics and evals. You know exactly what the outcome should be. You just need to have the ultimate prompt, and it's really good for that."
Judge prompts have:
- A bounded rubric (relevance 1–5, policy pass/fail, etc.).
- An oracle (human labels) to compute disagreements.
- A clear objective (disagreement count minimization).
Compared to generative agent prompts where "good" is fuzzy, judge prompts are a near-ideal optimizer target.
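Those three properties make the objective trivially computable. A minimal sketch of a judge-agreement metric, written in the `(example, prediction, optional trace)` shape DSPy metric functions take; the dict fields and helper names are illustrative assumptions, not Dash's code:

```python
def agreement_metric(example, pred, trace=None):
    """1 if the judge's rubric score matches the human gold label, else 0.
    Maximizing mean agreement is equivalent to minimizing the disagreement count."""
    return int(pred["score"] == example["gold_score"])


def disagreement_count(examples, preds):
    """The flywheel's objective: how many bullets remain."""
    return sum(1 - agreement_metric(ex, p) for ex, p in zip(examples, preds))
```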
Benefits beyond single-prompt quality¶
Dash calls out three benefits this flywheel unlocks at scale:
- Prompt optimization — primary, obvious.
- Prompt management at scale. Instead of text strings in a repo with multiple engineers editing concurrently (whack-a-mole regressions), define prompts programmatically and let DSPy spit out the actual prompt from the definition + evals. Version control shifts from prompt text to prompt specification.
- Model switching. Plug a new model in, define goals, DSPy re-optimizes the prompt for that model. Critical for modern agentic systems: a planning LLM + many narrow sub-agents each on a specialized model → switching costs collapse from per-prompt manual tuning to automated re-optimization.
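The "prompt specification" shift can be sketched as follows. Everything here is hypothetical (no such types exist in Dash or DSPy): the repo stores a structured spec, and an optimizer, stubbed out as a callable, emits the actual prompt text per model, so switching models means re-running the optimizer rather than re-editing strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Version-controlled prompt definition; the text itself is generated."""
    task: str
    rubric: str
    output_schema: str


def compile_prompt(spec, model_name, optimize=None):
    """Render a baseline prompt from the spec, then let an optimizer specialize
    it for the target model (identity when no optimizer is supplied)."""
    baseline = (
        f"Task: {spec.task}\n"
        f"Rubric: {spec.rubric}\n"
        f"Answer as: {spec.output_schema}"
    )
    return optimize(baseline, model_name) if optimize else baseline
```

Under this shape, adding a model is one more `compile_prompt` call per spec, which is the collapse in switching cost the source describes.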
Tradeoffs¶
- Only as good as the judge + eval set. Biased judge → optimizer amplifies bias. Gold-set construction remains human-intensive.
- Judge agreement can plateau. "It might be impossible to get to zero disagreements. Even humans—multiple humans—will disagree on the relevance set." The flywheel has a floor.
- Prompt opacity. DSPy-emitted prompts may be less human-readable than hand-written ones. Debugging a bad emitted prompt can require stepping back into the optimizer.
- Compute. Optimizer runs + judge runs aren't free; budget offline cycles accordingly.
- Not a substitute for rubric design. If the rubric is wrong (wrong criteria, wrong scale), the flywheel optimizes toward the wrong place quickly.
When to reach for it¶
- You have many LLM-powered components whose quality you can measure (judge scores, NDCG, pass@k, task-success rate).
- You have a prompt library big enough that hand-tuning is an engineering bottleneck (Dash: ~30 prompts, 5–15 concurrent editors).
- You expect to swap models frequently (planner LLM + sub-agent LLMs + provider A/B tests).
- Your evaluation has clean rubrics (LLM-as-judge is the canonical fit).
Seen in¶
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dash's LLM-as-judge improvement arc (baseline prompt → explanations → o3 → RAG-as-judge → DSPy); "bullet-point disagreements" as DSPy input named as the emergent discovery; prompt management at scale + model switching as stated additional benefits; ~30 prompts across Dash, 5–15 engineers tweaking concurrently.
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — the same flywheel applied to the labeling stage (one stage earlier than the evaluation stage in the 2026-01-28 transcript). DSPy's objective here is minimizing mean squared error on the 1–5 relevance scale against a small human seed set — a direct numeric objective rather than bullet-point disagreement reduction. Confirms the flywheel generalizes across Dash's labeling and evaluation stages: different objectives, same close-the-loop shape.
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — the concrete-numbers edition of the flywheel. NMSE 8.83 → 4.86 (−45%) on gpt-oss-120b via DSPy GEPA; NMSE 46.88 → 17.26 (−63%) on gemma-3-12b via DSPy MIPROv2; malformed JSON cut from 42% to <1.1% (a 97%+ reduction) on gemma-3-12b. Formalizes the feedback-string shape (prediction − gold + direction + human rationale + model reasoning + guardrail text) that feeds GEPA's reflection loop. Adds overfitting guardrails to the pattern: the optimizer will copy example-specific keywords, usernames, and verbatim document phrases into the prompt, and will modify task parameters (e.g. change the 1–5 scale to 1–3), unless those invariants are stated explicitly in the feedback string. Splits the flywheel into three regimes by risk tolerance — full rewrite for new cheap targets (patterns/cross-model-prompt-adaptation), constrained bullet selection for high-stakes production prompts (patterns/instruction-library-prompt-composition), and disagreement minimization for fixed-model iteration (this pattern). Also: adaptation cycle time down from 1–2 weeks to 1–2 days, and 10–100× label coverage on cheap targets at preserved quality.
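The labeling-stage objective from the 2026-02-26 source is plain mean squared error on the 1–5 relevance scale against the human seed set. A minimal sketch (illustrative, not Dropbox's implementation):

```python
def mse(predicted, human):
    """Mean squared error between model scores and human seed labels
    on the 1-5 relevance scale; the optimizer drives this toward zero."""
    assert len(predicted) == len(human) and predicted
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)
```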
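The feedback-string shape the 2026-03-17 source formalizes (prediction − gold, direction, human rationale, model reasoning, guardrail text) can be sketched as below. All names and wording are hypothetical; only the component list comes from the source.

```python
# Invariants the optimizer would otherwise violate, per the source's
# overfitting guardrails (wording here is an assumption).
GUARDRAILS = (
    "Do not copy example-specific keywords, usernames, or verbatim document "
    "phrases into the prompt. Do not modify the 1-5 scoring scale."
)


def feedback_string(pred, gold, human_rationale, model_reasoning):
    """Per-example feedback in the shape fed to GEPA's reflection loop:
    signed error + direction + human rationale + model reasoning + guardrails."""
    error = pred - gold
    direction = "too high" if error > 0 else "too low" if error < 0 else "correct"
    return (
        f"predicted {pred}, gold {gold} (error {error:+d}, {direction}). "
        f"Human rationale: {human_rationale} "
        f"Model reasoning: {model_reasoning} "
        f"{GUARDRAILS}"
    )
```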
Related¶
- systems/dspy — the optimizer at the center of the loop.
- concepts/llm-as-judge — the quality signal the loop optimizes.
- concepts/rag-as-a-judge — the judge extension that precedes the DSPy step in Dash's arc.
- concepts/nmse-normalized-mean-squared-error — the specific NMSE-scaled-0–100 form of the loop's objective used on the 2026-03-17 model-adaptation runs.
- concepts/structured-output-reliability — the co-equal quality axis exposed on smaller target models.
- concepts/ndcg — alternative / complementary quality signal for retrieval loops.
- systems/dropbox-dash — production instance.
- patterns/cross-model-prompt-adaptation — the cross-model-swap variant of the flywheel.
- patterns/instruction-library-prompt-composition — the constrained variant for high-stakes production prompts.