PATTERN
Instruction library prompt composition¶
Instruction-library prompt composition is the constrained / risk-calibrated alternative to full DSPy prompt rewrites: instead of letting the optimiser restructure the whole prompt, humans maintain a library of single-line instruction bullets ("rules of thumb") and DSPy is limited to selecting and composing bullets from the library. The prompt grows by accretion of small, reviewable additions rather than being regenerated end-to-end.
Forcing function¶
From the Dropbox Dash post (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy):
"When the target was our production o3 judge — already strong and widely depended on — the constraint flipped. Our goal was to make targeted improvements without destabilizing behavior relied on across multiple pipelines."
The high-stakes production prompt has two properties that rule out full rewrites:
- High baseline quality. Already manually tuned; large edits have asymmetric downside (many places to regress, few places to improve).
- High blast radius. Consumed by ranking + training-data generation + offline evaluation; a behavioural drift in one corner propagates through multiple pipelines.
Full DSPy rewrites could restructure the prompt in ways that shift corner-case behaviour unpredictably. Hence the constraint.
Shape¶
- Human writes bullets from disagreements. For each case where the judge substantially disagreed with a human, a human writes a short explanation of what the judge misunderstood and what it should have paid attention to.
- Distil each explanation into a single bullet. A reusable "rule of thumb" the model can follow. Example from the Dash post: "Documents older than a year should be rated at least one point lower unless they are clearly evergreen."
- Build an instruction library. Accumulate vetted bullets. Each bullet is reviewable, testable, and revertable independently — "small PRs with tests."
- DSPy selects + composes. The optimiser's job is to choose which bullets to include in the final composed prompt and in what order / combination — not to rewrite wording.
- Evaluate against the fixed metric. Include / exclude each bullet based on whether it improves alignment on the eval set without triggering unintended side effects.
- Ship the composed prompt. The baseline wording remains stable; only the bullet list grows / reorders.
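The select-and-compose step above can be sketched as a greedy search over the library against the fixed metric. This is an illustrative sketch only, not DSPy's actual optimiser API: the `score` callable (which runs the judge with a candidate prompt over the eval set and returns an alignment score) and the simple append-only composition are assumptions.

```python
from typing import Callable, List

def greedy_compose(
    baseline_prompt: str,
    library: List[str],
    score: Callable[[str], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedily select bullets that improve the eval metric.

    `score` is assumed to run the judge with a candidate prompt over the
    fixed eval set and return an alignment score (higher is better).
    The baseline wording is never rewritten; bullets are only appended.
    """
    selected: List[str] = []
    best = score(baseline_prompt)
    remaining = list(library)
    while remaining:
        # Score each unused bullet in the context of those already chosen.
        gains = [
            (score(baseline_prompt + "\n" + "\n".join(selected + [b])) - best, b)
            for b in remaining
        ]
        gain, bullet = max(gains)
        if gain <= min_gain:
            break  # no remaining bullet helps; stop growing the prompt
        selected.append(bullet)
        remaining.remove(bullet)
        best += gain
    return selected
```

Because each accepted bullet maps to one selection decision against the metric, a regression traces back to a single library entry, which is what makes the "small PRs with tests" framing work.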
Why it works¶
- Bounded change surface. Each bullet is a discrete, atomic change. Regressions are easier to diagnose because each commit maps to a single bullet.
- Human-vetted semantics. Bullets come from human error analyses, not from the optimiser hallucinating generalisations from training examples. This flips the overfitting failure mode of unconstrained DSPy (copying example-specific keywords into the prompt).
- Reversibility. A bad bullet gets reverted like a git revert. A bad full-rewrite prompt requires re-running the optimiser from a known-good checkpoint.
- Preserves voice. Stakeholders (product, legal, ops) who have signed off on the baseline prompt language don't have to re-review a new machine-written prompt.
Contrast with full-rewrite DSPy¶
| Axis | Full rewrite (GEPA / MIPROv2 end-to-end) | Instruction-library composition |
|---|---|---|
| Change radius | Entire prompt | One bullet at a time |
| When to use | New cheaper target model, exploratory | Production-critical, already-strong baseline |
| Regression risk | High, opaque | Low, auditable |
| Optimiser role | Generates wording | Selects from a library |
| Rollback | Re-run with different seed | git revert the bullet |
| Used at Dash for | gpt-oss-120b, gemma-3-12b | Production o3 judge |
Both live under the umbrella of patterns/cross-model-prompt-adaptation. Instruction-library is the conservative sibling.
Tradeoffs¶
- Ceiling on improvement. Bullets only capture what humans have already understood. Novel failure modes the humans haven't yet diagnosed won't be addressed until someone writes the bullet.
- Library maintenance burden. Someone has to triage new disagreements and write bullets. Scales with ongoing human effort per model update.
- Bullet interaction effects. Two bullets can each be fine alone but conflict when composed. DSPy's selection over the library should surface this, but a compound metric may hide it.
- Slower improvement curve. Per-run deltas are smaller than full rewrites — "incremental improvements" vs "45% reduction in NMSE" on cheaper-model runs.
- Loses the model-switching benefit. Bullets written for o3 may not transfer to a different model; this mode is specifically for the high-stakes single-model prompt.
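The interaction-effect concern can be probed directly before composition: compare each pair's joint effect on the metric against the sum of its members' solo effects. A minimal sketch, using a hypothetical `interaction_report` helper that is not part of DSPy; the `score` callable is the same assumed eval-set scorer as elsewhere in this note.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

def interaction_report(
    baseline_prompt: str,
    bullets: List[str],
    score: Callable[[str], float],
) -> Dict[Tuple[str, str], float]:
    """Flag bullet pairs whose joint effect on the metric differs from
    the sum of their individual effects (a sign of conflicting guidance)."""
    base = score(baseline_prompt)
    # Effect of each bullet appended on its own.
    solo = {b: score(baseline_prompt + "\n" + b) - base for b in bullets}
    report: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(bullets, 2):
        joint = score(baseline_prompt + "\n" + a + "\n" + b) - base
        # 0.0 means the pair is independent; a negative value flags conflict.
        report[(a, b)] = joint - (solo[a] + solo[b])
    return report
```

A strongly negative entry is exactly the case the tradeoff describes: two individually vetted bullets whose combination drags the compound metric down without either looking bad alone.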
When to reach for it¶
- Your prompt is already strong on its current model.
- The prompt is consumed by multiple downstream pipelines (high blast radius on regressions).
- You have humans doing post-disagreement analyses; the bullet-writing step is already happening informally.
- You want auditable, reviewable, atomic prompt edits more than maximum per-iteration quality deltas.
- You want the baseline prompt language preserved for stakeholder sign-off reasons.
Seen in¶
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — the Dash o3 production-judge optimisation mode. Humans distil post-disagreement explanations into single-line bullets; DSPy selects + composes; the prompt "grows by assembling the most helpful additional guidance rather than being constantly rewritten." Framed as "small PRs with tests" rather than "a large-scale refactor."
Related¶
- systems/dspy — the optimiser, constrained to selection.
- patterns/prompt-optimizer-flywheel — the general loop this pattern is a risk-calibrated variant of.
- patterns/cross-model-prompt-adaptation — sibling pattern; the unconstrained variant used on new cheap targets.
- concepts/llm-as-judge — the target the Dash instance optimises.
- systems/dash-relevance-ranker — the downstream consumer whose stability this pattern protects.