
PATTERN

Instruction library prompt composition

Instruction-library prompt composition is the constrained / risk-calibrated alternative to full DSPy prompt rewrites: instead of letting the optimiser restructure the whole prompt, humans maintain a library of single-line instruction bullets ("rules of thumb") and DSPy is limited to selecting and composing bullets from the library. The prompt grows by accretion of small, reviewable additions rather than being regenerated end-to-end.

Forcing function

From the Dropbox Dash post (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy):

"When the target was our production o3 judge — already strong and widely depended on — the constraint flipped. Our goal was to make targeted improvements without destabilizing behavior relied on across multiple pipelines."

The high-stakes production prompt has two properties that rule out full rewrites:

  1. High baseline quality. Already manually tuned; large edits have asymmetric downside (many places to regress, few places to improve).
  2. High blast radius. Consumed by ranking + training-data generation + offline evaluation; a behavioural drift in one corner propagates through multiple pipelines.

Full DSPy rewrites could restructure the prompt in ways that shift corner-case behaviour unpredictably. Hence the constraint.

Shape

  1. Human writes bullets from disagreements. For each case where the judge substantially disagreed with a human, a human writes a short explanation of what the judge misunderstood and what it should have paid attention to.
  2. Distil each explanation into a single bullet. A reusable "rule of thumb" the model can follow. Example from the Dash post: "Documents older than a year should be rated at least one point lower unless they are clearly evergreen."
  3. Build an instruction library. Accumulate vetted bullets. Each bullet is reviewable, testable, and revertable independently — "small PRs with tests."
  4. DSPy selects + composes. The optimiser's job is to choose which bullets to include in the final composed prompt and in what order / combination — not to rewrite wording.
  5. Evaluate against the fixed metric. Include / exclude each bullet based on whether it improves alignment on the eval set without triggering unintended side effects.
  6. Ship the composed prompt. The baseline wording remains stable; only the bullet list grows / reorders.
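Steps 4 and 5 can be sketched as a greedy subset search over the library against a fixed eval metric. This is a minimal illustration, not DSPy's actual optimiser API: the example bullets, the `compose` helper, and the `score` callback are all assumptions for the sketch.

```python
from typing import Callable, List, Tuple

# Hypothetical library of human-vetted, single-line bullets.
BULLETS = [
    "Documents older than a year should be rated at least one point lower "
    "unless they are clearly evergreen.",
    "Prefer exact-title matches over partial matches when scoring relevance.",
    "Penalize documents whose owner has left the company.",
]

def compose(base_prompt: str, selected: List[str]) -> str:
    """Append selected bullets to the fixed baseline prompt; never rewrite it."""
    if not selected:
        return base_prompt
    rules = "\n".join(f"- {b}" for b in selected)
    return f"{base_prompt}\n\nAdditional guidance:\n{rules}"

def greedy_select(
    base_prompt: str,
    bullets: List[str],
    score: Callable[[str], float],
) -> Tuple[List[str], float]:
    """Keep each bullet only if it improves the eval-set metric."""
    selected: List[str] = []
    best = score(compose(base_prompt, selected))
    for bullet in bullets:
        candidate = score(compose(base_prompt, selected + [bullet]))
        if candidate > best:  # include/exclude decision from step 5
            selected.append(bullet)
            best = candidate
    return selected, best
```

A real run would replace `score` with the alignment metric over the eval set, and the search could consider orderings and combinations rather than a single greedy pass; the key property is that only bullet membership changes, never the baseline wording.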

Why it works

  • Bounded change surface. Each bullet is a discrete, atomic change. Regressions are easier to diagnose because each commit maps to a single bullet.
  • Human-vetted semantics. Bullets come from human error analyses, not from the optimiser hallucinating generalisations from training examples. This flips the overfitting failure mode of unconstrained DSPy (copying example-specific keywords into the prompt).
  • Reversibility. A bad bullet gets reverted like a git revert. A bad full-rewrite prompt requires re-running the optimiser from a known-good checkpoint.
  • Preserves voice. Stakeholders (product, legal, ops) who have signed off on the baseline prompt language don't have to re-review a new machine-written prompt.

Contrast with full-rewrite DSPy

| Axis | Full rewrite (GEPA / MIPROv2 end-to-end) | Instruction-library composition |
| --- | --- | --- |
| Change radius | Entire prompt | One bullet at a time |
| When to use | New cheaper target model, exploratory | Production-critical, already-strong baseline |
| Regression risk | High, opaque | Low, auditable |
| Optimiser role | Generates wording | Selects from a library |
| Rollback | Re-run with different seed | `git revert` the bullet |
| Used at Dash for | gpt-oss-120b, gemma-3-12b | Production o3 judge |

Both live under the umbrella of patterns/cross-model-prompt-adaptation. Instruction-library composition is the conservative sibling.

Tradeoffs

  • Ceiling on improvement. Bullets only capture what humans have already understood. Novel failure modes the humans haven't yet diagnosed won't be addressed until someone writes the bullet.
  • Library maintenance burden. Someone has to triage new disagreements and write bullets. Scales with ongoing human effort per model update.
  • Bullet interaction effects. Two bullets can each be fine alone but conflict in composition. DSPy's selection over the library should surface this but may hide it under a compound metric.
  • Slower improvement curve. Per-run deltas are smaller than full rewrites — "incremental improvements" vs "45% reduction in NMSE" on cheaper-model runs.
  • Loses the model-switching benefit. Bullets written for o3 may not transfer to a different model; the mode is specifically for the high-stakes single-model prompt.
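The interaction-effects risk can be checked directly rather than trusted to a compound metric: score each bullet solo and each pair jointly, and flag pairs whose joint gain falls below the better solo gain. A minimal sketch; the `compose` helper and the `score` callback here are illustrative assumptions, not DSPy API.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

def compose(base_prompt: str, selected: List[str]) -> str:
    # Hypothetical composer: fixed baseline plus appended bullets.
    return "\n".join([base_prompt, *selected])

def pairwise_conflicts(
    base_prompt: str,
    bullets: List[str],
    score: Callable[[str], float],
) -> List[Tuple[str, str]]:
    """Flag bullet pairs whose joint gain is below the better solo gain."""
    baseline = score(compose(base_prompt, []))
    solo: Dict[str, float] = {
        b: score(compose(base_prompt, [b])) - baseline for b in bullets
    }
    conflicts = []
    for a, b in combinations(bullets, 2):
        joint = score(compose(base_prompt, [a, b])) - baseline
        if joint < max(solo[a], solo[b]):  # composition hurt one of them
            conflicts.append((a, b))
    return conflicts
```

This is quadratic in library size, so it suits the small, human-curated libraries the pattern assumes; flagged pairs become candidates for merging or mutual exclusion in the selection step.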

When to reach for it

  • Your prompt is already strong on its current model.
  • The prompt is consumed by multiple downstream pipelines (high blast radius on regressions).
  • You have humans doing post-disagreement analyses; the bullet-writing step is already happening informally.
  • You want auditable, reviewable, atomic prompt edits more than maximum per-iteration quality deltas.
  • You want the baseline prompt language preserved for stakeholder sign-off reasons.

Seen in

  • sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — the Dash o3 production-judge optimisation mode. Humans distil post-disagreement explanations into single-line bullets; DSPy selects + composes; the prompt "grows by assembling the most helpful additional guidance rather than being constantly rewritten." Framed as "small PRs with tests" rather than "a large-scale refactor."