Cross-model prompt adaptation¶
Cross-model prompt adaptation is the pattern of treating the prompt not as a hand-written string tied to one model, but as a compiled artifact re-generated per target model by an automated optimiser like DSPy. The forcing function: prompts hand-tuned for one model (especially frontier reasoning models) do not transfer cleanly to smaller / cheaper / open-weight models, and regression-chasing by hand takes weeks.
Forcing function¶
From the Dropbox Dash post:
"A prototype might lean on a state-of-the-art model, but real systems have latency and cost budgets, which usually means migrating to smaller or cheaper models. The catch is that prompts often don't transfer cleanly across models."
— (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy)
Manual re-tuning has three costs:
- Time — 1–2 weeks of iteration per target model at Dash.
- Quality regressions — small prompt edits ripple through corner cases unpredictably.
- Coverage — hand-tuners optimise for the examples they see, not the distribution.
Shape¶
- Fix the task + data + metric. The task ("rate relevance 1–5"), the dataset (human-labeled seed set), and the metric (NMSE vs humans, plus valid-JSON rate) all remain constant across model swaps.
- Freeze the starting prompt. The hand-tuned prompt from the previous model is the baseline, not the starting point for manual edits on the new model.
- Run the optimiser on the new model. DSPy (GEPA for feedback-driven rewrites, MIPROv2 for joint instruction and few-shot demonstration search) searches the prompt space against the fixed metric.
- Compare under identical conditions. Same eval set, same metric, same judgement rubric — only the prompt + model changed.
- Ship the optimised prompt for the new model. Keep the original prompt associated with the original model in version control; do not merge edits across models.
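The steps above can be sketched as a minimal comparison harness. This is a toy, with stub judges standing in for LLM calls and a hypothetical mean-absolute-error metric; a real run would substitute DSPy's GEPA or MIPROv2 for the hand-swapped prompts:

```python
# Fixed task + data + metric; only the (model, prompt) pair varies.
# All names and data here are hypothetical stand-ins.

def mean_abs_error(judge, eval_set):
    """Fixed metric over the frozen human-labeled seed set."""
    errs = [abs(judge(q, doc) - gold) for q, doc, gold in eval_set]
    return sum(errs) / len(errs)

# Frozen human-labeled seed set (task: rate relevance 1-5).
eval_set = [("q1", "doc1", 5), ("q2", "doc2", 1), ("q3", "doc3", 3)]

# Baseline: the prompt hand-tuned for the previous model, run as-is on
# the new model. Optimised: what the optimiser produced for the new model.
baseline_judge = lambda q, d: 3                                # stub
optimised_judge = lambda q, d: {"q1": 5, "q2": 2, "q3": 3}[q]  # stub

baseline = mean_abs_error(baseline_judge, eval_set)    # ~1.33
optimised = mean_abs_error(optimised_judge, eval_set)  # ~0.33
assert optimised < baseline  # ship the optimised prompt for this model
```

The point of the harness is the invariant: the eval set and metric never change between comparisons, so any delta is attributable to the prompt-model pair.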
Why it works¶
- The optimiser absorbs idiosyncratic model biases. Every model has a slightly different prior — an optimised prompt for `gpt-oss-120b` isn't the one for `o3` because the models have different failure modes.
- Reduces model-swap regression risk. Product teams can adopt new models as they're released without blocking on a prompt rewrite cycle.
- Increases label-generation elasticity. Cheaper target model + preserved quality → 10–100× more labels at the same cost (Dash reported figure for `gpt-oss-120b` vs `o3`).
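To make the elasticity claim concrete, a back-of-envelope calculation with hypothetical per-label prices (the 10–100× range comes from the source; these specific numbers do not):

```python
# Hypothetical prices illustrating "more labels at the same cost".
budget = 100.0        # dollars per labeling run (assumed)
cost_frontier = 0.02  # $/label on the frontier model (assumed)
cost_small = 0.0005   # $/label on the cheap open-weight model (assumed)

labels_frontier = budget / cost_frontier  # 5,000 labels
labels_small = budget / cost_small        # 200,000 labels
print(labels_small / labels_frontier)     # 40.0 -- inside the 10-100x band
```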
Results at Dash¶
| Target model | Role | Baseline NMSE | Optimised NMSE | Δ |
|---|---|---|---|---|
| `gpt-oss-120b` (120B, open weight) | Primary cost target | 8.83 | 4.86 | −45% |
| `gemma-3-12b` (12B, small) | Reliability stress test | 46.88 | 17.26 | −63% |
Plus, on `gemma-3-12b`, the malformed-JSON rate dropped from 42% to <1.1% (see concepts/structured-output-reliability).
Adaptation cycle time: 1–2 weeks manual → 1–2 days with DSPy.
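The two reported axes can be computed as follows. This is a sketch assuming one common definition of NMSE (MSE between judge and human scores, divided by the variance of the human scores); the source may normalise differently:

```python
import json

def nmse(judge_scores, human_scores):
    """Normalised MSE: 1.0 means "no better than predicting the human mean";
    lower is better."""
    n = len(human_scores)
    mean_h = sum(human_scores) / n
    var_h = sum((h - mean_h) ** 2 for h in human_scores) / n
    mse = sum((j - h) ** 2 for j, h in zip(judge_scores, human_scores)) / n
    return mse / var_h

def valid_json_rate(raw_outputs):
    """Fraction of raw judge outputs that parse as JSON."""
    def ok(s):
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False
    return sum(ok(s) for s in raw_outputs) / len(raw_outputs)

print(nmse([5, 2, 3], [5, 1, 3]))                     # 0.125
print(valid_json_rate(['{"relevance": 4}', 'oops']))  # 0.5
```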
When to reach for it¶
- You have multiple LLM vendors or sizes in the cost / latency frontier.
- You have a fixed task with a clean metric (judge rubrics, classification accuracy, structured-output validation).
- You expect to swap models frequently (vendor churn, open-weight releases, pricing changes).
- You have a human-labeled evaluation set stable enough to use as the optimisation target across model swaps.
Tradeoffs¶
- Needs an eval set. Without a fixed task + metric + data, there is nothing to optimise against. This pattern assumes you already paid the labeling loop cost.
- Full rewrites are risky on production-critical prompts. When the target is a high-stakes already-tuned prompt, full rewrites can destabilise corner-case behaviour. Switch to patterns/instruction-library-prompt-composition — constrain DSPy to selecting from a vetted instruction library, not rewriting end-to-end.
- Overfitting. The optimiser can copy example-specific keywords or alter task parameters (e.g. change the 1–5 scale to 1–3); explicit guardrails in the feedback string are required.
- Prompt opacity. Optimiser-generated prompts can be less human-readable than hand-written ones; debugging requires stepping back into the optimiser.
- Small-model capability ceiling. DSPy can optimise a small model to operational reliability but not past its capability ceiling — `gemma-3-12b` was ultimately rejected at Dash despite its 97% JSON-reliability improvement.
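The scale-drift and malformed-output guardrails can be enforced on the metric side, so the optimiser cannot "improve" its score by changing task parameters. A hedged sketch, assuming a hypothetical `relevance` output field and a hard-fail scoring policy:

```python
import json

def guarded_score(raw_output: str, gold: int) -> float:
    """Score a judge output against a human label, with guardrails:
    malformed JSON or a score off the fixed 1-5 scale scores zero."""
    try:
        parsed = json.loads(raw_output)
        score = parsed["relevance"]  # assumed output field name
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0                   # malformed JSON: hard fail
    if not isinstance(score, int) or not 1 <= score <= 5:
        return 0.0                   # off-scale output: hard fail
    return 1.0 - abs(score - gold) / 4  # 1.0 exact match, 0.0 max distance

print(guarded_score('{"relevance": 4}', gold=5))  # 0.75
print(guarded_score('{"relevance": 9}', gold=5))  # 0.0 (scale drift)
print(guarded_score('relevance: 4', gold=5))      # 0.0 (malformed JSON)
```

Wiring a check like this into the optimisation metric (or the GEPA feedback string) is one way to stop the optimiser from silently rescaling the rubric, e.g. from 1–5 to 1–3.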
Contrast with adjacent patterns¶
- Vs prompt-optimizer flywheel: same optimiser, different objective. The flywheel minimises judge-vs-human disagreement for a fixed model; cross-model adaptation retargets across models on a fixed task.
- Vs instruction-library composition: the constrained / safer sibling when the target is a high-stakes production prompt. Cross-model adaptation is the broad-exploration variant (new cheaper model, full rewrite allowed).
- Vs patterns/human-calibrated-llm-labeling: sits inside it. The calibration stage of that pattern is where model adaptation happens; cross-model adaptation is the re-calibration loop.
Seen in¶
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy
— canonical instance. Same Dash relevance-judge task adapted
across three target models (
o3,gpt-oss-120b,gemma-3-12b) via DSPy GEPA and MIPROv2; full-rewrite allowed on the cheaper targets, constrained to instruction selection ono3.
Related¶
- systems/dspy — the optimiser driving the adaptation.
- concepts/llm-as-judge — the canonical task shape the pattern is applied to.
- concepts/nmse-normalized-mean-squared-error — the alignment metric held constant across model swaps.
- concepts/structured-output-reliability — the second optimisation axis exposed when targeting smaller models.
- patterns/prompt-optimizer-flywheel — same optimiser, single-model variant.
- patterns/instruction-library-prompt-composition — the constrained variant for high-stakes production prompts.
- patterns/human-calibrated-llm-labeling — the labeling loop in which this pattern sits.