
PATTERN

Cross-model prompt adaptation

Cross-model prompt adaptation is the pattern of treating the prompt not as a hand-written string tied to one model, but as a compiled artifact re-generated per target model by an automated optimiser like DSPy. The forcing function: prompts hand-tuned for one model (especially frontier reasoning models) do not transfer cleanly to smaller / cheaper / open-weight models, and regression-chasing by hand takes weeks.

Forcing function

From the Dropbox Dash post:

"A prototype might lean on a state-of-the-art model, but real systems have latency and cost budgets, which usually means migrating to smaller or cheaper models. The catch is that prompts often don't transfer cleanly across models."

— (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy)

Manual re-tuning has three costs:

  1. Time — 1–2 weeks of iteration per target model at Dash.
  2. Quality regressions — small prompt edits ripple through corner cases unpredictably.
  3. Coverage — hand-tuners optimise for the examples they see, not the distribution.

Shape

  1. Fix the task + data + metric. The task ("rate relevance 1–5"), the dataset (human-labeled seed set), and the metric (NMSE vs humans, plus valid-JSON rate) all remain constant across model swaps.
  2. Freeze the starting prompt. The hand-tuned prompt from the previous model is the baseline, not the starting point for manual edits on the new model.
  3. Run the optimiser on the new model. DSPy (GEPA for feedback-driven rewrites, MIPROv2 for alternative settings) searches the prompt space against the fixed metric.
  4. Compare under identical conditions. Same eval set, same metric, same judgement rubric — only the prompt + model changed.
  5. Ship the optimised prompt for the new model. Keep the original prompt associated with the original model in version control; do not merge edits across models.
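The steps above can be sketched in stdlib-only Python, with a toy candidate search standing in for DSPy's GEPA/MIPROv2. All names (`adapt_prompt`, `run_judge`, the demo data) are illustrative, not the Dash implementation:

```python
import statistics

def nmse(predicted, human):
    """Normalised mean squared error of judge scores vs human labels."""
    mse = sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(human)
    return mse / statistics.pvariance(human)

def adapt_prompt(candidate_prompts, run_judge, eval_set):
    """Steps 3-4: score every candidate on the NEW model against the
    fixed eval set + metric, and keep the best. `run_judge(prompt, item)`
    stands in for a call to the target model."""
    human = [h for _, h in eval_set]
    return min(
        candidate_prompts,
        key=lambda p: nmse([run_judge(p, x) for x, _ in eval_set], human),
    )

# Toy demo: a "model" whose behaviour depends on the prompt wording.
eval_set = [("doc-a", 5), ("doc-b", 1), ("doc-c", 3)]
truth = {"doc-a": 5, "doc-b": 1, "doc-c": 3}

def run_judge(prompt, item):
    # Degenerate judge: only follows the rubric if the prompt states it.
    return truth[item] if "rate 1-5" in prompt else 3

best = adapt_prompt(["be helpful", "rate 1-5 strictly"], run_judge, eval_set)
```

The key property mirrored here is step 4: only the prompt varies; task, eval set, and metric stay fixed, so candidates are directly comparable.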

Why it works

  • The optimiser absorbs idiosyncratic model biases. Every model has a slightly different prior — an optimised prompt for gpt-oss-120b isn't the one for o3 because the models have different failure modes.
  • Reduces model-swap regression risk. Product teams can adopt new models as they're released without blocking on a prompt rewrite cycle.
  • Increases label-generation elasticity. Cheaper target model + preserved quality → 10–100× more labels at the same cost (Dash reported figure for gpt-oss-120b vs o3).

Results at Dash

| Target model | Role | Baseline NMSE | Optimised NMSE | Δ |
| --- | --- | --- | --- | --- |
| gpt-oss-120b (120B, open weight) | Primary cost target | 8.83 | 4.86 | −45% |
| gemma-3-12b (12B, small) | Reliability stress test | 46.88 | 17.26 | −63% |

Plus, on gemma-3-12b, the malformed-JSON rate dropped from 42% to <1.1% (see concepts/structured-output-reliability).
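The valid-JSON rate is cheap to track as a secondary metric alongside NMSE. A minimal sketch, assuming the judge returns a JSON object with a `score` field (the field name is illustrative):

```python
import json

def valid_json_rate(outputs):
    """Fraction of raw model outputs that parse as a JSON object
    carrying the expected field."""
    ok = 0
    for raw in outputs:
        try:
            obj = json.loads(raw)
            ok += isinstance(obj, dict) and "score" in obj
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Two well-formed outputs out of four -> rate 0.5 on this toy sample.
outputs = ['{"score": 4}', 'Score: 4', '{"score": 2}', '{bad json']
rate = valid_json_rate(outputs)
```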

Adaptation cycle time: 1–2 weeks manual → 1–2 days with DSPy.
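The Δ column follows directly from the baseline and optimised NMSE figures:

```python
def pct_delta(baseline, optimised):
    """Relative NMSE change, rounded to whole percent."""
    return round(100 * (optimised - baseline) / baseline)

# gpt-oss-120b and gemma-3-12b rows from the table above.
deltas = (pct_delta(8.83, 4.86), pct_delta(46.88, 17.26))
```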

When to reach for it

  • You have multiple LLM vendors or sizes in the cost / latency frontier.
  • You have a fixed task with a clean metric (judge rubrics, classification accuracy, structured-output validation).
  • You expect to swap models frequently (vendor churn, open-weight releases, pricing changes).
  • You have a human-labeled evaluation set stable enough to use as the optimisation target across model swaps.

Tradeoffs

  • Needs an eval set. Without a fixed task + metric + data, there is nothing to optimise against. This pattern assumes you already paid the labeling loop cost.
  • Full rewrites are risky on production-critical prompts. When the target is a high-stakes already-tuned prompt, full rewrites can destabilise corner-case behaviour. Switch to patterns/instruction-library-prompt-composition — constrain DSPy to selecting from a vetted instruction library, not rewriting end-to-end.
  • Overfitting. The optimiser can copy example-specific keywords or alter task parameters (e.g. change the 1–5 scale to 1–3); explicit guardrails in the feedback string are required.
  • Prompt opacity. Optimiser-generated prompts can be less human-readable than hand-written ones; debugging requires stepping back into the optimiser.
  • Small-model capability ceiling. DSPy can optimise a small model to operational reliability but not past its capability ceiling — gemma-3-12b was ultimately rejected at Dash despite its 97% JSON-reliability improvement.
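One way to harden against the overfitting tradeoff above is a guardrail check that rejects optimiser-rewritten prompts which drop fixed task parameters, such as the 1–5 scale. A sketch with illustrative patterns, not the Dash rule set:

```python
import re

# Task parameters that must survive any rewrite (illustrative).
REQUIRED_PATTERNS = [
    r"\b1\s*(?:-|to|\u2013)\s*5\b",  # the 1-5 rating scale
    r"\bJSON\b",                     # the output-format instruction
]

def missing_guardrails(candidate_prompt):
    """Return the required patterns a candidate prompt fails to match;
    an empty list means the candidate is safe to evaluate."""
    return [
        p for p in REQUIRED_PATTERNS
        if not re.search(p, candidate_prompt, re.IGNORECASE)
    ]

ok = missing_guardrails("Rate relevance 1-5 and answer in JSON.")
bad = missing_guardrails("Rate relevance 1-3 and answer in JSON.")  # scale altered
```

Running this before scoring each candidate keeps the optimiser from silently changing the task while it searches the prompt space.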

Contrast with adjacent patterns

  • Vs prompt-optimizer flywheel: same optimiser, different objective. The flywheel minimises judge-vs-human disagreement for a fixed model; cross-model adaptation retargets across models on a fixed task.
  • Vs instruction-library composition: the constrained / safer sibling when the target is a high-stakes production prompt. Cross-model adaptation is the broad-exploration variant (new cheaper model, full rewrite allowed).
  • Vs patterns/human-calibrated-llm-labeling: sits inside it. The calibration stage of that pattern is where model adaptation happens; cross-model adaptation is the re-calibration loop.

Seen in

  • sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy
