DROPBOX 2026-03-17 Tier 2

How we optimized Dash's relevance judge with DSPy

Summary

Dropbox Tech post on how the Dash relevance judge — an LLM-as-judge that scores (query, document) pairs on a 1–5 scale — was adapted across three different target models using DSPy as a systematic optimiser. Prior Dash posts framed DSPy as "in the toolbox"; this one turns it into a production adaptation mechanism with concrete deltas: NMSE cut 45% (8.83 → 4.86) on gpt-oss-120b, malformed JSON cut 97%+ (42% → ~1%) on gemma-3-12b, and model-adaptation time dropped from 1–2 weeks of manual iteration to 1–2 days. The post also names the two DSPy optimisers used (GEPA for feedback-driven full rewrites, MIPROv2 for the JSON-reliability run) and introduces an instruction-library / bullet-selection mode for the high-stakes production o3 judge where full rewrites are too risky. Two independent quality axes are made first-class: alignment (NMSE vs humans) and operational reliability (JSON validity rate); DSPy is shown to improve both simultaneously.

Key takeaways

  1. Prompts don't transfer cleanly across models. Dash's judge prompt was hand-tuned for OpenAI o3 and hit quality / cost ceilings. When applied to the cheaper open-weight gpt-oss-120b, quality dropped under NMSE. Manual re-tuning would take "weeks of iteration and regression chasing." This is the canonical forcing function for patterns/cross-model-prompt-adaptation: "prompts often don't transfer cleanly across models." (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy)

  2. DSPy GEPA cut NMSE 45% on gpt-oss-120b. Manual baseline prompt: NMSE 8.83. DSPy-GEPA-optimised prompt: NMSE 4.86. GEPA is described as "a method that iteratively improves prompts by analyzing where the model disagrees with humans and generating feedback." Direction-aware feedback is constructed for each disagreement: predicted rating − expected rating, human rationale, model reasoning, plus explicit guardrail instructions. The optimiser runs the reflection loop on these structured feedback strings, not on raw scalar NMSE.

  3. Adaptation cycle time: 1–2 weeks → 1–2 days. Named as the second-order benefit: model-swap regression risk collapses because the optimisation loop is automated. "Swap in newly released models with less regression risk and keep the judge aligned with evolving product needs." This is the operational consequence of making prompts a compiled artifact.

  4. Label-coverage multiplier: 10–100× at same cost. The optimised gpt-oss-120b judge could generate 10–100× more labels than o3 within the same budget, increasing the training corpus for the downstream ranker and reducing overfitting to a small evaluation set. This feeds the patterns/human-calibrated-llm-labeling loop — the cheaper the judge, the larger the amplification multiplier over the human seed set.
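The multiplier is plain budget arithmetic: at a fixed labeling budget, a judge that is k times cheaper per label yields k times more labels. A minimal sketch — the per-label costs below are made-up placeholders, not figures from the post:

```python
# Back-of-envelope label throughput at a fixed budget. Costs are kept in
# integer cents to avoid floating-point division artifacts. All dollar
# figures here are hypothetical placeholders.

def labels_at_budget(budget_cents: int, cost_cents_per_label: int) -> int:
    """Number of labels a judge can produce within the budget."""
    return budget_cents // cost_cents_per_label

budget = 100_000  # $1,000, same budget for both judges
o3_labels = labels_at_budget(budget, 50)   # assume $0.50/label -> 2,000 labels
oss_labels = labels_at_budget(budget, 1)   # assume $0.01/label -> 100,000 labels
print(oss_labels // o3_labels)             # 50 - inside the post's 10-100x range
```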

  5. NMSE is the scoring metric, normalised to 0–100. "Normalized mean squared error (NMSE), a metric that summarizes the model's average disagreement with humans as a single number… scaled to a 0–100 range." 0 = perfect agreement, higher = worse. Complements the MSE 0–16 metric from the 2026-02-26 labeling post — same underlying signal, different normalisation (scaled to 0–100 here vs. raw 0–16 on the 1–5 scale). Introduced on its own page as concepts/nmse-normalized-mean-squared-error.
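The post does not give the normalisation factor explicitly (see Caveats), but one plausible construction is to scale MSE by its maximum on the 1–5 scale, (5 − 1)² = 16, then multiply by 100. A minimal sketch under that assumption:

```python
# Hedged sketch: one plausible NMSE construction consistent with the
# post's description (0 = perfect agreement, 100 = worst case on a 1-5
# scale). The exact normalisation used by Dropbox is an assumption here.

def nmse(predicted, expected, scale_min=1, scale_max=5):
    """Mean squared error scaled to 0-100 on the given rating scale."""
    assert len(predicted) == len(expected) and predicted
    max_sq_err = (scale_max - scale_min) ** 2  # 16 on a 1-5 scale
    mse = sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(predicted)
    return mse / max_sq_err * 100

print(nmse([5, 1, 3], [5, 1, 3]))  # 0.0   - perfect agreement
print(nmse([1, 1], [5, 5]))        # 100.0 - maximal disagreement
```

Under this normalisation, the reported NMSE of 8.83 would correspond to a raw MSE of about 1.41 on the 0–16 scale used in the 2026-02-26 post.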

  6. Operational reliability is a separate, first-class axis. The judge's JSON output is consumed programmatically by other pipeline components. Malformed JSON → "examples may be dropped, batches can fail, and evaluation metrics become unreliable." Malformed outputs are counted as fully incorrect regardless of score content — formatting failures aren't cosmetic. New concept: concepts/structured-output-reliability.

  7. Small-model JSON brittleness: 40% → <1.1% malformed. Baseline gemma-3-12b (12B param model): 358 of 856 responses (42%) were malformed JSON — it "was unreliable from an operational standpoint" before considering alignment quality. After DSPy MIPROv2 optimisation: 9 of 856 (1.05%) malformed — a 97%+ reduction. NMSE also dropped from 46.88 → 17.26 on the same run. Evidence that DSPy optimises structural reliability alongside semantic alignment.
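The failure-mode accounting above can be sketched directly: a response that fails to parse, or that lacks a usable rating, is assigned the maximum squared error rather than being dropped. The `"rating"` field name is an assumption, not Dropbox's actual schema:

```python
import json

# Hedged sketch of "malformed outputs counted as fully incorrect":
# any response that does not parse as JSON with a valid 1-5 rating
# gets the worst-case squared error instead of being silently skipped.

MAX_SQ_ERR = 16  # worst case on a 1-5 scale: (5 - 1) ** 2

def squared_error(raw_response: str, expected: int) -> int:
    try:
        rating = json.loads(raw_response)["rating"]
        if not 1 <= rating <= 5:
            raise ValueError("rating outside the 1-5 scale")
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return MAX_SQ_ERR  # malformed: fully incorrect, regardless of content
    return (rating - expected) ** 2

print(squared_error('{"rating": 4}', 5))  # 1  - valid, off by one point
print(squared_error('rating: four', 5))   # 16 - malformed, max penalty
```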

  8. GEPA vs MIPROv2 — two named optimisers with different objectives. GEPA = feedback-driven iterative prompt improvement (used for gpt-oss-120b; custom textual feedback from judge-vs-human gap + rationales). MIPROv2 = used for the gemma-3-12b run; reported in the result table as the prompt source. First production references to both named DSPy optimisers in this wiki.

  9. Overfitting guardrails are not automatic. In early runs the optimiser would "overfit by copying specific keywords, usernames, or verbatim document phrases directly into prompts." It would also "modify key task parameters, such as changing the rating scale from 1–5 to 1–3 or 1–4." Dropbox added explicit guardrails in the feedback string ("avoid overfitting to specific example(s)… Do not include exact examples or keywords from them in the prompt… ensure you do not change the basic parameters of the task"). Teaches that the optimiser is a gradient-following surface, not a domain expert — the feedback channel has to carry the invariants you want preserved.

  10. Three adaptation regimes, not one. DSPy is used at three different change radii, chosen by risk tolerance: (a) Full rewrite when adapting to a new, cheaper target model (gpt-oss-120b, gemma-3-12b) — broad exploration, end-to-end optimisation, GEPA / MIPROv2 free to restructure the whole prompt; (b) Constrained incremental edits when the target is the already-strong production o3 judge — large rewrites are too risky because the prompt is depended on across multiple pipelines; instead, humans write short bullet-instructions ("rules of thumb") distilled from post-disagreement explanations, and DSPy only selects and composes bullets from a library. "Small PRs with tests, not a large-scale refactor." This is a new named pattern: patterns/instruction-library-prompt-composition.
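The constrained regime (b) restricts the optimiser's search space to selection and ordering over a fixed, human-written bullet library; the base prompt is never rewritten. A minimal sketch of the composition step — the bullets and base prompt are illustrative, and the DSPy search loop over `selected` is omitted:

```python
# Hedged sketch of instruction-library prompt composition: the optimiser
# may only choose which human-written bullets to include, not rewrite
# the prompt. Base prompt and library contents are illustrative.

BASE_PROMPT = "Rate the document's relevance to the query on a 1-5 scale."

BULLET_LIBRARY = [
    "Documents older than a year should be rated at least one point "
    "lower unless they are clearly evergreen.",
    "An exact title match alone does not justify a 5.",
    "Penalize documents the querying user cannot access.",
]

def compose_prompt(selected: list[int]) -> str:
    """Build a candidate prompt from a subset of library bullets.

    In the real system a DSPy optimiser would search over `selected`;
    here that search is left out."""
    bullets = "\n".join(f"- {BULLET_LIBRARY[i]}" for i in selected)
    return f"{BASE_PROMPT}\nRules of thumb:\n{bullets}"

print(compose_prompt([0, 2]))
```

Keeping the edit radius this small is what makes the regime safe for a prompt that multiple pipelines depend on: each candidate differs from production by a reviewable handful of bullets.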

  11. Feedback-string as the DSPy input contract. The code snippet in the post shows the exact feedback-construction shape: direction of gap + magnitude + human rationale + model's own reasoning + guardrail string. This is a direct realisation of the bullet-point-disagreements-as-optimiser-input pattern from the 2026-01-28 post, formalised as code.
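The shape can be sketched as follows — variable names and wording are illustrative, not Dropbox's actual code, and the function is assumed to be called only on disagreements:

```python
# Hedged sketch of the feedback-string contract summarized above:
# direction + magnitude of the gap, human rationale, the model's own
# reasoning, then the anti-overfitting guardrails from takeaway 9.

GUARDRAILS = (
    "Avoid overfitting to specific examples; do not copy exact keywords "
    "or document phrases into the prompt, and do not change the basic "
    "parameters of the task (the rating scale stays 1-5)."
)

def build_feedback(predicted: int, expected: int,
                   human_rationale: str, model_reasoning: str) -> str:
    """Construct direction-aware textual feedback for one disagreement."""
    gap = predicted - expected  # assumed nonzero: only disagreements reach here
    direction = "higher" if gap > 0 else "lower"
    return (
        f"The judge rated this {abs(gap)} point(s) {direction} than the "
        f"human rater (predicted {predicted}, expected {expected}).\n"
        f"Human rationale: {human_rationale}\n"
        f"Model reasoning: {model_reasoning}\n"
        f"{GUARDRAILS}"
    )

feedback = build_feedback(5, 3, "Doc is stale.", "Title matched the query.")
```

The key point is that the optimiser's reflection loop consumes these structured strings, not the scalar NMSE — the invariants you want preserved have to travel through this channel.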

  12. gemma-3-12b ultimately rejected for production. Despite a 97% JSON-reliability improvement, the judge was still "too weak" for Dash's highest-quality production paths. Named operational value: "DSPy allowed us to reach that conclusion quickly and with measurable evidence" — the framework's value is the decision-acceleration loop, not a single quality delta.

  13. Judge relevance scoring is central to multiple pipelines. "Ranking, training data generation, and offline evaluation" — all three depend on the judge. Reinforces the labeling loop framing from the 2026-02-26 post and the eval-loop framing from the 2026-01-28 transcript: same judge component, three consumers, one optimisation target.

Numbers / concrete details

  • NMSE baseline vs optimised on gpt-oss-120b: 8.83 → 4.86 (−45%).
  • NMSE baseline vs optimised on gemma-3-12b: 46.88 → 17.26 (−63%).
  • Malformed JSON on gemma-3-12b: 358/856 (42%) → 9/856 (1.05%) — a 97%+ reduction.
  • Valid JSON on gemma-3-12b: 498 → 847 out of 856.
  • Adaptation cycle time: 1–2 weeks manual → 1–2 days with DSPy.
  • Label generation throughput at constant cost: 10–100× more labels on the cheaper target model vs o3.
  • Models tested: OpenAI o3 (baseline production, reasoning-optimised, expensive); gpt-oss-120b (open weight, 120B params, primary cost target); gemma-3-12b (Google, 12B params, stress-test of small-model operational reliability).
  • Optimiser names: GEPA (feedback-driven, used on gpt-oss-120b); MIPROv2 (used on gemma-3-12b).
  • Evaluation metric: NMSE on the 1–5 relevance scale, scaled to 0–100 (0 = perfect agreement, higher = worse).
  • Failure-mode accounting: malformed JSON counted as fully incorrect, regardless of score content.
  • o3 incremental-optimisation mechanism: instruction library of single-line "rules-of-thumb" bullets (e.g. "Documents older than a year should be rated at least one point lower unless they are clearly evergreen"); DSPy selects + composes, does not rewrite.
  • Overfitting failure modes observed: (a) copying example-specific keywords / usernames / verbatim document phrases into the prompt; (b) changing the 1–5 rating scale to 1–3 or 1–4.
  • Feedback-string shape (from code listing): prediction−gold difference + direction (higher/lower) + human rationale + model's reasoning + guardrail instruction (no example-specific content, don't modify rating scale).

Caveats

  • No per-step o3 incremental numbers disclosed. The post shows a cumulative-improvement chart for the instruction-library rollout on o3, but the axis values and per-step deltas aren't given in the text.
  • No end-to-end ranking-quality (NDCG) numbers. NMSE improvements on the judge are reported; whether those propagated to NDCG on the ranker downstream is not reported in this post.
  • Training-set / eval-set size not given. We see 856 total responses on the gemma-3-12b table; overall eval-set size, human-seed-set size, and training-corpus size aren't enumerated.
  • o3 instruction-library size unstated. Number of bullets in the library, selection cardinality per composed prompt, and cadence of additions aren't given.
  • Same MSE-vs-NMSE framing, different normalisation. MSE on 1–5 scale has range 0–16 (see sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search); NMSE here is scaled to 0–100. The post doesn't give the conversion factor explicitly.
  • GEPA vs MIPROv2 choice heuristic not spelled out. Why GEPA for gpt-oss-120b and MIPROv2 for gemma-3-12b? The post lists both but doesn't explain the model-to-optimiser mapping.
  • Production runtime judge not identified. gpt-oss-120b is the primary cost target but whether it's the current production judge (replacing o3) or an alternative path is implicit, not stated.
  • Reasoning-capability vs parameter-count confounds. o3 is reasoning-optimised; attributing the quality gaps to cost/size alone versus differences in reasoning capability across the three models is tricky from this post alone.

Relationship to existing wiki pages

Source
