SYSTEM Cited by 4 sources
DSPy¶
DSPy (originally from Stanford NLP, now Databricks-sponsored) is a framework for programmatic LLM pipeline construction — you declare what a step should do via typed signatures + docstrings, and DSPy compiles that into prompts + few-shot examples + tool invocations. The framework decouples what the model is asked to do from how the prompt is phrased, and can optimize prompts against metrics.
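The signature-plus-docstring idea can be sketched with plain Python introspection. This is a toy illustration of the concept, not DSPy's actual compiler; `summarize` and `compile_prompt` are invented names for the sketch:

```python
import inspect
from typing import get_type_hints

def summarize(document: str) -> str:
    """Summarize the document in one sentence, preserving named entities."""

def compile_prompt(fn) -> str:
    # Derive a prompt from the declared step: the docstring says what the
    # step should do, the type hints say how its I/O is shaped.
    hints = get_type_hints(fn)
    out_type = hints.pop("return", str).__name__
    fields = ", ".join(f"{name}: {t.__name__}" for name, t in hints.items())
    return (
        f"Task: {inspect.getdoc(fn)}\n"
        f"Inputs: {fields}\n"
        f"Respond with a single value of type {out_type}."
    )
```

A framework operating at this level can re-render the same declaration for a new model, attach few-shot examples, or tighten output parsing without ever touching `summarize` itself.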
Why it matters for system design¶
- Prompts become an optimization target, not a hand-written string. That changes how agent teams structure iteration: you tune prompts and tools on measurable axes rather than by vibes.
- Tool definitions collapse to docstrings. A normal function + signature + doc is enough — the LLM infers input parsing and output interpretation. Infrastructure code (LLM client, parser, state) isn't duplicated per tool.
- Framework level, not prompt level. Swapping models, adding few-shot examples, or changing validation all happen above user code.
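A minimal sketch of the tool-registration side of this split, assuming nothing about DSPy's internals: the framework owns the registry, schema extraction, and dispatch, while each tool stays a plain function with a docstring. `Agent` and `lookup_table_size` are invented for illustration:

```python
import inspect

class Agent:
    """Toy agent frame: tools are plain functions; the framework owns
    the tool schema, dispatch, and argument plumbing, not each tool."""

    def __init__(self):
        self.tools = {}

    def tool(self, fn):
        # Registration needs nothing beyond the function itself: name,
        # parameters, and description all come from introspection.
        self.tools[fn.__name__] = {
            "fn": fn,
            "doc": inspect.getdoc(fn),
            "params": list(inspect.signature(fn).parameters),
        }
        return fn

    def call(self, name, **kwargs):
        # In a real framework the LLM chooses `name` and `kwargs`
        # from the extracted schemas; here we dispatch directly.
        return self.tools[name]["fn"](**kwargs)

agent = Agent()

@agent.tool
def lookup_table_size(table: str) -> int:
    """Return the row count of a warehouse table."""
    return {"events": 1_204_511}.get(table, 0)
```

Swapping the LLM client or the argument parser then happens inside `Agent`, above every registered tool at once.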
Named optimisers¶
- GEPA — feedback-driven iterative improvement. Ingests structured per-example feedback strings (gap direction + magnitude, human rationale + model reasoning + guardrail text) and runs a reflection loop: evaluate → surface failure modes in plain language → revise prompt → repeat. Used on the Dash gpt-oss-120b relevance-judge adaptation; NMSE 8.83 → 4.86 (−45%) (Source: sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy).
- MIPROv2 — used on the Dash gemma-3-12b stress-test run. Cut NMSE from 46.88 → 17.26 and the malformed-JSON rate from 42% → <1.1%. Evidence that DSPy optimisers target operational reliability as well as alignment when malformed outputs are counted as fully incorrect.
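The GEPA-style evaluate → reflect → revise cycle can be sketched as a generic loop. This is a structural sketch only; the real optimiser's scoring, feedback format, and revision step are far richer:

```python
def reflection_loop(prompt, examples, evaluate, revise, rounds=3):
    """Toy GEPA-style loop: evaluate, collect plain-language feedback
    on failures, revise the prompt, repeat, keeping the best seen."""
    best_prompt, best_score = prompt, float("inf")   # lower NMSE is better
    for _ in range(rounds):
        score, feedback = evaluate(prompt, examples)  # per-example feedback
        if score < best_score:
            best_prompt, best_score = prompt, score
        prompt = revise(prompt, feedback)             # reflect and rewrite
    return best_prompt, best_score
```

Here `evaluate` returns a scalar score plus per-example feedback strings, and `revise` is where the reflection happens, e.g. another LLM call that reads the feedback and rewrites the prompt.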
Operating modes at Dash¶
Dash uses DSPy at three different change radii, depending on the risk tolerance of the target prompt:
- Full end-to-end rewrite — new cheaper target model (gpt-oss-120b, gemma-3-12b), broad exploration allowed. See patterns/cross-model-prompt-adaptation.
- Constrained bullet selection — high-stakes production o3 prompt; DSPy selects from a human-curated instruction library rather than rewriting wording.
- Disagreement-minimisation loop — closed-loop iteration on a fixed model, driven by structured judge-vs-human disagreements. See patterns/prompt-optimizer-flywheel.
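The constrained bullet-selection mode reduces to a search over subsets of a vetted instruction library. A brute-force sketch, fine for small libraries; `score` is any disagreement counter you supply, and all names are invented for illustration:

```python
from itertools import combinations

def select_bullets(library, score, max_bullets=3):
    """Toy constrained-selection mode: search subsets of a vetted
    instruction library instead of rewriting wording. `score` returns
    a disagreement count for a candidate prompt (lower is better)."""
    best, best_score = (), float("inf")
    for k in range(1, max_bullets + 1):
        for subset in combinations(library, k):
            s = score("\n".join(subset))
            if s < best_score:
                best, best_score = subset, s
    return list(best), best_score
```

Because the optimiser can only pick from human-curated wording, the failure surface shrinks: it cannot invent new instructions, only recombine vetted ones.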
Overfitting failure modes (and guardrails)¶
Observed at Dash when DSPy runs without explicit constraints:
- Copying example-specific content — keywords, usernames, verbatim document phrases pasted directly into the optimised prompt. Improves training-set scores; fails to generalise.
- Modifying task parameters — changing the 1–5 rating scale to 1–3 or 1–4 because shorter scales happen to reduce disagreement arithmetically.
Mitigation: encode invariants directly in the feedback string the optimiser consumes, e.g. "avoid overfitting to specific example(s)… Do not include exact examples or keywords from them in the prompt… ensure you do not change the basic parameters of the task (e.g. changing the rating range to be anything but 1–5)." The optimiser is a gradient-following surface; the feedback channel has to carry the invariants.
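Those invariants can also be enforced mechanically after each optimisation round. A toy post-hoc check with heuristics invented for illustration; the Dash mitigation encodes the invariants in the feedback string itself, and this is only a complementary belt-and-braces validator:

```python
import re

def check_invariants(prompt, train_examples):
    """Toy guardrail check for an optimised prompt: flag a tampered
    rating scale and copied example content. Heuristics illustrative."""
    violations = []
    # 1. Task parameters: the 1-5 rating scale must survive optimisation.
    if not re.search(r"1\s*[-–to]+\s*5", prompt):
        violations.append("rating scale is no longer 1-5")
    # 2. Example leakage: long verbatim spans lifted from training examples.
    for ex in train_examples:
        for i in range(len(ex) - 20):
            if ex[i:i + 20] in prompt:
                violations.append(f"verbatim training content: {ex[i:i+20]!r}")
                break
    return violations
```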
Seen in¶
- sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks' internal systems/storex agent framework is "DSPy-inspired": Scala tools + docstrings + framework-owned prompt/LLM/state plumbing. The post cites Databricks' DSPy docs and frames the pattern as the key enabler of fast agent iteration.
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dropbox Dash production usage. DSPy sits at the end of Dash's LLM-as-judge disagreement-reduction arc (baseline → prompt refinement → o3 → RAG-as-a-judge → DSPy), with an emergent usage pattern documented explicitly: instead of feeding DSPy raw prompts, Dash feeds it bullet-pointed judge-vs-human disagreements and lets DSPy minimize the disagreement set (the patterns/prompt-optimizer-flywheel). Clemm also states three scale-level benefits beyond single-prompt quality: (a) LLM-as-judge prompts as the canonical DSPy target (clear rubrics + oracle labels + clear objective); (b) prompt management at scale — Dash has ~30 prompts across ingest / judge / offline evals / online agentic platform with 5–15 engineers editing at any time; programmatic-spec-+-DSPy beats hand-edited strings in a repo ("whack-a-mole" regressions); (c) model switching — plug a new model in, define goals, DSPy spits out an optimized prompt; critical for agentic systems with a planning LLM + many narrow sub-agents on specialized models. Framed as "absolutely essential at scale."
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — the labeling-pipeline companion, naming DSPy explicitly by URL (dspy.ai) as "a library for programmatically optimizing LLM prompts against defined evaluation targets." Canonical target at Dash: the labeling (not evaluation) judge prompt, optimising against MSE on the 1–5 relevance scale (squared-error range 0–16) computed against a small human-labeled seed set. Complementary framing to the 2026-01-28 disagreement-bullet reduction — same loop, different objective (MSE directly vs disagreement count). Confirms DSPy is operational across multiple stages of Dash's LLM stack (judging + labeling), not just one.
- sources/2026-03-17-dropbox-optimized-dash-relevance-judge-dspy — third in the Dash DSPy trilogy, model-adaptation edition. Turns DSPy from "in the toolbox" into a production cross-model adaptation mechanism with concrete deltas. Names two DSPy optimisers: GEPA (feedback-driven rewrites on gpt-oss-120b, cut NMSE 45% from 8.83 → 4.86) and MIPROv2 (on gemma-3-12b, cut NMSE from 46.88 → 17.26 and the malformed-JSON rate from 42% → <1.1% — a 97%+ reduction on structured-output reliability). Adaptation cycle time: 1–2 weeks manual → 1–2 days with DSPy. Label coverage: 10–100× more labels at the same cost when swapping from o3 to gpt-oss-120b. Introduces three adaptation regimes by risk tolerance: full rewrite for cheap new targets (patterns/cross-model-prompt-adaptation), constrained instruction-library bullet selection for production o3 (patterns/instruction-library-prompt-composition), disagreement-minimisation for fixed-model iteration (patterns/prompt-optimizer-flywheel). Names the overfitting failure modes (copying example-specific content, modifying the 1–5 scale) and the mitigation (invariants encoded directly in the feedback string).
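For reference, a variance-normalised reading of the NMSE objective used throughout these sources; the Dash posts report NMSE numbers without publishing the exact normalisation, so the formula below is an assumption, not their definition:

```python
def nmse(preds, labels):
    """Assumed NMSE: mean squared error divided by the variance of the
    human labels, so 0 is perfect and 1 matches always predicting the
    label mean. The Dash posts do not spell out their normalisation."""
    n = len(labels)
    mean = sum(labels) / n
    mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / n
    var = sum((y - mean) ** 2 for y in labels) / n
    return mse / var
```

Under this reading, values well above 1 (e.g. a judge scored with malformed outputs counted as fully incorrect) mean the judge is doing much worse than a constant mean predictor.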
Related¶
- systems/mlflow — co-deployed at Databricks for prompt optimization + eval.
- systems/storex — the Databricks agent platform that bakes these patterns internally.
- systems/dropbox-dash — Dropbox's production DSPy consumer.
- patterns/tool-decoupled-agent-framework
- patterns/prompt-optimizer-flywheel — the judge-disagreement-→-bullets-→-DSPy loop from Dropbox Dash.
- patterns/cross-model-prompt-adaptation — retarget the same prompt across cheaper / different models via DSPy.
- patterns/instruction-library-prompt-composition — constrained DSPy mode: select from vetted bullets, don't rewrite.
- concepts/llm-as-judge — the target pattern DSPy excels at optimizing.
- concepts/nmse-normalized-mean-squared-error — the scalar alignment objective at Dash.
- concepts/structured-output-reliability — second optimisation axis exposed on smaller target models.