MLflow¶
MLflow is an open-source ML lifecycle platform that originated at Databricks: experiment tracking, model packaging, a model registry, and (in MLflow 3) GenAI evaluation, including LLM judges and prompt-optimization tooling. It's the house Databricks builds its internal agent-evaluation infrastructure inside.
Why it matters for system design¶
- LLM judges are a first-class primitive: a separate LLM scores another model's output against a rubric, surfacing regressions that human evaluation can't catch at scale. This is the evaluation loop for non-deterministic agents.
- Prompt-optimization tooling — MLflow's GenAI surfaces pair with frameworks like systems/dspy to iterate on prompts against measurable metrics.
- At Databricks, snapshot-and-replay workflows for agents rest on MLflow's tracking and evaluation primitives.
- MLflow 3 GenAI tracing is the tracing substrate for Unity AI Gateway — specifically named for the Claude Code integration path. This positions MLflow as the observability plane for governed coding-agent traffic inside a Databricks customer's fleet. See concepts/centralized-ai-governance pillar 1 (security + audit).
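The LLM-judge loop in the first bullet can be sketched as follows. This is a minimal illustration, not MLflow's actual API: the names `judge_llm`, `RUBRIC`, `score_output`, and `detect_regression` are all hypothetical, and the judge call is stubbed so the sketch runs offline. MLflow 3's judge surfaces wrap an equivalent rubric-score-compare loop.

```python
# Hypothetical sketch of an LLM-as-judge evaluation loop.
# In a real setup, judge_llm would call a separate judge model via an
# LLM client; here it returns a canned verdict so the sketch is runnable.

RUBRIC = "Score the answer 1-5 for factual accuracy, then give a reason."

def judge_llm(prompt: str) -> str:
    """Stand-in for a call to the judge model."""
    return "3\nReason: partially correct."

def score_output(question: str, answer: str) -> int:
    """Ask the judge model to grade one agent output against the rubric."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    verdict = judge_llm(prompt)
    return int(verdict.split("\n")[0])  # first line holds the numeric score

def detect_regression(baseline: list[int], candidate: list[int],
                      tolerance: float = 0.2) -> bool:
    """Flag a regression when the candidate run's mean judge score drops
    by more than `tolerance` versus the recorded baseline run."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(candidate) < mean(baseline) - tolerance

baseline_scores = [4, 5, 4]  # judge scores logged on a prior, known-good run
candidate_scores = [score_output("What is MLflow?", "An ML platform.")
                    for _ in range(3)]
print(detect_regression(baseline_scores, candidate_scores))  # → True
```

The point of routing this through a tracking layer like MLflow is that baseline scores are logged per run, so a regression gate like `detect_regression` can compare any candidate against history rather than a hard-coded list.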
Seen in¶
- sources/2025-12-03-databricks-ai-agent-debug-databases — the post references MLflow's LLM judges docs as the scoring tool for Storex's validation framework and names MLflow's prompt-optimization tooling as the inspiration for their internal DSPy-style framework.
- sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway — MLflow named as the centralized tracing substrate for Unity AI Gateway ("centralized tracing with MLflow"), specifically via the Claude Code integration doc link.
Related¶
- systems/dspy — shared lineage + Databricks deployment story.
- systems/storex — internal agent platform using MLflow-style judges for regression.
- systems/unity-ai-gateway — tracing substrate role.
- concepts/llm-as-judge
- concepts/centralized-ai-governance
- patterns/snapshot-replay-agent-evaluation