
SYSTEM Cited by 7 sources

MLflow

MLflow is an open-source ML lifecycle platform that originated at Databricks. It covers experiment tracking, model packaging, a model registry, and (in MLflow 3) GenAI evaluation, including LLM judges and prompt-optimization tooling. Databricks builds its internal agent-evaluation infrastructure on top of it.

Why it matters for system design

  • LLM judges are a first-class primitive: a separate LLM scores another model's output against a rubric, surfacing regressions that human review cannot scale to catch. This is the evaluation loop for non-deterministic agents.
  • Prompt-optimization tech: MLflow's GenAI surfaces compose with frameworks like systems/dspy to iterate on prompts against measurable metrics.
  • Snapshot + replay workflows for agents rest on MLflow's tracking/eval primitives at Databricks.
  • MLflow 3 GenAI tracing is the tracing substrate for Unity AI Gateway — specifically named for the Claude Code integration path. This positions MLflow as the observability plane for governed coding-agent traffic inside a Databricks customer's fleet. See concepts/centralized-ai-governance pillar 1 (security + audit).
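
The judge-based evaluation loop from the first bullet can be sketched in a few lines. This is a minimal illustration, not MLflow's API: `keyword_judge` is a toy stand-in for an LLM judge, and the `pass_rate` and `regressed` helpers and the 0.05 tolerance are assumptions made up for the example.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (question, answer) -> True if the answer meets the rubric.
Judge = Callable[[str, str], bool]


def keyword_judge(question: str, answer: str) -> bool:
    """Toy stand-in for an LLM judge: the rubric here is 'the answer cites a source'."""
    return "source:" in answer.lower()


def pass_rate(judge: Judge, pairs: Sequence[tuple[str, str]]) -> float:
    """Fraction of (question, answer) pairs the judge accepts."""
    if not pairs:
        raise ValueError("need at least one example to score")
    return sum(judge(q, a) for q, a in pairs) / len(pairs)


def regressed(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the candidate's pass rate drops by more than `tolerance`."""
    return baseline - candidate > tolerance


# Outputs of two agent versions on the same questions (illustrative data).
baseline_outputs = [
    ("What is MLflow?", "An ML lifecycle platform. Source: docs"),
    ("Who started it?", "Databricks. Source: project page"),
]
candidate_outputs = [
    ("What is MLflow?", "An ML lifecycle platform. Source: docs"),
    ("Who started it?", "Databricks."),
]

baseline_rate = pass_rate(keyword_judge, baseline_outputs)
candidate_rate = pass_rate(keyword_judge, candidate_outputs)
print(baseline_rate, candidate_rate, regressed(baseline_rate, candidate_rate))
```

Running the same judge over both versions is what makes the comparison meaningful: the agent's outputs are non-deterministic, but the rubric is held fixed.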

Seen in

  • sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog: OTel-tracing-direct-to-Unity-Catalog face. Sixth+ MLflow face on the wiki: not just experiment tracking, prompt optimisation, snapshot-replay, governed-coding-agent observability, model-version-registry-for-audit, or continuous-eval-against-concept-drift (the prior faces), but the framework-side OTel instrumentation surface that writes spans, logs, and metrics directly to UC-managed Delta tables via Zerobus Ingest. Features named in the post: the dual mlflow.<lib>.autolog() + @mlflow.trace instrumentation pattern; table provisioning via MLflow (creates the six UC OTel trace tables: _otel_spans, _otel_logs, _otel_metrics, _otel_annotations, _trace_unified, _trace_metadata); removal of the per-experiment trace cap ("Previous limits on traces per experiment are no longer applicable"); agent-runs-anywhere portability ("In fact the support assistant agent example that was used for this blog is deployed locally"); and the prod-traces-bootstrap-eval flow: "One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases." The same judges run continuously on live traces for production monitoring: "MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns." Native MLflow Experiment UI dashboards cover "trace volume, errors, latency, token usage, and cost"; "For most teams, that's enough to monitor day-to-day agent health." Customer scale points: Experian ("hundreds of thousands of traces"), Superhuman ("hundreds of thousands of traces per day", explicitly replacing a custom point solution: "that maintenance burden was a real pain point for our teams"), SmartSheet ("tens of thousands of evaluations" in a "three-day co-build"), and The Standard (governing prompts in UC).
Composes with systems/mlflow-otel-tracing (the tracing surface), systems/zerobus-ingest (managed receiver), systems/uc-otel-trace-tables (storage), concepts/instrumentation-storage-decoupling (OTel as protocol-portable boundary), concepts/production-traces-as-evaluation-substrate, patterns/managed-otel-ingestion-direct-to-lakehouse, patterns/bootstrap-eval-dataset-from-production-traces, and patterns/component-level-latency-from-otel-spans.
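
The bootstrap-from-real-traces flow in the entry above can be sketched as a filter-and-dedupe pass over trace records. This is a minimal sketch under stated assumptions: the trace shape (`request`, `response`, `status` keys) and the output row layout are hypothetical, not the UC trace-table schema or an MLflow dataset format.

```python
from typing import Iterable


def bootstrap_eval_dataset(traces: Iterable[dict], limit: int = 100) -> list[dict]:
    """Build eval rows from trace records, keeping the first trace per distinct prompt.

    Each trace is assumed to be a dict with "request", "response", and "status"
    keys; this shape is hypothetical and chosen only for the example.
    """
    seen: set[str] = set()
    rows: list[dict] = []
    for trace in traces:
        prompt = trace["request"].strip()
        key = prompt.lower()
        # Skip errored traces, empty prompts, and prompts already captured.
        if trace.get("status") != "OK" or not prompt or key in seen:
            continue
        seen.add(key)
        rows.append({"inputs": {"question": prompt}, "observed_response": trace["response"]})
        if len(rows) >= limit:
            break
    return rows


# Illustrative trace records, not real production data.
traces = [
    {"request": "How do I reset my password?", "response": "Use the reset link.", "status": "OK"},
    {"request": "how do I reset my password? ", "response": "Click reset.", "status": "OK"},
    {"request": "Where is my invoice?", "response": "", "status": "ERROR"},
    {"request": "Cancel my plan", "response": "Done.", "status": "OK"},
]
dataset = bootstrap_eval_dataset(traces)
print(len(dataset))
```

The observed response is kept alongside each prompt for reviewer context; whether it is good enough to serve as an expected answer is a separate labeling decision.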

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: Production-monitoring-against-concept-drift face. New MLflow face on the wiki: not just experiment tracking, prompt optimisation, snapshot-replay, observability for governed-coding-agent traffic, or model-version-registry-for-audit (the prior five faces) but the continuous-evaluation substrate against concept drift in production, paired with a deliberately conservative LLM-as-judge ternary. "We implemented a comprehensive evaluation strategy using 'LLM as a Judge' alongside manual labeling sessions. MLflow capabilities allowed us to constantly evaluate model performance to prevent concept drift." Inside the CSAF→Delta ETL: "we let a dedicated judge model review another model's response and decide whether it looks acceptable. The judge's job is simple and conservative: mark each result as pass, looks correct, fail, looks wrong, or unknown, not enough information." Judge outputs persist in Delta tables; "custom MLflow GenAI judges" run structured evaluations, "giving us a consistent way to monitor quality, compare versions, and catch regressions across many LLM use cases — without building a bespoke evaluation stack for every new workflow." Composes with concepts/llm-as-judge (the third LLM-as-judge face; see the canonical concept page) and patterns/llm-judge-as-inline-pipeline-stage. Canonical wiki instance: systems/claroty-cps-library (Entity Resolution at 17M+ assets).
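
The conservative pass / fail / unknown ternary described above can be sketched as follows. This is an illustration only: `stub_judge` is a hypothetical stand-in for the judge model call, and the rule that any reply other than an exact "pass" or "fail" maps to unknown is an assumption about what "conservative" means here, not the source's implementation.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    PASS = "pass"        # looks correct
    FAIL = "fail"        # looks wrong
    UNKNOWN = "unknown"  # not enough information


def conservative_verdict(raw: Optional[str]) -> Verdict:
    """Map a judge's raw reply to a verdict, defaulting to UNKNOWN.

    Anything other than an exact "pass" or "fail" becomes UNKNOWN, so a
    malformed or hedged reply never counts as a pass.
    """
    normalized = (raw or "").strip().lower()
    if normalized == "pass":
        return Verdict.PASS
    if normalized == "fail":
        return Verdict.FAIL
    return Verdict.UNKNOWN


def judge_batch(responses: list[str], judge: Callable[[str], Optional[str]]) -> Counter:
    """Run the judge over each response and tally the verdicts."""
    return Counter(conservative_verdict(judge(r)) for r in responses)


def stub_judge(response: str) -> Optional[str]:
    """Hypothetical stand-in for the judge model call."""
    if not response:
        return None
    if "vendor=" in response:
        return "PASS"
    if "???" in response:
        return "fail"
    return "probably fine?"


tally = judge_batch(["vendor=Acme model=X1", "???", "model=X1", ""], stub_judge)
print(tally)
```

Keeping unknown as its own bucket means ambiguous results can be routed to manual labeling rather than silently inflating the pass rate.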

  • sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse: Regulated-ML model-version-registry-for-audit face. New MLflow face on the wiki: not just experiment tracking and LLM evaluation (the prior four faces) but the versioning substrate that anchors per-prediction SHAP attributions to the exact model version that produced them, in regulated decision-support contexts. "Every prediction carries a SHAP attribution stored as a governed Unity Catalog Delta table — versioned in MLflow, lineaged through Unity Catalog, queryable — the rationale behind a site selection is as auditable as the score itself." The load-bearing property: when a regulator asks about a recommendation made nine months ago, the audit trail leads via MLflow to the exact model version that produced it, not to the current production version. Regulatory backdrop: 21 CFR Part 11, ICH E6(R3), FDA GMLP. Reference implementation: systems/site-feasibility-workbench uses MLflow to version TA-segmented LightGBM models trained on sponsor CTMS / EDC / IRT history; SHAP attributions land in UC governed Delta tables with the MLflow run-id as a primary lineage reference. Canonical wiki instance of concepts/governed-shap-attribution-table: MLflow is the versioning leg of the three-property substrate (versioning + lineage + SQL queryability).
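
The audit property described above (a past prediction resolves to the model version that produced it, not the one currently in production) can be illustrated with a small lookup. Everything here is hypothetical: the `Attribution` row layout, the feature names, the run-id strings, and the in-memory list standing in for the governed Delta table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    prediction_id: str
    run_id: str         # hypothetical stand-in for the MLflow run-id lineage reference
    model_version: int  # version that produced this prediction
    shap_values: dict   # feature name -> contribution to the score


# Stand-in for the governed attribution table; rows are illustrative only.
ATTRIBUTIONS = [
    Attribution("pred-001", "run-a1", 3, {"enrollment_rate": 0.42, "site_startup_days": -0.10}),
    Attribution("pred-002", "run-b7", 5, {"enrollment_rate": 0.15, "site_startup_days": 0.05}),
]
CURRENT_PRODUCTION_VERSION = 5


def audit(prediction_id: str) -> Attribution:
    """Return the attribution row recorded at prediction time."""
    for row in ATTRIBUTIONS:
        if row.prediction_id == prediction_id:
            return row
    raise KeyError(f"no attribution recorded for {prediction_id}")


def top_driver(row: Attribution) -> str:
    """Feature with the largest absolute contribution to the score."""
    return max(row.shap_values, key=lambda name: abs(row.shap_values[name]))


record = audit("pred-001")
print(record.model_version, CURRENT_PRODUCTION_VERSION, top_driver(record))
```

Because the version and run reference are stored on the row itself, the lookup does not depend on whatever model happens to be deployed when the question is asked.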

  • sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai: Reproducibility co-requirement face. MLflow is named alongside Delta time travel + CI/CD for pipelines as the experiment + model-version tracking leg of the lakehouse's reproducibility story. "Reproducibility: versioning and time travel for datasets, CI/CD for pipelines/jobs, and MLflow for experiment and model version tracking." Extends MLflow's wiki role beyond LLM-eval + tracing into the training-experiment-history substrate for multimodal pipelines landing in Delta under UC.

  • sources/2025-12-03-databricks-ai-agent-debug-databases: the post references MLflow's LLM judges docs as the scoring tool for Storex's validation framework and names MLflow prompt-optimization tech as the inspiration for their internal DSPy-style framework.
  • sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway: MLflow named as the centralised tracing substrate for Unity AI Gateway ("centralized tracing with MLflow"), specifically via the Claude Code integration doc link.
  • sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation: Yelp's Back-Testing Engine uses MLflow as the experiment store + cross-candidate visualization substrate for ad-budget-allocation simulations. Every candidate's input parameters and output metrics get logged to MLflow, which runs on a remote server; MLflow's UI provides comparison across candidates without extra coding. This is the wiki's first non-LLM-evaluation MLflow Seen-in: MLflow as a domain-general experiment database, outside its usual model-training context.
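
The log-every-candidate, compare-afterwards shape of the back-testing entry above can be sketched without a tracking server. This is a plain-Python stand-in, not Yelp's engine or MLflow's API: `CandidateRun`, the `simulate` formula, and the parameter names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateRun:
    name: str
    params: dict                                 # inputs logged for the candidate
    metrics: dict = field(default_factory=dict)  # outputs logged for the candidate


def simulate(params: dict) -> dict:
    """Toy stand-in for a back-test; the formula is invented for illustration."""
    spend = params["daily_budget"] * params["pacing_pct"] // 100
    return {"spend": spend, "clicks": spend * 4 // 5}


def run_candidates(grid: list[dict]) -> list[CandidateRun]:
    """Record each candidate's inputs and outputs, as an experiment store would."""
    return [
        CandidateRun(name=f"candidate-{i}", params=params, metrics=simulate(params))
        for i, params in enumerate(grid)
    ]


def best_by(runs: list[CandidateRun], metric: str) -> CandidateRun:
    """Pick the run with the highest value for `metric`."""
    return max(runs, key=lambda r: r.metrics[metric])


runs = run_candidates([
    {"daily_budget": 100, "pacing_pct": 50},
    {"daily_budget": 100, "pacing_pct": 90},
])
winner = best_by(runs, "clicks")
print(winner.name, winner.metrics)
```

Storing params and metrics per run is what lets a generic UI compare candidates without domain-specific code.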
Last updated · 542 distilled / 1,571 read