Langfuse¶
Definition¶
Langfuse is an LLM observability + evaluation + experiment-management platform. It provides trace ingestion, cost tracking, prompt management, and an automated LLM-as-judge evaluation harness for scoring Q/A pairs on rubric-based criteria. First seen on the wiki in Yelp's Biz Ask Anything (BAA) production grader stack.
Role at Yelp (2026-03-27)¶
Yelp uses Langfuse as the substrate for BAA's quality graders:
"The Langfuse-based grader is an automated evaluation system for Question/Answer pairs that uses LLMs to assess answer quality. It provides comprehensive observability, cost tracking, and experiment management through Langfuse integration, enabling detailed insights into evaluation performance and quality metrics. In production we handle this by extracting the logs for each question answering call and passing it to the langfuse based grader. This runs as a batch daily and generates statistics which are stored as a dataset on langfuse." (Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product)
Load-bearing operational properties at Yelp:
- Daily batch cadence — grader runs on sampled production Q/A pairs once per day.
- Rolling-average time series — statistics stored as a dataset on Langfuse; regressions caught via drift from baseline.
- Three grader roles: Correctness, Completeness, Evidence Relevance. See concepts/llm-as-judge.
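The daily-batch and rolling-average properties above can be sketched in plain Python. This is a minimal illustration, not Yelp's implementation and not the Langfuse SDK: the function names, the 0-to-1 score scale, the 7-day window, and the 0.05 tolerance are all assumptions, since the source does not disclose them. Each graded pair is assumed to already carry one score per grader role.

```python
from statistics import mean

# The three grader roles named on this page; the snake_case keys are an assumption.
GRADERS = ("correctness", "completeness", "evidence_relevance")


def daily_stats(graded_pairs):
    """Average each grader's score over one day's batch of graded Q/A pairs."""
    return {g: mean(pair[g] for pair in graded_pairs) for g in GRADERS}


def detect_drift(daily_series, baseline, window=7, tolerance=0.05):
    """Flag graders whose rolling average fell more than `tolerance` below baseline.

    daily_series: non-empty list of daily_stats() results, oldest first.
    baseline:     dict mapping grader name to its expected average score.
    window and tolerance are hypothetical defaults, not values from the source.
    """
    recent = daily_series[-window:]
    flagged = {}
    for g in GRADERS:
        rolling = mean(day[g] for day in recent)
        if baseline[g] - rolling > tolerance:
            flagged[g] = rolling
    return flagged


if __name__ == "__main__":
    batch = [
        {"correctness": 1.0, "completeness": 0.5, "evidence_relevance": 1.0},
        {"correctness": 0.0, "completeness": 0.5, "evidence_relevance": 1.0},
    ]
    print(daily_stats(batch))
```

In this sketch a regression shows up as a non-empty dict from `detect_drift`, keyed by the grader whose rolling average dropped; where the baseline comes from (for example, a frozen reference week) is left open.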
Role at Expedia STAR (2026-04-28)¶
Expedia's STAR (Service Telemetry Analyzer) uses Langfuse for prompt management + evaluation + tracing across its multi-step automated root-cause-analysis workflow:
"We are still in the early stages of evaluating the system. Given the complexity of this domain, we mostly rely on qualitative human assessment which includes subject matter experts (SMEs) and users. We also use Langfuse for prompt management, evaluation, and tracing. The results so far have been promising." (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer)
Load-bearing properties at Expedia:
- Prompt management — STAR's per-step prompts (role + domain + format) versioned in Langfuse.
- Tracing — every chain step + model call traced so SMEs can audit the reasoning path, not just the final output.
- Evaluation — SME-gated qualitative assessment; no automated grader disclosed at the current iteration.
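A minimal sketch of how versioned per-step prompts and step-level tracing fit together, assuming an in-memory registry. It does not use the Langfuse SDK and does not reflect STAR's actual code: `StepPrompt`, `PromptRegistry`, `Trace`, and the step name `summarize_logs` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepPrompt:
    """One chain step's prompt, split into the role + domain + format parts."""
    role: str
    domain: str
    output_format: str

    def render(self) -> str:
        return "\n\n".join((self.role, self.domain, self.output_format))


@dataclass
class PromptRegistry:
    """Hypothetical stand-in for a prompt-management store with version history."""
    versions: dict = field(default_factory=dict)

    def publish(self, step: str, prompt: StepPrompt) -> int:
        history = self.versions.setdefault(step, [])
        history.append(prompt)
        return len(history)  # 1-based version number

    def get(self, step: str, version=None) -> StepPrompt:
        history = self.versions[step]
        return history[-1] if version is None else history[version - 1]


@dataclass
class Trace:
    """Ordered record of every step, so a reviewer can audit the reasoning path."""
    spans: list = field(default_factory=list)

    def record(self, step: str, prompt_version: int, output: str) -> None:
        self.spans.append(
            {"step": step, "prompt_version": prompt_version, "output": output}
        )


if __name__ == "__main__":
    registry = PromptRegistry()
    v1 = registry.publish(
        "summarize_logs",
        StepPrompt("You are an SRE.", "Service telemetry.", "Reply in JSON."),
    )
    trace = Trace()
    trace.record("summarize_logs", v1, "<model output placeholder>")
    print(registry.get("summarize_logs").render())
    print(trace.spans)
```

Recording the prompt version on every span is the design point: an SME reading the trace can tell which prompt revision produced each intermediate output, not only the final answer.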
STAR is the wiki's second in-production Langfuse instance (after Yelp BAA) and the first at an incident-RCA altitude.
Comparison to adjacent systems¶
- MLflow — Databricks' experiment-tracking + eval platform; hosts the `judges` primitive used by Databricks' Storex. Langfuse is a pure-play LLM-observability + eval platform without MLflow's broader ML-lifecycle scope.
Caveats¶
- Stub page. The wiki's canonical Langfuse reference is the Yelp BAA ingest; deeper Langfuse architecture (trace ingestion path, prompt-management model, SDK surface) is not covered here.
Seen in¶
- sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product — Yelp's daily batch LLM-as-judge graders for BAA run on Langfuse.
- sources/2026-04-28-expedia-expedias-service-telemetry-analyzer — Expedia STAR uses Langfuse for prompt management, evaluation, and tracing in its incident-RCA pipeline. Second wiki production instance; first at incident-RCA altitude.
Related¶
- concepts/llm-as-judge — the evaluation pattern Langfuse operationalises.
- systems/yelp-biz-ask-anything — first canonical wiki consumer.
- systems/expedia-star — second canonical wiki consumer (incident-RCA altitude).
- systems/mlflow — adjacent LLM-eval platform.
- companies/yelp
- companies/expedia