
Langfuse

Definition

Langfuse is an LLM observability + evaluation + experiment-management platform. It provides trace ingestion, cost tracking, prompt management, and an automated LLM-as-judge evaluation harness for scoring Q/A pairs against rubric-based criteria. First seen on the wiki in Yelp's Biz Ask Anything production grader stack.

Role at Yelp (2026-03-27)

Yelp uses Langfuse as the substrate for BAA's quality graders:

"The Langfuse-based grader is an automated evaluation system for Question/Answer pairs that uses LLMs to assess answer quality. It provides comprehensive observability, cost tracking, and experiment management through Langfuse integration, enabling detailed insights into evaluation performance and quality metrics. In production we handle this by extracting the logs for each question answering call and passing it to the langfuse based grader. This runs as a batch daily and generates statistics which are stored as a dataset on langfuse." (Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product)
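The batch flow in the quote (extract logs for each question-answering call, score each Q/A pair with LLM judges, aggregate daily statistics) can be sketched as follows. This is a runnable toy, not Yelp's implementation: the record shape, the criterion names, and the `judge` stub are illustrative assumptions, and a real grader would call an LLM with a rubric prompt where the stub sits.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative rubric criteria, mirroring the three grader roles
CRITERIA = ("correctness", "completeness", "evidence_relevance")

@dataclass
class QAPair:
    question: str
    answer: str
    evidence: str  # retrieved context the answer should be grounded in

def judge(pair: QAPair, criterion: str) -> float:
    """Stand-in for an LLM-as-judge call returning a 0..1 score.

    A real grader would prompt an LLM with a rubric for `criterion`
    and parse a numeric score out of its response.
    """
    # Toy heuristic so the sketch runs without an LLM backend.
    return 1.0 if pair.answer else 0.0

def grade_daily_batch(pairs: list[QAPair]) -> dict[str, float]:
    """Score every sampled Q/A pair on each criterion and aggregate.

    The per-criterion means returned here are the 'statistics' that
    would be stored as a dataset on Langfuse.
    """
    return {c: mean(judge(p, c) for p in pairs) for c in CRITERIA}

batch = [QAPair("Is the cafe dog-friendly?",
                "Yes, per recent reviews.",
                "review snippet mentioning dogs")]
stats = grade_daily_batch(batch)
```

One aggregate dict per day is what makes the rolling-average time series below possible: each batch run appends a point per criterion.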

Load-bearing operational properties at Yelp:

  • Daily batch cadence — grader runs on sampled production Q/A pairs once per day.
  • Rolling-average time series — statistics stored as a dataset on Langfuse; regressions caught via drift from baseline.
  • Three grader roles: Correctness, Completeness, Evidence Relevance. See concepts/llm-as-judge.
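The regression-catching mechanic in the second bullet can be sketched like this; the seven-day window, the 0.05 threshold, and the sample scores are illustrative assumptions, not values from the source.

```python
from statistics import mean

def drift_alert(history: list[float], today: float,
                window: int = 7, threshold: float = 0.05) -> bool:
    """Flag a regression when today's mean grader score falls more than
    `threshold` below the rolling average of the last `window` days."""
    if len(history) < window:
        return False  # not enough baseline yet
    baseline = mean(history[-window:])
    return (baseline - today) > threshold

# Daily correctness means, as pulled from the stored Langfuse dataset
# (illustrative values)
scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.92]
drift_alert(scores, 0.90)  # within threshold -> False
drift_alert(scores, 0.80)  # ~0.11 below baseline -> True
```

Keying the alert off a rolling baseline rather than a fixed bar absorbs day-to-day judge noise while still surfacing sustained drops.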

Comparison to adjacent systems

  • MLflow — Databricks' experiment-tracking + eval platform; hosts the judges primitive used by Databricks' Storex. Langfuse is a pure-play LLM-observability + eval platform, without MLflow's broader ML-lifecycle scope.

Caveats

  • Stub page. The wiki's canonical Langfuse reference is the Yelp BAA ingest; deeper Langfuse architecture (trace-ingestion path, prompt-management model, SDK surface) is not walked through here.

Seen in
