

LLM evaluation dimensions

Definition

LLM evaluation dimensions are the top-level axes of a rubric used to score an LLM-powered system's output quality — each dimension decomposes into granular per-criterion True/False checks (concepts/binary-vs-graded-llm-scoring), and per-criterion scores aggregate into a holistic session / item score.

Rubric design "begins with clearly defining what 'good' looks like" (Instacart LACE). Picking the dimensions is the highest-leverage decision — changing dimensions post-hoc invalidates all accumulated evaluation data.

Instacart LACE's five dimensions (chatbot-specific)

LACE scores customer-support chatbot sessions against:

  1. Query Understanding — did the chatbot correctly interpret what the user was asking?
  2. Answer Correctness — sub-criteria: contextual relevancy, factual correctness, consistency, usefulness.
  3. Chat Efficiency — conciseness, turn count to resolution.
  4. Client Satisfaction — user-perceived helpfulness.
  5. Compliance — tone, politeness, professionalism, policy adherence.

Two load-bearing meta-properties:

  • Hierarchical. Each dimension has multiple criteria; each criterion is scored binary; aggregation goes criteria → dimension → session.
  • Rationale alongside verdict. Every criterion emits both True/False + free-form explanation — the rationale "helps surface the model's reasoning and guide future refinement."

(Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm.)
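
The hierarchical structure above maps naturally onto a small data model. The sketch below is illustrative only: the class names (`CriterionResult`, `DimensionResult`), the `session_score` helper, and the equal-weight mean aggregation are assumptions, since the LACE write-up does not publish its aggregation scheme.

```python
from dataclasses import dataclass

# Hypothetical data model for a hierarchical rubric. Names and the
# equal-weight aggregation are assumptions for illustration, not LACE's code.

@dataclass
class CriterionResult:
    name: str        # e.g. "factual correctness"
    passed: bool     # binary verdict from the LLM judge
    rationale: str   # free-form explanation emitted alongside the verdict

@dataclass
class DimensionResult:
    name: str                        # e.g. "Answer Correctness"
    criteria: list[CriterionResult]

    @property
    def score(self) -> float:
        # criteria -> dimension: fraction of binary checks that passed
        return sum(c.passed for c in self.criteria) / len(self.criteria)

def session_score(dimensions: list[DimensionResult]) -> float:
    # dimension -> session: unweighted mean; weighting is a design choice
    return sum(d.score for d in dimensions) / len(dimensions)

# Example: one dimension with two binary criteria, each carrying a rationale.
answer_correctness = DimensionResult("Answer Correctness", [
    CriterionResult("contextual relevancy", True,
                    "The reply addresses the user's refund question."),
    CriterionResult("factual correctness", False,
                    "The quoted refund window contradicts the policy in context."),
])
print(answer_correctness.score)             # 0.5
print(session_score([answer_correctness]))  # 0.5
```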

Complexity tiers within a rubric

LACE observed that evaluation criteria, across all five dimensions, group into three complexity tiers based on the judge's context requirement — a useful framing for any LLM-as-judge rubric:

| Tier | Example | Judge requirement | LACE accuracy (debate engine) |
| --- | --- | --- | --- |
| Simple | Compliance → professionalism | Universal standards; no business context | "near-perfect" |
| Context-dependent | Answer Correctness → contextual relevancy | Business-model knowledge (e.g. "shoppers use company-authorized cards, not the customer's payment method") | >90% (with static Instacart knowledge template; RAG is future work) |
| Subjective | Chat Efficiency → answer conciseness | Varying human preferences; LLMs apply stricter brevity standards than humans | Intentionally not pursued |

Tier determines how to spend iteration budget:

  • Simple: cheap wins; the debate setup plus a well-written binary criterion suffices.
  • Context-dependent: embed operational knowledge into the judge's prompt; plan for dynamic retrieval (concepts/context-engineering) as the knowledge base grows (see the prompt sketch after this list).
  • Subjective: keep as a directional check only; fix the system under evaluation, not the judge; refining ambiguous criteria is "a low-ROI path".
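
For the context-dependent tier, here is a minimal sketch of what "embed operational knowledge into the judge's prompt" can look like. The `build_judge_prompt` helper, the prompt wording, and the JSON output convention are assumptions; LACE describes a static Instacart knowledge template but does not publish it. The single business fact shown is the example quoted in the table above.

```python
# Illustrative judge-prompt builder for a context-dependent criterion.
# Everything here is a sketch: LACE's actual knowledge template and prompt
# format are not published.

BUSINESS_KNOWLEDGE = """\
- Shoppers pay with company-authorized cards, not the customer's payment method.
"""  # static template today; swap in retrieved snippets as the knowledge base grows

def build_judge_prompt(criterion: str, transcript: str) -> str:
    return (
        "You are evaluating one customer-support chatbot session.\n\n"
        "Business context to treat as ground truth:\n"
        f"{BUSINESS_KNOWLEDGE}\n"
        f"Criterion to judge (answer True or False): {criterion}\n"
        'Respond as JSON: {"passed": true|false, "rationale": "<one short paragraph>"}\n\n'
        f"Transcript:\n{transcript}"
    )
```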

Criteria overlap problem

Criteria that look orthogonal on a whiteboard often aren't in the LLM's eyes. LACE found:

"LLMs can struggle to draw clean boundaries between closely related categories. This can lead to situations where multiple criteria are marked as failed, even when only one is clearly at fault."

LACE's stance: don't enforce rigid separation — the rationale output lets you read which criterion the judge considers the primary cause, which is usually what matters downstream.
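
One way to operationalize that stance is a follow-up triage step rather than tighter criterion definitions. The helper below is a sketch (the function name and prompt wording are assumptions) that reuses the hypothetical `CriterionResult` from the earlier example: when several related criteria fail together, it packages their rationales into a question about the primary cause, for a follow-up judge call or a human reviewer.

```python
# Sketch: surface overlapping failures with their rationales and ask which one
# is the primary cause, instead of forcing rigidly disjoint criterion definitions.
def overlap_triage_prompt(failed: list["CriterionResult"]) -> str:
    listing = "\n".join(f"- {c.name}: {c.rationale}" for c in failed)
    return (
        "Several closely related criteria failed for the same session:\n"
        f"{listing}\n"
        "Which single criterion is the primary cause of the failure? "
        "Name it and justify in one sentence."
    )
```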

Tradeoffs in dimension design

  • Too few dimensions miss specific failure modes (a single "quality" dimension can't distinguish hallucination from rudeness).
  • Too many dimensions amplify criteria overlap and the cost of each evaluation (more criteria = more judge calls or a longer prompt).
  • Domain-specific dimensions (Compliance matters for customer support; Localisation matters for i18n; Safety matters for agents with tool access) should replace, not supplement, generic ones once the specific version is mature.
