

LLM evaluation dimensions

Definition

LLM evaluation dimensions are the top-level axes of a rubric used to score an LLM-powered system's output quality — each dimension decomposes into granular per-criterion True/False checks (concepts/binary-vs-graded-llm-scoring), and per-criterion scores aggregate into a holistic session / item score.

Rubric design "begins with clearly defining what 'good' looks like" (Instacart LACE). Picking the dimensions is the highest-leverage decision — changing dimensions post-hoc invalidates all accumulated evaluation data.

Instacart LACE's five dimensions (chatbot-specific)

LACE scores customer-support chatbot sessions against:

  1. Query Understanding — did the chatbot correctly interpret what the user was asking?
  2. Answer Correctness — sub-criteria: contextual relevancy, factual correctness, consistency, usefulness.
  3. Chat Efficiency — conciseness, turn count to resolution.
  4. Client Satisfaction — user-perceived helpfulness.
  5. Compliance — tone, politeness, professionalism, policy adherence.

Two load-bearing meta-properties:

  • Hierarchical. Each dimension has multiple criteria; each criterion is scored binary; aggregation goes criteria → dimension → session.
  • Rationale alongside verdict. Every criterion emits both True/False + free-form explanation — the rationale "helps surface the model's reasoning and guide future refinement."

(Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm.)
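
The hierarchical structure above maps naturally onto a small data model. The sketch below is illustrative only: the class names (`CriterionResult`, `DimensionResult`), the `session_score` helper, and the equal-weight mean aggregation are assumptions, since the LACE write-up does not publish its aggregation scheme.

```python
from dataclasses import dataclass

# Hypothetical data model for a hierarchical rubric. Names and the
# equal-weight aggregation are assumptions for illustration, not LACE's code.

@dataclass
class CriterionResult:
    name: str        # e.g. "factual correctness"
    passed: bool     # binary verdict from the LLM judge
    rationale: str   # free-form explanation emitted alongside the verdict

@dataclass
class DimensionResult:
    name: str                        # e.g. "Answer Correctness"
    criteria: list[CriterionResult]

    @property
    def score(self) -> float:
        # criteria -> dimension: fraction of binary checks that passed
        return sum(c.passed for c in self.criteria) / len(self.criteria)

def session_score(dimensions: list[DimensionResult]) -> float:
    # dimension -> session: unweighted mean; weighting is a design choice
    return sum(d.score for d in dimensions) / len(dimensions)

# Example: one dimension with two binary criteria, each carrying a rationale.
answer_correctness = DimensionResult("Answer Correctness", [
    CriterionResult("contextual relevancy", True,
                    "The reply addresses the user's refund question."),
    CriterionResult("factual correctness", False,
                    "The quoted refund window contradicts the policy in context."),
])
print(answer_correctness.score)             # 0.5
print(session_score([answer_correctness]))  # 0.5
```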

Complexity tiers within a rubric

LACE observed that evaluation criteria, across all five dimensions, group into three complexity tiers based on the judge's context requirement — a useful framing for any LLM-as-judge rubric:

| Tier | Example | Judge requirement | LACE accuracy (debate engine) |
| --- | --- | --- | --- |
| Simple | Compliance → professionalism | Universal standards; no business context | "near-perfect" |
| Context-dependent | Answer Correctness → contextual relevancy | Business-model knowledge (e.g. "shoppers use company-authorized cards, not the customer's payment method") | >90% (with static Instacart knowledge template; RAG is future work) |
| Subjective | Chat Efficiency → answer conciseness | Varying human preferences; LLMs apply stricter brevity standards than humans | Intentionally not pursued |

Tier determines how to spend iteration budget:

  • Simple: cheap wins; the debate setup plus a well-written binary criterion suffices.
  • Context-dependent: embed operational knowledge into the judge's prompt; plan for dynamic retrieval (concepts/context-engineering) as the knowledge base grows (see the prompt sketch after this list).
  • Subjective: keep as a directional check only; fix the system under evaluation, not the judge; refining ambiguous criteria is "a low-ROI path".
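
For the context-dependent tier, here is a minimal sketch of what "embed operational knowledge into the judge's prompt" can look like. The `build_judge_prompt` helper, the prompt wording, and the JSON output convention are assumptions; LACE describes a static Instacart knowledge template but does not publish it. The single business fact shown is the example quoted in the table above.

```python
# Illustrative judge-prompt builder for a context-dependent criterion.
# Everything here is a sketch: LACE's actual knowledge template and prompt
# format are not published.

BUSINESS_KNOWLEDGE = """\
- Shoppers pay with company-authorized cards, not the customer's payment method.
"""  # static template today; swap in retrieved snippets as the knowledge base grows

def build_judge_prompt(criterion: str, transcript: str) -> str:
    return (
        "You are evaluating one customer-support chatbot session.\n\n"
        "Business context to treat as ground truth:\n"
        f"{BUSINESS_KNOWLEDGE}\n"
        f"Criterion to judge (answer True or False): {criterion}\n"
        'Respond as JSON: {"passed": true|false, "rationale": "<one short paragraph>"}\n\n'
        f"Transcript:\n{transcript}"
    )
```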

Criteria overlap problem

Criteria that look orthogonal on a whiteboard often aren't in the LLM's eyes. LACE found:

"LLMs can struggle to draw clean boundaries between closely related categories. This can lead to situations where multiple criteria are marked as failed, even when only one is clearly at fault."

LACE's stance: don't enforce rigid separation — the rationale output lets you read which criterion the judge considers the primary cause, which is usually what matters downstream.
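
One way to operationalize that stance is a follow-up triage step rather than tighter criterion definitions. The helper below is a sketch (the function name and prompt wording are assumptions) that reuses the hypothetical `CriterionResult` from the earlier example: when several related criteria fail together, it packages their rationales into a question about the primary cause, for a follow-up judge call or a human reviewer.

```python
# Sketch: surface overlapping failures with their rationales and ask which one
# is the primary cause, instead of forcing rigidly disjoint criterion definitions.
def overlap_triage_prompt(failed: list["CriterionResult"]) -> str:
    listing = "\n".join(f"- {c.name}: {c.rationale}" for c in failed)
    return (
        "Several closely related criteria failed for the same session:\n"
        f"{listing}\n"
        "Which single criterion is the primary cause of the failure? "
        "Name it and justify in one sentence."
    )
```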

Tradeoffs in dimension design

  • Too few dimensions miss specific failure modes (a single "quality" dimension can't distinguish hallucination from rudeness).
  • Too many dimensions amplify criteria overlap and the cost of each evaluation (more criteria = more judge calls or a longer prompt).
  • Domain-specific dimensions (Compliance matters for customer support; Localisation matters for i18n; Safety matters for agents with tool access) should replace, not supplement, generic ones once the specific version is mature.
