Binary vs. graded LLM scoring¶
Definition¶
Binary scoring asks an LLM judge for a True/False verdict per criterion; session-level or item-level holistic scores are then aggregated from many binary verdicts. Graded scoring asks for a rating on a numeric scale (1–5, 1–10, Likert).
Instacart's LACE team explicitly benchmarked both and chose binary for their chatbot-evaluation rubric (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
Why binary wins (per Instacart's LACE)¶
"Our experiments also revealed binary scoring to be more effective than granular scales. While a 1–10 scale might seem more precise, binary evaluations provide greater consistency, simplicity, and alignment with human judgment. This streamlined approach requires less extensive prompt engineering while maintaining robust performance — a lesson that reinforced our philosophy of seeking practical effectiveness over theoretical perfection."
Three mechanisms:
- Consistency. LLM judges (and humans) drift between "7/10" and "8/10" run-to-run on the same input. True/False verdicts drift far less: the judge decides only whether the criterion is satisfied, not where on a scale it falls.
- Prompt-engineering cost. A graded scale requires the judge's prompt to anchor each point on the scale with enough detail for comparable scoring across runs and evaluators. Binary needs only a clear yes/no definition per criterion.
- Human alignment. Human raters agree more reliably on True/False than on a specific integer on a multi-point scale.
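The consistency and prompt-cost points can be made concrete. A minimal sketch of a binary-criterion judge prompt and its parser, assuming a plain-text output format (verdict on the first line, rationale after); the template and function names are illustrative, not Instacart's actual prompts:

```python
# Hypothetical binary-judge prompt: the whole anchoring burden is one
# yes/no definition, versus describing all ten points of a graded scale.
BINARY_TEMPLATE = """You are evaluating a support-chatbot response.
Criterion: {criterion}
Answer with exactly one word, True or False, on the first line,
then a one-line rationale on the next line.
Response to evaluate:
{response}"""


def parse_binary_verdict(raw: str) -> tuple[bool, str]:
    """Parse judge output shaped 'True\\n<rationale>' into (verdict, rationale)."""
    head, _, rationale = raw.partition("\n")
    verdict = head.strip().lower().startswith("true")
    return verdict, rationale.strip()
```

Parsing is correspondingly trivial: there is no "did the judge mean 7 or 8?" ambiguity to reconcile, only a word match on the first line.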
Reconciling binary per-criterion with holistic judgement¶
Binary scoring per criterion does not mean the system emits a single bit per session. LACE:
- Emits a True/False per criterion across 5 dimensions (many criteria per dimension).
- Emits a free-form rationale per criterion — preserves the information a graded scale was trying to capture ("how badly was this wrong?") without asking the LLM to emit a noisy number.
- Aggregates per-criterion verdicts into a session-level holistic score (how many criteria passed).
The rationale is the escape valve. Graded numbers are the wrong shape for judgement — LLMs can reason about why more reliably than they can calibrate how much.
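The verdict-plus-rationale shape above can be sketched as a small data structure with pass-fraction aggregation; the field names and the aggregation rule are assumptions for illustration, not LACE's actual schema:

```python
from dataclasses import dataclass


@dataclass
class CriterionVerdict:
    dimension: str   # e.g. one of the 5 dimensions, such as factual correctness
    criterion: str   # the specific yes/no question put to the judge
    passed: bool     # binary verdict
    rationale: str   # free-form "why"; carries what a graded number was trying to say


def session_score(verdicts: list[CriterionVerdict]) -> float:
    """Holistic session-level score: fraction of criteria that passed."""
    return sum(v.passed for v in verdicts) / len(verdicts)
```

The numeric gradient reappears at the session level (how many criteria passed) while each individual judgement stays a clean bit plus prose.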
Tradeoffs¶
- Binary hides gradient information at the criterion level. "Almost-correct-but-wrong-brand" and "completely hallucinated" both score False on factual correctness. If you need fine-grained severity, supplement with a second criterion ("severe enough to escalate?") rather than switching to a graded scale — see also OEC-style single-number decision metrics aggregated from binary criteria.
- Criteria overlap bites harder. When two binary criteria are closely related (e.g. contextual relevancy and factual correctness), the judge may mark both False when only one is clearly at fault. LACE's stance: "prioritize identifying the primary issue affecting the interaction" rather than enforcing rigid separation; rationale output makes the primary cause auditable.
- Subjective criteria resist both. For axes like conciseness where preferences vary, neither binary nor graded scoring aligns with humans. LACE keeps subjective criteria only as a directional regression check and spends effort on the chatbot rather than on refining the judge for ambiguous criteria.
- Graded scales still have their place for calibration and ranking tasks — "of these N candidate responses, which is best?" — which is a different decision shape than per-criterion pass/fail rubrics. See the Dropbox Dash relevance-judge for a graded-relevance contrast case.
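The "second criterion instead of a graded scale" workaround from the first tradeoff can be sketched as follows; the criterion names and the escalation rule are hypothetical, chosen only to show the pattern:

```python
# Severity as its own binary criterion: instead of asking the judge to
# calibrate "how bad is this on a 1-10 scale?", ask a second yes/no
# question and combine the bits deterministically.

def needs_escalation(verdicts: dict[str, bool]) -> bool:
    """Escalate only when the response is factually wrong AND the judge
    separately marks the failure severe -- two binary verdicts stand in
    for one noisy graded severity number."""
    return (not verdicts["factually_correct"]) and verdicts["severe_failure"]
```

This keeps "almost-correct-but-wrong-brand" distinguishable from "completely hallucinated" without ever asking the LLM to emit an integer.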
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — Instacart LACE's explicit binary choice + "a lesson that reinforced our philosophy of seeking practical effectiveness over theoretical perfection."