

Credibility scoring rubric

Definition

A credibility scoring rubric is a named, fixed scoring scale — typically five bands — used by a critic agent to annotate peer-agent findings with numeric credibility scores grounded in explicit evidence criteria rather than free-form confidence language.

The rubric converts LLM-as-judge output from a binary pass/fail verdict into a continuous score drawn from a small, fixed vocabulary of bands, which lets downstream consumers (other agents, aggregators, human auditors) reason probabilistically: "this finding is worth pursuing, that finding is not yet trustworthy, this finding is corroborated enough to go in the final report."

Canonicalised by Slack's Security Engineering team in their Managing context in long-run agentic applications post (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications), with a disclosed distribution over 170,000 reviewed findings — the largest empirical credibility-scoring disclosure the wiki has.

Slack's five-level rubric (verbatim)

Score     Label             Criteria                                                        % of 170,000 findings
0.9-1.0   Trustworthy       Supported by multiple sources with no contradictory indicators  37.7%
0.7-0.89  Highly-plausible  Corroborated by a single source                                 25.4%
0.5-0.69  Plausible         Mixed evidence support                                          11.1%
0.3-0.49  Speculative       Poor evidence support                                           10.4%
0.0-0.29  Misguided         No evidence provided or misinterpreted                          15.4%
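
Read literally, the five bands form a simple lookup from a continuous score in [0, 1] to a band label. The sketch below encodes that reading; the names (RUBRIC, band_label) are illustrative, not from Slack's post.

```python
# Hypothetical encoding of Slack's five bands as (lower bound, label).
# Membership is by lower bound; each band's upper bound is implied by
# the band above it.
RUBRIC = [
    (0.9, "Trustworthy"),
    (0.7, "Highly-plausible"),
    (0.5, "Plausible"),
    (0.3, "Speculative"),
    (0.0, "Misguided"),
]

def band_label(score: float) -> str:
    """Return the rubric band for a continuous credibility score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for lower, label in RUBRIC:
        if score >= lower:
            return label
    return "Misguided"  # unreachable: the 0.0 bound catches everything

print(band_label(0.85))  # Highly-plausible
print(band_label(0.42))  # Speculative
```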

Why a rubric rather than free-form confidence

1. Free-form confidence language is uncalibrated

LLMs asked "how confident are you?" produce language like "highly confident", "fairly sure", "not certain". These phrases have no fixed numeric mapping across runs and no comparable semantics across findings. A rubric forces the Critic to pick one of five bands with explicit evidence criteria.

2. Downstream consumers need a membership test

The Timeline task's first consolidation rule is "include only events supported by credible citations — speculation doesn't belong on the Timeline". This is a membership test: is this finding above the plausibility threshold (0.5) or below? A rubric gives this question a precise answer; "fairly confident" does not.
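
The membership test amounts to a one-line filter over scored findings. The finding records and threshold name below are assumptions for illustration; only the 0.5 operating point comes from the source.

```python
# Hypothetical finding records: (finding text, Critic credibility score).
findings = [
    ("attacker pivoted via service account", 0.92),
    ("exfiltration began before detection", 0.61),
    ("malware was nation-state authored", 0.35),
]

PLAUSIBILITY_THRESHOLD = 0.5  # the rubric's disclosed operating point

# Membership test: only findings at or above plausibility reach the Timeline.
timeline = [text for text, score in findings if score >= PLAUSIBILITY_THRESHOLD]
print(timeline)
```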

3. Rubric bands encode the evidence model

Slack's bands are not "90% sure" vs. "80% sure". They are:

  • 0.9+: multi-source corroboration, no contradictions
  • 0.7+: single-source corroboration
  • 0.5+: mixed evidence
  • 0.3+: poor evidence
  • 0.0+: no evidence / misinterpreted

The rubric encodes a simple evidence model (multi-source / single-source / mixed / poor / none). The numeric score is a shorthand for this model, not an expression of probabilistic belief.
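
Read this way, a band is derived from observable evidence features rather than asserted as a belief. A minimal sketch of that evidence model, with feature names that are my assumptions (the real Critic is an LLM applying the criteria in prose, not a rule table):

```python
def score_band(n_corroborating_sources: int, has_contradictions: bool,
               evidence_provided: bool) -> tuple[float, str]:
    """Map observable evidence features to a (band midpoint, label) pair.

    A sketch of the evidence model implied by Slack's bands; feature
    names and midpoints are illustrative assumptions.
    """
    if not evidence_provided:
        return 0.15, "Misguided"          # no evidence / misinterpreted
    if has_contradictions:
        return 0.6, "Plausible"           # mixed evidence support
    if n_corroborating_sources >= 2:
        return 0.95, "Trustworthy"        # multi-source, no contradictions
    if n_corroborating_sources == 1:
        return 0.8, "Highly-plausible"    # single-source corroboration
    return 0.4, "Speculative"             # evidence offered but poor

print(score_band(2, False, True))  # (0.95, 'Trustworthy')
```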

The disclosed distribution is canonical

The 170,000-finding distribution is load-bearing evidence for several claims:

  • Sub-plausibility rate: 25.8% (10.4% Speculative + 15.4% Misguided). This is the canonical empirical support for "the Critic catches real problems" — without the Critic, roughly one finding in four would reach the Director as authoritative.
  • 15.4% Misguided is surprisingly high. "No evidence provided or misinterpreted" — this bucket captures Expert hallucinations and methodology errors. The double-digit share is the canonical evidence that hallucinations are routine, not rare, in Expert agents under production workloads.
  • Trust pyramid: 63.1% above plausibility / 11.1% at plausibility / 25.8% below. The Critic is not primarily filtering a noisy stream — most findings pass — but its filter on the bottom quartile is load-bearing.
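
The derived figures above follow directly from the disclosed percentages, which can be checked with a few lines of arithmetic:

```python
# Slack's disclosed distribution over 170,000 reviewed findings (percent).
distribution = {
    "Trustworthy": 37.7,
    "Highly-plausible": 25.4,
    "Plausible": 11.1,
    "Speculative": 10.4,
    "Misguided": 15.4,
}

assert round(sum(distribution.values()), 1) == 100.0

above = distribution["Trustworthy"] + distribution["Highly-plausible"]  # 63.1
below = distribution["Speculative"] + distribution["Misguided"]         # 25.8
print(f"above plausibility: {above:.1f}%  below: {below:.1f}%")
# roughly one finding in four falls below the plausibility threshold
```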

Design mechanics

Three design choices co-occur in Slack's rubric:

1. Numeric-range bands, not integer levels

The rubric uses numeric ranges (0.9-1.0, 0.7-0.89, …) rather than integer levels (5, 4, 3, 2, 1). This lets the Critic distinguish "0.85" from "0.78" without requiring a new level. The labels are rubric-as-taxonomy; the numbers are rubric-as-score.

2. Evidence criteria, not belief-state language

Each band's criteria are about observable evidence (number of sources, presence of contradictions, interpretation quality), not about the Critic's internal belief state. This aids calibration: two critics presented with the same evidence should arrive at similar scores.

3. Pass/fail threshold is explicit

The plausibility threshold (0.5) is the membership test for downstream consumers (Timeline inclusion). Below-0.5 findings are filtered out of the narrative. This is a rubric with a canonical operating point, not just a scoring scale.

Where to sit the Critic in the model pyramid

Slack discloses three mitigations against Critic-hallucination in rubric application:

  1. Stronger model tier. The Critic runs on mid-tier (higher-cost) models. "Because the Critic only reviews submitted findings rather than the entire Expert run, the number of tokens required is kept within reasonable limits. While stronger models are still subject to hallucination, research suggests they err less frequently." (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications, citing [arxiv 2411.04368].)
  2. Narrow instructions. "The agent is instructed to only make a judgement on the submitted findings" — not to generate new findings, not to speculate about what Experts missed, just to score what's in front of it.
  3. Downstream coherence check. The Timeline task re-examines findings for narrative consistency; rubric-passed findings can still be pruned if they break the story.

What the rubric does not disclose

Slack's disclosure is the distribution — it does not show the per-dimension breakdown inside each band. Open questions:

  • Multi-dimensional scoring. Does the Critic score each finding on (evidence strength × source quality × interpretation defensibility × methodology soundness) and combine, or does it produce one score directly?
  • Inter-rater reliability. Has Slack measured the Critic's agreement with human auditors on the same finding? No published κ coefficient.
  • Calibration drift. Do rubric applications drift over time as model versions change? No disclosed re-calibration process.
  • Rubric version history. Is this the rubric Slack launched with, or a refined version? Has the plausibility threshold moved?

Contrasts

  • vs. concepts/llm-as-judge pass/fail — classic LLM-as-judge produces binary verdicts. Credibility scoring is continuous + band-labelled, enabling probabilistic downstream reasoning.
  • vs. concepts/llm-hallucination self-evaluation — asking an agent to rate its own confidence is subject to concepts/self-approval-bias. The credibility rubric is applied by a separate critic, closing that loop.
  • vs. reward models in RLHF — reward models score outputs on reward gradients for training. Credibility scoring rubrics score outputs on evidence gradients for downstream operational decisions.
  • vs. human-auditor scoring — a human auditor can operate on this rubric, but the disclosed 170,000-finding corpus is machine-scored. The rubric is designed to let both humans and machines apply it.
