
CONCEPT Cited by 2 sources

Weakly-adversarial critic

Definition

A weakly-adversarial critic is an agent positioned to audit peer agents' work for hallucinations, analytical gaps, and interpretation variability — pointed at them, not cooperating with them — but not fully adversarial: it shares the task's goal and is rewarded for improving the overall outcome, not for winning against its peers.

The stance is architecturally distinct from three adjacent roles:

| Stance | Who wins | Shared goal? |
| --- | --- | --- |
| Cooperative helper | Everyone together | Yes — critic helps peer succeed |
| Weakly adversarial critic | The task, via the critic catching errors | Yes — critic and peer share outcome, but critic scores peer's work |
| Fully adversarial (red team) | The critic, if it finds a flaw | No — critic may be rewarded for finding flaws regardless of task outcome |

The "weakly" qualifier matters: without it, the critic's incentive drifts toward finding-flaws-at-all-costs (paranoia, false-positive-heavy output); without the adversarial element, the critic slides into "looks good to me" collaborative approval. See concepts/self-approval-bias for the failure mode the stance is designed to prevent.

Named and canonicalised in Slack's Streamlining security investigations with agents post (Source: sources/2025-12-01-slack-streamlining-security-investigations-with-agents).

Slack's verbatim framing

"The weakly adversarial relationship between the Critic and the expert group helps to mitigate against hallucinations and variability in the interpretation of evidence."

(Source: sources/2025-12-01-slack-streamlining-security-investigations-with-agents)

The Critic's job is to "assess and quantify the quality of findings made by domain experts using a rubric we've defined" and to "annotate the experts' findings with its own analysis and a credibility score for each finding." Slack calls the Critic a "meta-expert".
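Slack doesn't disclose the Critic's output schema, but the described behaviour — annotate each expert finding with its own analysis plus a credibility score — implies a shape roughly like this sketch. All type and field names here are assumptions, not Slack's:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    # A domain expert's claim about the evidence (hypothetical schema).
    expert: str
    claim: str

@dataclass
class CriticAnnotation:
    # The Critic's meta-analysis attached to one expert finding.
    finding: Finding
    analysis: str       # the Critic's own reasoning about the claim
    credibility: float  # 0.0-1.0 score against the (undisclosed) rubric

# The Critic annotates rather than replaces the expert's finding,
# mirroring the worked example later in this page.
f = Finding(expert="process-ancestry", claim="credential handling is secure")
a = CriticAnnotation(finding=f,
                     analysis="ancestry shows a credential exposure the expert missed",
                     credibility=0.2)
```

The key design point the sketch captures: the expert's original finding survives intact, and the Critic's score and analysis ride alongside it as metadata.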

Why weakly and not fully adversarial

Too cooperative fails: self-approval bias

A single-model pipeline that generates and evaluates its own output hits concepts/self-approval-bias — the model is disproportionately likely to approve its own generations. Even with a separate cooperative-stance critic, if the critic is prompted to be "helpful" to the peer, it inherits a confirmation-oriented tone; hallucinations slip through.

Too adversarial fails: false-positive paranoia

A fully adversarial (red-team) critic is rewarded for finding problems. Applied to a production investigation stream, this produces:

  • High false-positive rate — "might be X" findings escalated as if confirmed.
  • Analysis paralysis — the investigation can't progress because the critic always finds more doubts.
  • Reader fatigue — the Director (or a human) receiving the critic's output stops taking it seriously.

Weakly adversarial is the pragmatic middle

The critic shares the investigation's goal of "reach a defensible conclusion" (so it has skin in the outcome) but is oriented toward auditing the method, not the goal. It asks "is this finding well-supported by the evidence?", not "is the peer my ally?" or "can I find any reason to reject this?"
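The stance described above can be made concrete as prompt text. The wording below is illustrative only — Slack's actual prompts are undisclosed:

```python
# Hypothetical system-prompt sketch encoding the weakly-adversarial stance:
# shared goal, audited method, graded output.
WEAKLY_ADVERSARIAL_PROMPT = (
    "You share the investigation's goal of reaching a defensible conclusion. "
    "Audit the expert's method, not the goal: for each finding, ask whether "
    "it is well-supported by the cited evidence. Do not approve findings to "
    "be agreeable, and do not manufacture doubts to appear thorough. "
    "Emit a credibility score per finding; do not issue a pass/fail verdict."
)
```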

Canonical emergent behaviour

Slack discloses the canonical payoff in an edited worked example: an Expert reviewed a process-ancestry chain and "incorrectly assessed credential handling as secure". The Critic noticed a credential exposure in the ancestry that the Expert missed, flagged it with its own analysis, and the Director pivoted the investigation to focus on this issue.

Verbatim: "What is notable about this result is that the expert did not raise the credential exposure in its findings; the Critic noticed it as part of its meta-analysis of the expert's work." (Source: sources/2025-12-01-slack-streamlining-security-investigations-with-agents)

This is the "Critic catches what Expert missed" payoff pattern — the operational evidence that the weakly-adversarial stance produces real value, not just ceremony.

Design mechanics

To implement a weakly-adversarial critic, three mechanisms tend to co-occur:

  1. Separate model invocation. The critic is a separate invocation, not the same model asked to self-evaluate. See patterns/one-model-invocation-per-task.
  2. Rubric-driven output. The critic scores against an explicit rubric (dimensions, scale, pass/fail criteria) rather than free-form "looks good". Rubric shape forces structured engagement.
  3. Credibility scores, not binary pass/fail. The critic emits a credibility score per finding (not a pass/fail verdict). This lets downstream consumers (the Director) reason probabilistically rather than binary-gating.
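The three mechanisms can be sketched in one small function. This is a minimal illustration, assuming a generic LLM client passed in as a callable; the rubric dimensions are invented for the example, since Slack's rubric is undisclosed:

```python
# Rubric-driven prompt (mechanism 2): illustrative dimensions, not Slack's.
RUBRIC = """Score the finding 0.0-1.0 overall, considering:
- evidence_support: is every claim tied to cited evidence?
- completeness: did the expert miss anything in the same evidence?
Return JSON: {"analysis": "...", "credibility": <float>}"""

def critique(finding: str, evidence: str, invoke_model) -> dict:
    # Mechanism 1: a separate invocation -- the critic never evaluates its
    # own generations, only the peer expert's finding plus the raw evidence.
    prompt = f"{RUBRIC}\n\nEvidence:\n{evidence}\n\nExpert finding:\n{finding}"
    # Mechanism 3: the returned credibility is a graded score, not a
    # pass/fail gate, so a downstream Director can reason probabilistically.
    return invoke_model(prompt)

# Stubbed model call for illustration; a real client would parse JSON
# out of the model's response here.
result = critique("credential handling is secure",
                  "process ancestry shows a token passed via argv",
                  lambda p: {"analysis": "argv exposure missed", "credibility": 0.2})
```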

Contrasts

  • vs. LLM-as-judge (pass/fail) — concepts/llm-as-judge is often a binary pass/fail evaluator. A weakly-adversarial critic is a scorer + annotator, not a gate.
  • vs. multi-round critic-fixer loops — patterns/multi-round-critic-quality-gate runs critics in rounds with fixer agents between them; the critic is still weakly-adversarial in stance but operates in a write-review-revise loop. The Slack Spear variant runs the critic once per round, and the Director decides what to do with the critique (pivot, continue, conclude) — orchestration logic is apex, not middle-tier.
  • vs. drafter-evaluator refinement — patterns/drafter-evaluator-refinement-loop pairs a drafter with an evaluator and retries on failure. The Slack variant is higher-level: the evaluator (Critic) doesn't gate the output; it augments it with metadata that a third agent (Director) interprets.
  • vs. specialised sub-reviewer agents — patterns/specialized-reviewer-agents decomposes the review surface by domain (security, perf, code quality); the weakly-adversarial stance is orthogonal and applies within each reviewer.

Where the weakness matters

Pushing the critic toward fully adversarial (stronger prompts for skepticism, a bias toward flagging) is a tuning knob. Production systems will land somewhere on the cooperative-to-adversarial spectrum; "weakly" is Slack's chosen point on that spectrum, not an inherent property of the pattern. Teams calibrating their own critic need to check:

  • False-positive rate — is the critic over-reporting non-issues?
  • False-negative rate — is the critic approving findings it should have challenged?
  • Timeliness — is critic latency acceptable for the loop's cadence?
  • Tier compatibility — does the critic tier in the knowledge pyramid have enough capacity to catch the Experts' mistakes? Pushing the Critic to too-cheap a tier breaks the stance's premise.
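The first two checks above require a hand-labelled audit sample: take a batch of critic verdicts, have a human mark which findings were genuinely flawed, and compute the two rates. A minimal sketch, with a hypothetical `calibration` helper:

```python
def calibration(sample):
    # sample: list of (critic_flagged, truly_flawed) pairs from a
    # hand-labelled audit of the critic's verdicts.
    flagged = [flawed for did_flag, flawed in sample if did_flag]
    approved = [flawed for did_flag, flawed in sample if not did_flag]
    return {
        # Share of flagged findings that were actually fine (over-reporting).
        "false_positive_rate": flagged.count(False) / len(flagged) if flagged else 0.0,
        # Share of approved findings that were actually flawed (missed challenges).
        "false_negative_rate": approved.count(True) / len(approved) if approved else 0.0,
    }

# Toy sample: one correct flag, one spurious flag, one correct
# approval, one missed flaw.
rates = calibration([(True, True), (True, False), (False, False), (False, True)])
```

A critic tuned too adversarial drives the first rate up; one tuned too cooperative drives the second up — the two rates locate the system on the spectrum.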

Caveats

  • Stance is asserted, not measured. Slack's claim that the Critic/Expert dynamic is "weakly adversarial" is not validated with disclosed metrics (hallucination-rate reduction, false-positive rate, etc.).
  • Rubric is opaque. The specific rubric Slack's Critic uses is not disclosed.
  • Model-family choice matters. The post doesn't disclose whether the Expert and Critic run different model families; cross-family criticism tends to catch more (parallels patterns/multi-round-critic-quality-gate's caveat on critic model-family choice).
  • Weakness is a dial, not a discrete state. The stance sits on a continuum; teams will tune toward one end or the other based on their false-positive/false-negative tolerance.

Empirical support (2026-04-13 update)

Slack's second post in the Spear series disclosed the distribution of 170,000 findings across the Critic's 5-level credibility rubric (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications):

| Score band | Label | % |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | 37.7% |
| 0.7-0.89 | Highly-plausible | 25.4% |
| 0.5-0.69 | Plausible | 11.1% |
| 0.3-0.49 | Speculative | 10.4% |
| 0.0-0.29 | Misguided | 15.4% |
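The sub-plausibility figure quoted below falls out of the disclosed percentages directly. The band labels are Slack's; the dict layout is mine:

```python
# Disclosed distribution of 170,000 findings across the five credibility
# bands (percent of total).
distribution = {
    "Trustworthy": 37.7,
    "Highly-plausible": 25.4,
    "Plausible": 11.1,
    "Speculative": 10.4,
    "Misguided": 15.4,
}

# Findings scored below the Plausible band (< 0.5).
sub_plausible = distribution["Speculative"] + distribution["Misguided"]
print(f"sub-plausibility rate: {sub_plausible:.1f}%")  # prints 25.8%
```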

Sub-plausibility rate: 25.8% (Speculative + Misguided combined). This is canonical empirical support for the weakly-adversarial stance's value — without the Critic, roughly one finding in four would reach the Director looking as authoritative as the trustworthy findings. An over-cooperative critic would be unlikely to produce a sub-plausibility fraction this large, which suggests the stance is adversarial enough in practice. See concepts/credibility-scoring-rubric for the full rubric + distribution discussion.
