
PATTERN

Multi-agent debate evaluation

Intent

Evaluate an LLM-generated or chatbot-produced output by running three LLM sub-agents in a structured debate:

  • a Customer / Critic Agent that scrutinises the output from a critical perspective,
  • a Support / Defender Agent that defends the output and its rationale,
  • a Judge Agent that reviews the original output plus both prior agents' assessments and emits an impartial verdict.

The Customer and Support agents run independently and in parallel with no access to each other's output — this avoids inter-agent bias and forces each to produce its strongest independent argument. The Judge then synthesises both assessments into the verdict.

Instacart's LACE canonicalises this pattern for chatbot evaluation (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm, citing Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate", arXiv:2305.14325).

Structure

                   Chat session
         ┌──────────────┼──────────────┐
         │              │              │
         ▼              ▼              ▼
    Customer        Support          (original
    Agent           Agent             input also
    (critical)      (defending)       fed to Judge)
         │              │              │
         │ assessment   │ assessment   │
         └──────────────┼──────────────┘
                        ▼
                   Judge Agent
                   (impartial synthesis)
                        │
                        ▼
          Per-criterion verdict + rationale

Key shape properties:

  • Parallel adversaries, no cross-talk. "The Customer and Support agents run independently and in parallel, without access to each other's assessments." Forces each side to produce its strongest independent case.
  • Judge sees everything: original input + Customer's output + Support's output, giving it the full picture to synthesise from.
  • No iteration loop. Unlike patterns/self-reflection-llm-evaluation which refines a single agent's judgment, or patterns/drafter-evaluator-refinement-loop which refines the artefact, debate is one forward pass with three parallel evaluators.
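
A minimal orchestration sketch of that shape, in Python. Everything here is illustrative (the prompt wording, the `LLM` callable type, the `debate_evaluate` name); it renders the pattern's structure, not LACE's actual implementation:

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Any async LLM call of the form (system_prompt, user_content) -> completion text.
LLM = Callable[[str, str], Awaitable[str]]

CUSTOMER_SYSTEM = (
    "You are a critical customer reviewing a support chatbot reply. "
    "Make the strongest case that the reply fails the criterion."
)
SUPPORT_SYSTEM = (
    "You are the support agent defending the reply. "
    "Make the strongest case that the reply satisfies the criterion."
)
JUDGE_SYSTEM = (
    "You are an impartial judge. Read the chat session and both assessments, "
    "reason independently, and return a verdict with a rationale."
)

@dataclass
class Verdict:
    criterion: str
    raw_judgment: str  # judge's verdict plus rationale, as returned

async def debate_evaluate(llm: LLM, chat_session: str, criterion: str) -> Verdict:
    task = f"Criterion: {criterion}\n\nChat session:\n{chat_session}"
    # Parallel adversaries, no cross-talk: neither sees the other's assessment.
    critic, defender = await asyncio.gather(
        llm(CUSTOMER_SYSTEM, task),
        llm(SUPPORT_SYSTEM, task),
    )
    # Judge sees everything: the original session plus both assessments.
    # One forward pass, no iteration loop.
    judge_input = (
        f"{task}\n\n"
        f"Customer (critic) assessment:\n{critic}\n\n"
        f"Support (defender) assessment:\n{defender}"
    )
    return Verdict(criterion, await llm(JUDGE_SYSTEM, judge_input))
```

The `asyncio.gather` over the two adversaries is the no-cross-talk property made literal: running them sequentially would change latency, not behaviour, because neither call's output feeds the other.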

When to use

  • Context-dependent criteria — judgments that require interpreting an output against non-obvious business context (Instacart's canonical example: a shopper's declined card is a company-authorized-card issue, not a customer digital-wallet issue). Debate's adversarial structure surfaces counterarguments a single-pass judge would miss.
  • Simple universal criteria — debate also wins here: LACE reports "near-perfect accuracy" on compliance / professionalism criteria.
  • Nuanced subjective criteria — debate can help but the gains drop off; LACE de-prioritises ambiguous subjective criteria regardless of engine.
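
Continuing the hypothetical sketch above, a usage example that runs one debate per criterion across the first two classes (criterion names follow the LACE examples cited in this note):

```python
async def evaluate_session(llm: LLM, chat_session: str) -> list[Verdict]:
    criteria = [
        "professionalism (Compliance)",               # simple universal
        "contextual relevancy (Answer Correctness)",  # context-dependent
    ]
    # Each criterion gets its own independent debate; the debates themselves
    # can also run in parallel.
    return list(
        await asyncio.gather(
            *(debate_evaluate(llm, chat_session, c) for c in criteria)
        )
    )
```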

Comparison with sibling patterns

| Pattern | Structure | Refinement loop | Good for |
| --- | --- | --- | --- |
| Direct prompting | one LLM, one pass | none | cheap baseline, simple criteria |
| patterns/self-reflection-llm-evaluation | one LLM, two passes (score → reflect) | self-loop on the same agent | catching first-pass bias at less cost than debate |
| Multi-agent debate | three specialised LLM roles, parallel + synthesis | no loop; structured disagreement is the mechanism | context-dependent and subjective criteria where the adversarial viewpoint is load-bearing |
| patterns/director-expert-critic-investigation-loop | directed multi-agent with a central orchestrator | iterative investigation loop | multi-step reasoning, not scoring |
| patterns/drafter-evaluator-refinement-loop | generate → evaluate → refine the artefact | refines the output, not the judgment | producing the output, not judging an existing one |

LACE-specific accuracy numbers

LACE reports (with debate as the engine):

  • Simple criteria (e.g. professionalism under Compliance): "near-perfect accuracy, precisely capturing our requirements better than alternative methods."
  • Context-dependent criteria (e.g. contextual relevancy under Answer Correctness): >90% accuracy; remaining errors "typically stem from ambiguous scenarios or gaps in the embedded knowledge base."
  • Subjective criteria: not pursued past a directional-check quality bar regardless of engine.

No head-to-head win-rate matrix vs. direct prompting or reflection is published.

Pairs well with

Tradeoffs / gotchas

  • Cost: 3× LLM calls per evaluation vs. direct prompting (one per role). For offline evaluation (LACE's case) this is fine; for real-time paths this is often unacceptable.
  • Role-prompt engineering. Each agent's role prompt has to genuinely commit it to its position, without the model softening into "as an AI I cannot adopt a critical stance." System prompts with explicit role-play and scoring criteria matter.
  • The Judge needs room to disagree. If the Judge's prompt defaults to siding with the majority or the more confident agent, you've reduced debate to weighted voting. The Judge's rubric must explicitly instruct independent reasoning over both inputs and the original chat (see the rubric sketch after this list).
  • Not a fit for ranking tasks. Debate is for pass/fail judgements per criterion. For "which of N candidates is best?" use a different shape (tournament, direct scoring with a graded rubric).
  • The "Customer" role is a scare-quoted LLM, not a real user. The post explicitly notes "all 'agents' refer to LLM-based simulated roles, not actual users or support staff." Debate is a scoring architecture, not a user study.

Seen in

  • sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm (Instacart's LACE)

Literature

  • Du et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.