PATTERN
Multi-agent debate evaluation¶
Intent¶
Evaluate an LLM-generated or chatbot-produced output by running three LLM sub-agents in a structured debate:
- a Customer / Critic Agent that scrutinises the output from a critical perspective,
- a Support / Defender Agent that defends the output and its rationale,
- a Judge Agent that reviews the original output plus both prior agents' assessments and emits an impartial verdict.
The Customer and Support agents run independently and in parallel with no access to each other's output — this avoids inter-agent bias and forces each to produce its strongest independent argument. The Judge synthesises.
Instacart's LACE canonicalises this pattern for chatbot evaluation (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm, citing Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate", arXiv:2305.14325).
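The intent above can be sketched as a small orchestrator: Customer and Support run concurrently with no access to each other's output, then the Judge receives the original session plus both assessments. This is a minimal illustration, not LACE's implementation; `call_llm` and the prompt strings are hypothetical stand-ins for a real LLM client and real role prompts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub standing in for a real LLM call; swap in your provider's client.
def call_llm(system_prompt: str, user_content: str) -> str:
    role = system_prompt.split(":")[0]
    return f"[{role}] assessment of: {user_content[:40]}"

CUSTOMER_PROMPT = "Customer/Critic: scrutinise the chatbot output for flaws."
SUPPORT_PROMPT = "Support/Defender: defend the chatbot output and its rationale."
JUDGE_PROMPT = "Judge: weigh both assessments against the original chat; give an impartial verdict."

def debate_evaluate(chat_session: str) -> str:
    # Customer and Support run in parallel with no access to each other's output.
    with ThreadPoolExecutor(max_workers=2) as pool:
        critic = pool.submit(call_llm, CUSTOMER_PROMPT, chat_session)
        defender = pool.submit(call_llm, SUPPORT_PROMPT, chat_session)
        critic_out, defender_out = critic.result(), defender.result()
    # The Judge sees everything: the original session plus both independent assessments.
    judge_input = (
        f"Original chat:\n{chat_session}\n\n"
        f"Critic assessment:\n{critic_out}\n\n"
        f"Defender assessment:\n{defender_out}"
    )
    return call_llm(JUDGE_PROMPT, judge_input)
```

Note the single forward pass: no agent is re-invoked after seeing a peer's output, which is the property that distinguishes this shape from the refinement-loop patterns below.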
Structure¶
              Chat session
                   │
     ┌─────────────┼─────────────┐
     ▼             ▼             ▼
 Customer       Support     (original input
  Agent          Agent       also fed to
(critical)    (defending)    the Judge)
     │             │             │
     │ assessment  │ assessment  │
     └─────────────┼─────────────┘
                   ▼
              Judge Agent
         (impartial synthesis)
                   │
                   ▼
   Per-criterion verdict + rationale
Key shape properties:
- Parallel adversaries, no cross-talk. "The Customer and Support agents run independently and in parallel, without access to each other's assessments." Forces each side to produce its strongest independent case.
- Judge sees everything: original input + Customer's output + Support's output. Has the full picture to synthesise from.
- No iteration loop. Unlike patterns/self-reflection-llm-evaluation which refines a single agent's judgment, or patterns/drafter-evaluator-refinement-loop which refines the artefact, debate is one forward pass with three parallel evaluators.
When to use¶
- Context-dependent criteria — judgments that require interpreting an output against non-obvious business context (Instacart's canonical example: a declined shopper card is an issue with the company-authorized card, not with the customer's digital wallet). Debate's adversarial structure surfaces counterarguments a single-pass judge would miss.
- Simple universal criteria — debate also wins here; LACE reports "near-perfect accuracy" on compliance / professionalism criteria.
- Nuanced subjective criteria — debate can help but the gains drop off; LACE de-prioritises ambiguous subjective criteria regardless of engine.
Comparison with sibling patterns¶
| Pattern | Structure | Refinement loop | Good for |
|---|---|---|---|
| Direct prompting | one LLM, one pass | none | cheap baseline, simple criteria |
| patterns/self-reflection-llm-evaluation | one LLM, two passes (score → reflect) | self-loop on the same agent | catching first-pass bias with less cost than debate |
| Multi-agent debate | three specialised LLM roles, parallel + synthesis | no loop; structured disagreement is the mechanism | context-dependent + subjective criteria where adversarial viewpoint is load-bearing |
| patterns/director-expert-critic-investigation-loop | directed multi-agent with a central orchestrator | iterative | investigation / multi-step reasoning, not scoring |
| patterns/drafter-evaluator-refinement-loop | generate → evaluate → refine the artefact | refines the output, not the judgment | producing the output, not judging an existing one |
LACE-specific accuracy numbers¶
LACE reports (with debate as the engine):
- Simple criteria (e.g. professionalism under Compliance): "near-perfect accuracy, precisely capturing our requirements better than alternative methods."
- Context-dependent criteria (e.g. contextual relevancy under Answer Correctness): >90% accuracy, with remaining errors "typically stem from ambiguous scenarios or gaps in the embedded knowledge base."
- Subjective criteria: not pursued past a directional-check quality bar regardless of engine.
No head-to-head win-rate matrix vs. direct prompting or reflection is published.
Pairs well with¶
- concepts/binary-vs-graded-llm-scoring — debate emits binary per-criterion verdicts + rationales; the Judge Agent resolves Customer vs. Support into a single True/False cleanly.
- concepts/decouple-reasoning-from-structured-output — LACE runs debate with a strong reasoning model (o1-preview) producing free-form prose, then a separate step converts to JSON; avoids the reasoning-quality loss from grammar-constrained decoding on the complex multi-agent output.
- patterns/human-aligned-criteria-refinement-loop — debate judgments get compared to human ratings; misalignments drive criteria-prompt refinement.
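The decoupling point above can be made concrete: a strong reasoning model writes an unconstrained prose verdict first, and a separate, cheaper step maps it to the structured record, so grammar-constrained decoding never touches the reasoning pass. A minimal sketch with hypothetical stubs (`reason_free_form`, `extract_json` are illustrative names, not LACE's API):

```python
# Hypothetical stub for the strong reasoning model (e.g. an o1-class model);
# in practice this is an unconstrained free-form LLM call.
def reason_free_form(debate_transcript: str) -> str:
    return ("The Customer's objection is answered by the knowledge base, "
            "so the response is contextually relevant. Verdict: PASS.")

# Separate conversion step: prose in, structured record out.
# Only this step needs to honour the JSON schema.
def extract_json(prose_verdict: str) -> dict:
    return {
        "criterion": "contextual_relevancy",
        "pass": "PASS" in prose_verdict,
        "rationale": prose_verdict,
    }

record = extract_json(reason_free_form("…debate transcript…"))
```

The design choice is that the binary `pass` field is resolved from the rationale after the fact, matching the binary per-criterion verdicts described under concepts/binary-vs-graded-llm-scoring.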
Tradeoffs / gotchas¶
- Cost: 3× LLM calls per evaluation vs. direct prompting (one per role). For offline evaluation (LACE's case) this is fine; for real-time paths this is often unacceptable.
- Role-prompt engineering. Each agent's role prompt has to genuinely commit it to its position without the LLM softening into disclaimers like "as an AI I cannot adopt a critical stance." System prompts with explicit role-play framing and scoring criteria matter.
- The Judge needs room to disagree. If the Judge's prompt defaults to siding with the majority or the more confident agent, you've reduced debate to weighted voting. The Judge's rubric must explicitly instruct independent reasoning over both inputs and the original chat.
- Not a fit for ranking tasks. Debate is for pass/fail judgements per criterion. For "which of N candidates is best?" use a different shape (tournament, direct scoring with a graded rubric).
- The "Customer" role is a scare-quoted LLM, not a real user. The post explicitly notes "all 'agents' refer to LLM-based simulated roles, not actual users or support staff." Debate is a scoring architecture, not a user study.
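The role-commitment and Judge-independence gotchas above mostly come down to prompt wording. The prompts below are illustrative assumptions (not LACE's actual prompts), showing the shape that addresses both: explicit role-play commitment for the adversaries, and an explicit instruction for the Judge to reason independently rather than vote.

```python
# Illustrative role prompts; wording is an assumption, not LACE's actual prompts.
ROLE_PROMPTS = {
    "customer": (
        "You are role-playing a critical customer. Commit fully to the critical "
        "stance: list every way the response fails each scoring criterion. "
        "Do not soften or balance your critique."
    ),
    "support": (
        "You are role-playing a support agent defending the response. Argue the "
        "strongest good-faith case that it satisfies each criterion."
    ),
    "judge": (
        "You are an impartial judge. Read the original chat and BOTH assessments. "
        "Reason independently: you may reject both agents' conclusions. Do not "
        "side with the more confident agent or default to a majority view. "
        "Emit a True/False verdict per criterion with a rationale."
    ),
}
```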
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: three-agent Customer + Support + Judge structure, parallel critics with no cross-talk, Judge sees both + original. Cites Du et al. 2023 (arXiv:2305.14325) as the literature basis for multi-agent debate.
Literature¶
- Du et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.