
PATTERN · Cited by 1 source

Self-reflection LLM evaluation

Intent

Improve an LLM judge's verdict by running two sequential passes with the same agent:

  1. Pass 1 — Initial evaluation. Score the item against the criterion, same as direct prompting.
  2. Pass 2 — Self-reflection. The LLM inspects its own Pass 1 output, identifies potential biases or gaps, and refines its verdict.

The mechanism is meant to let the model "catch and correct its own limitations" and deliver more nuanced assessments than a single pass would produce (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm, citing Madaan et al. 2022 arXiv:2212.09561, Jang 2023, Madaan et al. 2024 arXiv:2406.10400).
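
A minimal sketch of the two passes, assuming a provider-agnostic `call_llm` callable; the prompts, the 1–5 scale, and the reply format are illustrative, not taken from the Instacart post:

```python
# Minimal sketch of the two-pass judge, assuming a provider-agnostic
# `call_llm` callable. Prompts, the 1-5 scale, and the reply format are
# illustrative assumptions, not Instacart's production prompts.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: int       # e.g. 1-5 against the criterion
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    # Assumes the model replied "score: <n>\nrationale: <text>" as prompted.
    lines = raw.strip().splitlines()
    score = int(lines[0].split(":", 1)[1].strip())
    rationale = lines[1].split(":", 1)[1].strip() if len(lines) > 1 else ""
    return Verdict(score, rationale)


def self_reflection_judge(
    call_llm: Callable[[str], str],  # one prompt in, completion text out
    item: str,                       # the transcript / answer being judged
    criterion: str,                  # what the judge is scoring
) -> Verdict:
    reply_format = "Reply as:\nscore: <1-5>\nrationale: <one paragraph>"

    # Pass 1 - initial evaluation, identical to direct prompting.
    pass1 = call_llm(
        f"Evaluate the item below against this criterion: {criterion}\n"
        f"Item:\n{item}\n{reply_format}"
    )

    # Pass 2 - the same model reflects on its own verdict and may revise it.
    pass2 = call_llm(
        f"You previously evaluated an item against: {criterion}\n"
        f"Item:\n{item}\n"
        f"Your previous verdict:\n{pass1}\n"
        "Reflect on that verdict: did you over-weight tone, length, or recency, "
        "or miss evidence in the item? Then give a final, possibly revised verdict.\n"
        f"{reply_format}"
    )
    return parse_verdict(pass2)
```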

Structure

         Input
           │
    ┌──────┴───────┐
    │ LLM judge    │  initial score + rationale
    │ (pass 1)     │
    └──────┬───────┘
           │
    ┌──────┴───────┐
    │ LLM judge    │  reflects on pass 1 →
    │ (pass 2:     │  identifies bias / gaps →
    │  reflection) │  revised score + rationale
    └──────┬───────┘
           │
     Final verdict

Unlike patterns/multi-agent-debate-evaluation, there's only one agent — the same LLM examines its own prior output. No role specialisation, no adversarial structure.

When to use

  • Intermediate complexity criteria where direct prompting misses nuance but multi-agent debate's 3× cost is unjustified.
  • When the Pass-1 failure mode is one of the typical biases (recency bias, length bias, confident-but-wrong verdicts) that a reflection prompt can surface without needing a separate adversary.
  • As a cost-bounded upgrade path from direct prompting before committing to debate (see the sketch after this list).
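
A hypothetical dispatch sketch for that upgrade path; the criterion names and judge stubs are assumptions, not LACE's configuration:

```python
# Hypothetical per-criterion registry for the upgrade path: every criterion
# starts on the direct judge and is promoted only when review shows the
# cheaper judge missing nuance. Criterion names and judge stubs are
# illustrative assumptions, not LACE's configuration.
from typing import Callable, Dict

Judge = Callable[[str], dict]  # item -> {"score": ..., "rationale": ...}


def direct_judge(item: str) -> dict: ...        # 1 pass, cheapest
def reflection_judge(item: str) -> dict: ...    # 2 passes, ~2x cost
def debate_judge(item: str) -> dict: ...        # 3 agents, ~3x cost


JUDGES: Dict[str, Judge] = {
    "formatting": direct_judge,            # simple, objective
    "tone": reflection_judge,              # nuanced, self-correctable
    "resolution_accuracy": debate_judge,   # context-dependent, subjective
}


def evaluate(criterion: str, item: str) -> dict:
    # Fall back to the cheapest judge for criteria not yet promoted.
    return JUDGES.get(criterion, direct_judge)(item)
```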

LACE's verdict

Instacart's LACE benchmarked reflection against direct prompting and debate. They picked debate as the production default for context-dependent and subjective criteria, which implies reflection was not sufficient on its own for LACE's hardest criteria. The post doesn't publish a reflection-vs-debate win-rate matrix but positions debate as "highly effective" for the context-dependent cases.

This is consistent with the broader pattern: self-reflection helps with first-pass errors the LLM itself can identify. It's weaker than debate when the error requires a different perspective — e.g. Instacart's "shopper used a company-authorized card, not the user's digital wallet" requires a reader who actively pushes back on the chatbot's reasoning, which Customer / Support roles supply explicitly.

Comparison

Pattern                                       Agents  Passes                  Cost     Strength
Direct prompting                              1       1                       cheap    baseline
Self-reflection                               1       2                       ~2×      catches first-pass biases the LLM can self-identify
patterns/multi-agent-debate-evaluation        3       3 (2 parallel + judge)  ~3×      different-perspective errors; adversarial surfacing
patterns/drafter-evaluator-refinement-loop    2+      iterative               3×–10×   refines the artefact, not the judgement

Tradeoffs / gotchas

  • Reflection can amplify biases instead of fixing them. If the LLM's Pass-1 bias is systematic (e.g. always lenient), a reflection pass prompted generically ("are there any biases?") may rationalise rather than correct. Reflection prompts need to be specific ("consider whether you over-weighted tone over correctness") to avoid confabulation; see the prompt sketch after this list.
  • 2× latency + cost vs. direct prompting, with no parallelism opportunity (Pass 2 depends on Pass 1).
  • On hard multi-turn reasoning tasks, debate dominates. If the Pass-1 error requires a counterargument the single LLM isn't inclined to make on its own, reflection won't reach it.
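
To illustrate the first gotcha, here are a generic and a specific reflection prompt side by side; both are illustrative, not Instacart's production prompts:

```python
# Two illustrative reflection prompts (assumptions, not Instacart's):
# the generic one invites rationalisation; the specific one names the
# biases the judge should check for.
GENERIC_REFLECTION = (
    "Review your previous verdict. Are there any biases? Revise if needed."
)

SPECIFIC_REFLECTION = (
    "Review your previous verdict with these checks:\n"
    "1. Did you over-weight the response's tone or length over factual correctness?\n"
    "2. Did you rely on the chatbot's own claims instead of evidence in the transcript?\n"
    "3. Re-read the criterion: does every sentence of your rationale address it?\n"
    "Give a final score, revised only if a check failed."
)
```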

Pairs well with

Seen in
