
PATTERN · Cited by 1 source

Self-reflection LLM evaluation

Intent

Improve an LLM judge's verdict by running two sequential passes with the same agent:

  1. Pass 1 — Initial evaluation. Score the item against the criterion, same as direct prompting.
  2. Pass 2 — Self-reflection. The LLM inspects its own Pass 1 output, identifies potential biases or gaps, and refines its verdict.

The mechanism is meant to let the model "catch and correct its own limitations" and deliver more nuanced assessments than a single pass would produce (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm, citing Madaan et al. 2022 arXiv:2212.09561, Jang 2023, Madaan et al. 2024 arXiv:2406.10400).
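
A minimal sketch of the two passes, assuming a provider-agnostic `call_llm` callable; the prompts, the 1–5 scale, and the reply format are illustrative, not taken from the Instacart post:

```python
# Minimal sketch of the two-pass judge, assuming a provider-agnostic
# `call_llm` callable. Prompts, the 1-5 scale, and the reply format are
# illustrative assumptions, not Instacart's production prompts.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: int       # e.g. 1-5 against the criterion
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    # Assumes the model replied "score: <n>\nrationale: <text>" as prompted.
    lines = raw.strip().splitlines()
    score = int(lines[0].split(":", 1)[1].strip())
    rationale = lines[1].split(":", 1)[1].strip() if len(lines) > 1 else ""
    return Verdict(score, rationale)


def self_reflection_judge(
    call_llm: Callable[[str], str],  # one prompt in, completion text out
    item: str,                       # the transcript / answer being judged
    criterion: str,                  # what the judge is scoring
) -> Verdict:
    reply_format = "Reply as:\nscore: <1-5>\nrationale: <one paragraph>"

    # Pass 1 - initial evaluation, identical to direct prompting.
    pass1 = call_llm(
        f"Evaluate the item below against this criterion: {criterion}\n"
        f"Item:\n{item}\n{reply_format}"
    )

    # Pass 2 - the same model reflects on its own verdict and may revise it.
    pass2 = call_llm(
        f"You previously evaluated an item against: {criterion}\n"
        f"Item:\n{item}\n"
        f"Your previous verdict:\n{pass1}\n"
        "Reflect on that verdict: did you over-weight tone, length, or recency, "
        "or miss evidence in the item? Then give a final, possibly revised verdict.\n"
        f"{reply_format}"
    )
    return parse_verdict(pass2)
```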

Structure

         Input
           │
    ┌──────┴───────┐
    │ LLM judge    │  initial score + rationale
    │ (pass 1)     │
    └──────┬───────┘
           │
    ┌──────┴───────┐
    │ LLM judge    │  reflects on pass 1 →
    │ (pass 2:     │  identifies bias / gaps →
    │  reflection) │  revised score + rationale
    └──────┬───────┘
           │
     Final verdict

Unlike patterns/multi-agent-debate-evaluation, there's only one agent — the same LLM examines its own prior output. No role specialisation, no adversarial structure.

When to use

  • Intermediate complexity criteria where direct prompting misses nuance but multi-agent debate's 3× cost is unjustified.
  • When the Pass-1 failure mode is one of the typical biases (recency bias, length bias, confident-but-wrong verdicts) that a reflection prompt can surface without needing a separate adversary.
  • As a cost-bounded upgrade path from direct prompting before committing to debate (see the sketch after this list).
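
A hypothetical dispatch sketch for that upgrade path; the criterion names and judge stubs are assumptions, not LACE's configuration:

```python
# Hypothetical per-criterion registry for the upgrade path: every criterion
# starts on the direct judge and is promoted only when review shows the
# cheaper judge missing nuance. Criterion names and judge stubs are
# illustrative assumptions, not LACE's configuration.
from typing import Callable, Dict

Judge = Callable[[str], dict]  # item -> {"score": ..., "rationale": ...}


def direct_judge(item: str) -> dict: ...        # 1 pass, cheapest
def reflection_judge(item: str) -> dict: ...    # 2 passes, ~2x cost
def debate_judge(item: str) -> dict: ...        # 3 agents, ~3x cost


JUDGES: Dict[str, Judge] = {
    "formatting": direct_judge,            # simple, objective
    "tone": reflection_judge,              # nuanced, self-correctable
    "resolution_accuracy": debate_judge,   # context-dependent, subjective
}


def evaluate(criterion: str, item: str) -> dict:
    # Fall back to the cheapest judge for criteria not yet promoted.
    return JUDGES.get(criterion, direct_judge)(item)
```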

LACE's verdict

Instacart's LACE benchmarked reflection against direct prompting and debate. They picked debate as the production default for context-dependent and subjective criteria, which implies reflection was not sufficient on its own for LACE's hardest criteria. The post doesn't publish a reflection-vs-debate win-rate matrix but positions debate as "highly effective" for the context-dependent cases.

This is consistent with the broader pattern: self-reflection helps with first-pass errors the LLM itself can identify. It's weaker than debate when the error requires a different perspective — e.g. Instacart's "shopper used a company-authorized card, not the user's digital wallet" requires a reader who actively pushes back on the chatbot's reasoning, which Customer / Support roles supply explicitly.

Comparison

Pattern                                       Agents  Passes                  Cost     Strength
Direct prompting                              1       1                       cheap    baseline
Self-reflection                               1       2                       ~2×      catches first-pass biases the LLM can self-identify
patterns/multi-agent-debate-evaluation        3       3 (2 parallel + judge)  ~3×      different-perspective errors; adversarial surfacing
patterns/drafter-evaluator-refinement-loop    2+      iterative               3×–10×   refines the artefact, not the judgement

Tradeoffs / gotchas

  • Reflection can amplify biases instead of fixing them. If the LLM's Pass-1 bias is systematic (e.g. always lenient), a reflection pass prompted generically ("are there any biases?") may rationalise rather than correct. Reflection prompts need to be specific ("consider whether you over-weighted tone over correctness") to avoid confabulation; see the prompt sketch after this list.
  • 2× latency + cost vs. direct prompting, with no parallelism opportunity (Pass 2 depends on Pass 1).
  • On hard multi-turn reasoning tasks, debate dominates. If the Pass-1 error requires a counterargument the single LLM isn't inclined to make on its own, reflection won't reach it.
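
To illustrate the first gotcha, here are a generic and a specific reflection prompt side by side; both are illustrative, not Instacart's production prompts:

```python
# Two illustrative reflection prompts (assumptions, not Instacart's):
# the generic one invites rationalisation; the specific one names the
# biases the judge should check for.
GENERIC_REFLECTION = (
    "Review your previous verdict. Are there any biases? Revise if needed."
)

SPECIFIC_REFLECTION = (
    "Review your previous verdict with these checks:\n"
    "1. Did you over-weight the response's tone or length over factual correctness?\n"
    "2. Did you rely on the chatbot's own claims instead of evidence in the transcript?\n"
    "3. Re-read the criterion: does every sentence of your rationale address it?\n"
    "Give a final score, revised only if a check failed."
)
```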

Pairs well with

Seen in
