# Self-reflection LLM evaluation
## Intent
Improve an LLM judge's verdict by running two sequential passes with the same agent:
- Pass 1 — Initial evaluation. Score the item against the criterion, same as direct prompting.
- Pass 2 — Self-reflection. The LLM inspects its own Pass 1 output, identifies potential biases or gaps, and refines its verdict.
The mechanism is meant to let the model "catch and correct its own limitations" and deliver more nuanced assessments than a single pass would produce (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm, citing Madaan et al. 2022 arXiv:2212.09561, Jang 2023, Madaan et al. 2024 arXiv:2406.10400).
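The two passes can be sketched as a single function. This is a minimal illustration, not the source's implementation: `call_llm` is a hypothetical helper (stubbed here with canned replies), and the `{score, rationale}` JSON verdict format is an assumption.

```python
import json

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call; a production version would
    call your LLM provider. The canned replies are illustrative only."""
    if "Reflect on" in prompt:
        return json.dumps({
            "score": "fail",
            "rationale": "Pass 1 over-weighted confident tone; the claim is wrong.",
        })
    return json.dumps({"score": "pass", "rationale": "Response looks correct."})

def judge_with_reflection(item: str, criterion: str) -> dict:
    # Pass 1 -- initial evaluation, identical to direct prompting.
    pass1 = call_llm(
        f"Score the item against the criterion.\n"
        f"Criterion: {criterion}\nItem: {item}\n"
        'Return JSON: {"score": "pass"|"fail", "rationale": "..."}'
    )
    # Pass 2 -- the same model inspects its own Pass-1 output for biases
    # or gaps and returns a revised verdict.
    pass2 = call_llm(
        "Reflect on your previous verdict. Identify potential biases or "
        "gaps in its reasoning, then return a revised JSON verdict.\n"
        f"Previous verdict: {pass1}\n"
        f"Criterion: {criterion}\nItem: {item}"
    )
    return json.loads(pass2)

verdict = judge_with_reflection("chatbot transcript ...", "refund reasoning is correct")
```

Both passes hit the same model with the same criterion; only the prompt changes, which is what keeps this a single-agent pattern.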
## Structure
```
    Input
      │
      ▼
┌──────────────┐
│  LLM judge   │  initial score + rationale
│  (pass 1)    │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  LLM judge   │  reflects on pass 1 →
│  (pass 2:    │  identifies bias / gaps →
│  reflection) │  revised score + rationale
└──────┬───────┘
       │
       ▼
 Final verdict
```
Unlike patterns/multi-agent-debate-evaluation, there's only one agent — the same LLM examines its own prior output. No role specialisation, no adversarial structure.
## When to use
- Intermediate complexity criteria where direct prompting misses nuance but multi-agent debate's 3× cost is unjustified.
- When the Pass-1 failure mode is a well-characterised bias (recency bias, length bias, confident-but-wrong) that a reflection prompt can surface without needing a separate adversary.
- As a cost-bounded upgrade path from direct prompting before committing to debate.
## LACE's verdict
Instacart's LACE benchmarked reflection against direct prompting and debate, and picked debate as the production default for context-dependent and subjective criteria. Implication — reflection was not sufficient on its own for LACE's hardest criteria. The post doesn't publish a reflection-vs-debate win-rate matrix, but positions debate as "highly effective" for the context-dependent cases.
This is consistent with the broader pattern: self-reflection helps with first-pass errors the LLM itself can identify. It's weaker than debate when the error requires a different perspective — e.g. Instacart's "shopper used a company-authorized card, not the user's digital wallet" requires a reader who actively pushes back on the chatbot's reasoning, which Customer / Support roles supply explicitly.
## Comparison
| Pattern | Agents | Passes | Cost | Strength |
|---|---|---|---|---|
| Direct prompting | 1 | 1 | 1× | cheap baseline |
| Self-reflection | 1 | 2 | ~2× | catches first-pass biases the LLM can self-identify |
| patterns/multi-agent-debate-evaluation | 3 | 3 (2 parallel + judge) | ~3× | different-perspective errors; adversarial surfacing |
| patterns/drafter-evaluator-refinement-loop | 2+ | iterative | 3×–10× | refines the artefact, not the judgement |
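A back-of-envelope cost/latency model makes the table's multipliers concrete. The assumptions are mine, not the source's: every judge call bills one unit, and a batch of parallel calls adds a single latency step.

```python
# (sequential_calls, parallel_calls) per pattern; hypothetical model where
# each call bills 1 unit and a parallel batch adds one latency step.
patterns = {
    "direct":          (1, 0),
    "self-reflection": (2, 0),  # pass 2 must wait for pass 1
    "debate":          (1, 2),  # 2 debaters run in parallel, then 1 judge
}

results = {}
for name, (seq, par) in patterns.items():
    cost = seq + par                   # every call is billed
    latency = seq + (1 if par else 0)  # parallel batch = one extra step
    results[name] = (cost, latency)
    print(f"{name}: ~{cost}x cost, ~{latency}x latency")
```

Under this model, self-reflection matches debate on latency (~2×) despite its lower cost: Pass 2's dependence on Pass 1 removes the parallelism that debate gets from running its two debaters concurrently.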
## Tradeoffs / gotchas
- Reflection can amplify biases instead of fixing them. If the LLM's Pass-1 bias is systematic (e.g. always lenient), a reflection pass prompted generically ("are there any biases?") may rationalise rather than correct. Reflection prompts need to be specific — "consider whether you over-weighted tone over correctness" — to avoid confabulation.
- 2× latency + cost vs. direct prompting, with no parallelism opportunity (Pass 2 depends on Pass 1).
- On hard multi-turn reasoning tasks, debate dominates. If the Pass-1 error requires a counterargument the single LLM isn't inclined to make on its own, reflection won't reach it.
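The prompt-specificity point can be made concrete. Both templates below are illustrative, not from the source: the generic one invites rationalisation, while the specific one names the Pass-1 failure modes the model must check.

```python
# Generic reflection prompt: the model can answer "no biases found" and
# restate its Pass-1 verdict with even more confidence.
generic = (
    "Review your previous verdict. Are there any biases or gaps? "
    "Revise your score if needed."
)

# Specific reflection prompt: enumerates known Pass-1 failure modes,
# forcing an explicit check of each one before any revision.
specific = (
    "Review your previous verdict. Answer yes/no, with one sentence of "
    "evidence, for each check:\n"
    "1. Did you over-weight tone over factual correctness?\n"
    "2. Did you favour the longer response (length bias)?\n"
    "3. Did confident phrasing mask a wrong claim?\n"
    "Then return a revised verdict."
)
```

Tuning which failure modes to enumerate is exactly where the human-aligned refinement loop below comes in.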
## Pairs well with
- patterns/human-aligned-criteria-refinement-loop — the reflection prompt itself can be tuned via human-LLM alignment loops on cases where Pass 1 was wrong.
- concepts/binary-vs-graded-llm-scoring — reflection can flip a binary verdict cleanly; revising an integer on a graded scale is less crisp.
## Seen in
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — Instacart LACE benchmarked reflection as an alternative evaluation engine before choosing multi-agent debate for production. Cited prior art: Madaan et al. 2022 (arXiv:2212.09561), Jang 2023 (blog), Madaan et al. 2024 (arXiv:2406.10400).