PATTERN
Drafter-Evaluator refinement loop¶
Intent¶
Interpose a reasoning-focused Evaluator agent between a fast generative Drafter agent and the ship-to-user step, so the Drafter's output is never shipped without passing a rubric. On rubric failure, feed the Evaluator's per-candidate critique back into the Drafter for another attempt, up to a bounded retry count.
The pattern is the text / structured-output sibling of VLM-evaluator quality gate (image modality) and the task-layer sibling of planner-coder-verifier-router loop (agent-plan refinement). All three share the loop shape; they differ in output modality and model class.
Mechanism¶
input ──► DRAFTER (N candidates) ──► EVALUATOR (rubric)
            ▲                               │
            │                               ▼
            │                 any pass → ship best
            │                 all fail → feed critique
            │                               │
            └─── refine (up to K retries) ──┘
Four-step canonical loop (Lyft's AI localization pipeline — Source: sources/2026-02-19-lyft-scaling-localization-with-ai):
- Draft: the Drafter produces N candidates (Lyft: N=3) from a prompt that carries task context + constraints (glossary, placeholders, style hints).
- Evaluate: the Evaluator grades each candidate against a project-specific rubric (for Lyft translation: accuracy/clarity, fluency/adaptation, brand alignment, technical correctness). Grade per candidate: pass|revise + explanation.
- Decide: if any candidate passes → Evaluator selects the best and ships it; else → continue.
- Refine: feed the per-candidate critique text back into the Drafter prompt; the Drafter re-generates N candidates addressing the critique. Return to step 2 until a candidate passes or the retry budget is exhausted (Lyft: 3 attempts total).
Why the two roles must be different¶
Directly from Lyft's post, four architecture-determining reasons:
- Easier Evaluation. "Spotting errors is simpler than perfect generation, so the Evaluator doesn't need to be a flawless translator."
- Context Preservation. "The original translator retains the reasoning for its choices when refining based on feedback" — same drafter seeing own prior candidates + critique is better at course correction than a fresh drafter.
- Bias Avoidance. "Separating roles prevents the self-approval bias of a single model translating and evaluating its own work." See concepts/self-approval-bias.
- Flexibility / Cost. Different model tiers — Drafter cheap + fast (generation is easier), Evaluator capable + reasoning-focused (eval is harder, but runs on many fewer tokens than generation).
Reasons 2 and 4 are structural. Reasons 1 and 3 are the independence arguments that block the "let one model do both" collapse.
Structured handoff between agents¶
Output of each agent is a typed object, a Pydantic BaseModel in the Python case. The Drafter emits DrafterOutput(candidates: list[TranslationCandidate]); the Evaluator emits EvaluatorOutput(evaluations: list[CandidateEvaluation], best_candidate_index: int | None). Free-form text passed between agents causes parse failures and defeats the architecture; see concepts/pydantic-structured-llm-output and concepts/structured-output-reliability.
Why feed critique back beats discard-and-retry¶
Discard-and-retry samples the generator's distribution again with the same prompt — progress depends on stochastic luck. Feeding the per-candidate critique text into the next prompt shifts the distribution itself toward outputs that address the failing dimension. The Evaluator's failure signal is the gradient; the Drafter is the optimiser.
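The contrast can be made concrete as two prompt builders (hypothetical helpers, not from the source): discard-and-retry reuses the prompt unchanged, while critique feedback rewrites it:

```python
def retry_prompt(task: str) -> str:
    # Discard-and-retry: identical prompt, so the next sample comes from
    # the same distribution; progress relies on luck.
    return task

def critique_prompt(task: str, critiques: list[str]) -> str:
    # Critique feedback: the Evaluator's failure signal is folded into the
    # prompt, steering the Drafter toward the failing rubric dimensions.
    feedback = "\n".join(f"- {c}" for c in critiques)
    return (f"{task}\n\n"
            f"Your previous candidates were rejected for:\n{feedback}\n"
            f"Produce new candidates that address every point.")
```
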
Round-budget discipline¶
A retry cap is load-bearing, for two reasons:
- Cost/latency bounding — each round is 1 Drafter call (generating all N candidates in one shot) + 1 Evaluator call, so the 3-attempt budget means up to 6 LLM calls per source string in the worst case. Without a cap, pathological inputs run unboundedly.
- Diminishing returns — Lyft's post: "iterative refinement yields the largest gains in the first 1–2 cycles, so the three-attempt limit balances quality improvement against latency and cost."
See concepts/refinement-round-budget for the generalised concept (DS-STAR's 10-round cap is the adjacent instance at the agent-plan layer).
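Taking the stated per-round cost at face value (1 Drafter call for all N candidates plus 1 Evaluator call), the worst-case call count is a one-liner:

```python
def worst_case_llm_calls(attempts: int) -> int:
    # Each round costs 2 calls regardless of N: the Drafter emits all
    # N candidates in a single call, and the Evaluator grades them in one.
    return attempts * 2

# A 3-attempt budget bounds any single source string at 6 calls.
```
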
Tradeoffs / gotchas¶
- Cost compounds with retry count. Worst-case is retries × (drafter_call + evaluator_call); even the 1-retry median case is 2–3× a single-shot LLM call.
- All-fail terminal behaviour is under-specified. Lyft's post stops at "up to three times" — it does not describe what happens if all three attempts fail (human escalation? ship best-grade candidate even without pass? skip the string?). This is a product UX gap that the architecture doesn't commit to.
- Evaluator calibration dominates downstream quality. An over-permissive evaluator ships bad output; an over-strict evaluator exhausts budget on acceptable outputs. Human-sampled calibration of evaluator decisions is operational work this pattern does not remove.
- Rubric drift. New failure modes require rubric updates; rubric updates invalidate prior evaluator calibration. The evaluator rubric is a long-lived asset.
- Judge bias persists even with separation. Separating roles breaks self-approval bias (see concepts/self-approval-bias) but the evaluator has its own length/confidence/verbosity biases (see concepts/llm-as-judge).
- Not a safety proof. Passing the rubric ≠ safe to ship; legal / brand / trademark / PII checks still need out-of-band gates.
- Drafter-vs-Evaluator model tier choice is empirical. "Fast non-reasoning" vs "reasoning-focused" is Lyft's framing, not a model recommendation. Replicating the architecture requires picking specific tiers for the application.
Modality variants on the wiki¶
- Text-translation — systems/lyft-ai-localization-pipeline (canonical text instance, this pattern).
- Image-generation — patterns/vlm-evaluator-quality-gate / systems/instacart-pixel (VLM judge; 20% → 85% human approval uplift reported).
- Agent-plan refinement — patterns/planner-coder-verifier-router-loop / systems/ds-star (verifier agent; 10-round budget).
- Structured extraction — systems/instacart-parse / concepts/llm-self-verification uses a confidence score instead of a judge-LLM; different shape but same "don't ship without a quality gate" intent.
Seen in¶
- sources/2026-02-19-lyft-scaling-localization-with-ai — canonical wiki text-translation instance. Lyft's AI localization pipeline ships the four-step loop with N=3 candidates and a 3-attempt retry cap. The post defends the Drafter/Evaluator separation with the four-reason argument above. No before/after quality numbers published (contrast with Instacart PIXEL's 20% → 85% figure on the VLM sibling).
Related¶
- concepts/iterative-prompt-refinement
- concepts/llm-as-judge
- concepts/self-approval-bias
- concepts/drafter-expert-split
- concepts/refinement-round-budget
- concepts/structured-output-reliability
- concepts/machine-translation-with-llms
- patterns/multi-candidate-generation
- patterns/vlm-evaluator-quality-gate
- patterns/planner-coder-verifier-router-loop
- systems/lyft-ai-localization-pipeline
- systems/instacart-pixel