
PATTERN Cited by 1 source

Drafter-Evaluator refinement loop

Intent

Interpose a reasoning-focused Evaluator agent between a fast generative Drafter agent and the ship-to-user step, so the Drafter's output is never shipped without passing a rubric. On rubric failure, feed the Evaluator's per-candidate critique back into the Drafter for another attempt, up to a bounded retry count.

The pattern is the text / structured-output sibling of VLM-evaluator quality gate (image modality) and the task-layer sibling of planner-coder-verifier-router loop (agent-plan refinement). All three share the loop shape; they differ in output modality and model class.

Mechanism

input ──► DRAFTER (N candidates) ──► EVALUATOR (rubric)
             ▲                          │
             │                          ▼
             │            ┌──────────────────────────────┐
             │            │ any pass  → ship best        │
             │            │ all fail  → feed critique ──►┘
             │                                    up to K retries
             └────────────────────────────── refine

Four-step canonical loop (Lyft's AI localization pipeline — Source: sources/2026-02-19-lyft-scaling-localization-with-ai):

  1. Drafter produces N candidates (Lyft: N=3) from a prompt that carries task context + constraints (glossary, placeholders, style hints).
  2. Evaluator grades each candidate against a project-specific rubric (for Lyft translation: accuracy/clarity, fluency/adaptation, brand alignment, technical correctness). Grade per candidate: pass | revise + explanation.
  3. Decide: if any candidate passes → Evaluator selects best, ship; else → continue.
  4. Refine: feed the per-candidate critique text back into the Drafter prompt; Drafter re-generates N candidates addressing the critique. Return to step 2 until pass or retry budget exhausted (Lyft: 3 attempts total).
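The four steps above reduce to a small control-flow sketch. The `drafter`/`evaluator` callables and their return shapes are assumptions for illustration, not Lyft's implementation; only the loop shape comes from the post.

```python
def refine_loop(source, drafter, evaluator, n=3, max_attempts=3):
    critique = None
    for attempt in range(max_attempts):
        # Step 1/4: draft N candidates, carrying critique after round 1.
        candidates = drafter(source, n=n, critique=critique)
        # Step 2: grade each candidate -> list of (verdict, explanation)
        # pairs, plus the index of the best candidate if any passed.
        grades, best = evaluator(source, candidates)
        # Step 3: ship the Evaluator's pick as soon as anything passes.
        if best is not None:
            return candidates[best]
        # Step 4: otherwise loop, feeding the critique text back in.
        critique = "\n".join(expl for _, expl in grades)
    return None  # budget exhausted; terminal behaviour is app-specific
```

Note the deliberate `None` return on budget exhaustion: the all-fail terminal path is exactly the part the Lyft post leaves unspecified (see Tradeoffs below), so the sketch surfaces it rather than inventing a policy.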

Why the two roles must be different

Directly from Lyft's post, four architecture-determining reasons:

  1. Easier Evaluation. "Spotting errors is simpler than perfect generation, so the Evaluator doesn't need to be a flawless translator."
  2. Context Preservation. "The original translator retains the reasoning for its choices when refining based on feedback" — same drafter seeing own prior candidates + critique is better at course correction than a fresh drafter.
  3. Bias Avoidance. "Separating roles prevents the self-approval bias of a single model translating and evaluating its own work." See concepts/self-approval-bias.
  4. Flexibility / Cost. Different model tiers — Drafter cheap + fast (generation is easier), Evaluator capable + reasoning-focused (eval is harder, but runs on many fewer tokens than generation).

Reasons 2 and 4 are structural. Reasons 1 and 3 are the independence arguments that block the "let one model do both" collapse.

Structured handoff between agents

Each agent's output is a typed object — a Pydantic BaseModel in the Python case. The Drafter emits DrafterOutput(candidates: list[TranslationCandidate]); the Evaluator emits EvaluatorOutput(evaluations: list[CandidateEvaluation], best_candidate_index: int | None). Free-form text passing between agents causes parse failures and defeats the architecture; see concepts/pydantic-structured-llm-output and concepts/structured-output-reliability.

Why feed critique back beats discard-and-retry

Discard-and-retry samples the generator's distribution again with the same prompt — progress depends on stochastic luck. Feeding the per-candidate critique text into the next prompt shifts the distribution itself toward outputs that address the failing dimension. The Evaluator's failure signal is the gradient; the Drafter is the optimiser.
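Concretely, the distribution shifts because the retry prompt literally contains the Evaluator's failure signal. A sketch of the prompt assembly; the template wording and function name are assumptions, not Lyft's prompt.

```python
def build_retry_prompt(source, prior_candidates, critiques):
    # Pair each failed candidate with its per-candidate critique so the
    # Drafter sees exactly which dimension each attempt failed on.
    prior = "\n".join(
        f"- Candidate {i}: {cand}\n  Critique: {crit}"
        for i, (cand, crit) in enumerate(zip(prior_candidates, critiques))
    )
    return (
        f"Translate: {source}\n"
        "Your previous candidates all failed evaluation:\n"
        f"{prior}\n"
        "Produce new candidates that address each critique."
    )
```

Discard-and-retry would resend only the first line; the appended critique block is what turns the Evaluator's signal into a steering input.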

Round-budget discipline

A retry cap is load-bearing, for two reasons:

  • Cost/latency bounding — each attempt is 1 Drafter call (generating N candidates) + 1 Evaluator call, so 3 attempts is up to 6 LLM calls per source string (up to 12 if each of the N candidates is sampled as a separate generation call). Without a cap, pathological inputs run unboundedly.
  • Diminishing returns — Lyft's post: "iterative refinement yields the largest gains in the first 1–2 cycles, so the three-attempt limit balances quality improvement against latency and cost."

See concepts/refinement-round-budget for the generalised concept (DS-STAR's 10-round cap is the adjacent instance at the agent-plan layer).

Tradeoffs / gotchas

  • Cost compounds with retry count. Worst-case is retries × (drafter_call + evaluator_call); even the 1-retry median case is 2–3× a single-shot LLM call.
  • All-fail terminal behaviour is under-specified. Lyft's post stops at "up to three times" — it does not describe what happens if all three attempts fail (human escalation? ship best-grade candidate even without pass? skip the string?). This is a product UX gap that the architecture doesn't commit to.
  • Evaluator calibration dominates downstream quality. An over-permissive evaluator ships bad output; an over-strict evaluator exhausts budget on acceptable outputs. Human-sampled calibration of evaluator decisions is operational work this pattern does not remove.
  • Rubric drift. New failure modes require rubric updates; rubric updates invalidate prior evaluator calibration. The evaluator rubric is a long-lived asset.
  • Judge bias persists even with separation. Separating roles breaks self-approval bias (see concepts/self-approval-bias) but the evaluator has its own length/confidence/verbosity biases (see concepts/llm-as-judge).
  • Not a safety proof. Passing the rubric ≠ safe to ship; legal / brand / trademark / PII checks still need out-of-band gates.
  • Drafter-vs-Evaluator model tier choice is empirical. "Fast non-reasoning" vs "reasoning-focused" is Lyft's framing, not a model recommendation. Replicating the architecture requires picking specific tiers for the application.

Modality variants on the wiki

  • VLM-evaluator quality gate — image-modality sibling (Instacart PIXEL).
  • planner-coder-verifier-router loop — agent-plan sibling at the task layer (DS-STAR).

Seen in

  • sources/2026-02-19-lyft-scaling-localization-with-ai: canonical wiki text-translation instance. Lyft's AI localization pipeline ships the four-step loop with N=3 candidates and a 3-attempt retry cap. The post defends the Drafter/Evaluator separation with the four-reason argument above. No before/after quality numbers published (contrast with Instacart PIXEL's 20% → 85% figure on the VLM sibling).