
PATTERN Cited by 1 source

Drafter-Evaluator refinement loop

Intent

Interpose a reasoning-focused Evaluator agent between a fast generative Drafter agent and the ship-to-user step, so the Drafter's output is never shipped without passing a rubric. On rubric failure, feed the Evaluator's per-candidate critique back into the Drafter for another attempt, up to a bounded retry count.

The pattern is the text / structured-output sibling of VLM-evaluator quality gate (image modality) and the task-layer sibling of planner-coder-verifier-router loop (agent-plan refinement). All three share the loop shape; they differ in output modality and model class.

Mechanism

input ──► DRAFTER (N candidates) ──► EVALUATOR (rubric)
             ▲                          │
             │                          ▼
             │            ┌──────────────────────────────┐
             │            │ any pass  → ship best        │
             │            │ all fail  → feed critique ──►┘
             │                                    up to K retries
             └────────────────────────────── refine

Four-step canonical loop (Lyft's AI localization pipeline — Source: sources/2026-02-19-lyft-scaling-localization-with-ai):

  1. Drafter produces N candidates (Lyft: N=3) from a prompt that carries task context + constraints (glossary, placeholders, style hints).
  2. Evaluator grades each candidate against a project-specific rubric (for Lyft translation: accuracy/clarity, fluency/adaptation, brand alignment, technical correctness). Grade per candidate: pass | revise + explanation.
  3. Decide: if any candidate passes → Evaluator selects best, ship; else → continue.
  4. Refine: feed the per-candidate critique text back into the Drafter prompt; Drafter re-generates N candidates addressing the critique. Return to step 2 until pass or retry budget exhausted (Lyft: 3 attempts total).
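The four steps above reduce to a small control-flow sketch. The `drafter`/`evaluator` callables and their return shapes are assumptions for illustration, not Lyft's implementation; only the loop shape comes from the post.

```python
def refine_loop(source, drafter, evaluator, n=3, max_attempts=3):
    critique = None
    for attempt in range(max_attempts):
        # Step 1/4: draft N candidates, carrying critique after round 1.
        candidates = drafter(source, n=n, critique=critique)
        # Step 2: grade each candidate -> list of (verdict, explanation)
        # pairs, plus the index of the best candidate if any passed.
        grades, best = evaluator(source, candidates)
        # Step 3: ship the Evaluator's pick as soon as anything passes.
        if best is not None:
            return candidates[best]
        # Step 4: otherwise loop, feeding the critique text back in.
        critique = "\n".join(expl for _, expl in grades)
    return None  # budget exhausted; terminal behaviour is app-specific
```

Note the deliberate `None` return on budget exhaustion: the all-fail terminal path is exactly the part the Lyft post leaves unspecified (see Tradeoffs below), so the sketch surfaces it rather than inventing a policy.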

Why the two roles must be different

Directly from Lyft's post, four architecture-determining reasons:

  1. Easier Evaluation. "Spotting errors is simpler than perfect generation, so the Evaluator doesn't need to be a flawless translator."
  2. Context Preservation. "The original translator retains the reasoning for its choices when refining based on feedback" — same drafter seeing own prior candidates + critique is better at course correction than a fresh drafter.
  3. Bias Avoidance. "Separating roles prevents the self-approval bias of a single model translating and evaluating its own work." See concepts/self-approval-bias.
  4. Flexibility / Cost. Different model tiers — Drafter cheap + fast (generation is easier), Evaluator capable + reasoning-focused (eval is harder, but runs on many fewer tokens than generation).

Reasons 2 and 4 are structural. Reasons 1 and 3 are the independence arguments that block the "let one model do both" collapse.

Structured handoff between agents

Each agent's output is a typed object — a Pydantic BaseModel in the Python case. The Drafter emits DrafterOutput(candidates: list[TranslationCandidate]); the Evaluator emits EvaluatorOutput(evaluations: list[CandidateEvaluation], best_candidate_index: int | None). Free-form text passing between agents causes parse failures and defeats the architecture; see concepts/pydantic-structured-llm-output and concepts/structured-output-reliability.

Why feed critique back beats discard-and-retry

Discard-and-retry samples the generator's distribution again with the same prompt — progress depends on stochastic luck. Feeding the per-candidate critique text into the next prompt shifts the distribution itself toward outputs that address the failing dimension. The Evaluator's failure signal is the gradient; the Drafter is the optimiser.
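Concretely, the distribution shifts because the retry prompt literally contains the Evaluator's failure signal. A sketch of the prompt assembly; the template wording and function name are assumptions, not Lyft's prompt.

```python
def build_retry_prompt(source, prior_candidates, critiques):
    # Pair each failed candidate with its per-candidate critique so the
    # Drafter sees exactly which dimension each attempt failed on.
    prior = "\n".join(
        f"- Candidate {i}: {cand}\n  Critique: {crit}"
        for i, (cand, crit) in enumerate(zip(prior_candidates, critiques))
    )
    return (
        f"Translate: {source}\n"
        "Your previous candidates all failed evaluation:\n"
        f"{prior}\n"
        "Produce new candidates that address each critique."
    )
```

Discard-and-retry would resend only the first line; the appended critique block is what turns the Evaluator's signal into a steering input.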

Round-budget discipline

A retry cap is load-bearing, for two reasons:

  • Cost/latency bounding — each attempt is 1 Drafter call (generating N candidates) + 1 Evaluator call, so 3 attempts is up to 6 LLM calls per source string (up to 12 if each of the N candidates is sampled as a separate generation call). Without a cap, pathological inputs run unboundedly.
  • Diminishing returns — Lyft's post: "iterative refinement yields the largest gains in the first 1–2 cycles, so the three-attempt limit balances quality improvement against latency and cost."

See concepts/refinement-round-budget for the generalised concept (DS-STAR's 10-round cap is the adjacent instance at the agent-plan layer).

Tradeoffs / gotchas

  • Cost compounds with retry count. Worst-case is retries × (drafter_call + evaluator_call); even the 1-retry median case is 2–3× a single-shot LLM call.
  • All-fail terminal behaviour is under-specified. Lyft's post stops at "up to three times" — it does not describe what happens if all three attempts fail (human escalation? ship best-grade candidate even without pass? skip the string?). This is a product UX gap that the architecture doesn't commit to.
  • Evaluator calibration dominates downstream quality. An over-permissive evaluator ships bad output; an over-strict evaluator exhausts budget on acceptable outputs. Human-sampled calibration of evaluator decisions is operational work this pattern does not remove.
  • Rubric drift. New failure modes require rubric updates; rubric updates invalidate prior evaluator calibration. The evaluator rubric is a long-lived asset.
  • Judge bias persists even with separation. Separating roles breaks self-approval bias (see concepts/self-approval-bias) but the evaluator has its own length/confidence/verbosity biases (see concepts/llm-as-judge).
  • Not a safety proof. Passing the rubric ≠ safe to ship; legal / brand / trademark / PII checks still need out-of-band gates.
  • Drafter-vs-Evaluator model tier choice is empirical. "Fast non-reasoning" vs "reasoning-focused" is Lyft's framing, not a model recommendation. Replicating the architecture requires picking specific tiers for the application.

Modality variants on the wiki

  • VLM-evaluator quality gate — image-modality sibling (Instacart PIXEL).
  • planner-coder-verifier-router loop — agent-plan sibling at the task layer (DS-STAR).

Seen in

  • sources/2026-02-19-lyft-scaling-localization-with-ai: canonical wiki text-translation instance. Lyft's AI localization pipeline ships the four-step loop with N=3 candidates and a 3-attempt retry cap. The post defends the Drafter/Evaluator separation with the four-reason argument above. No before/after quality numbers published (contrast with Instacart PIXEL's 20% → 85% figure on the VLM sibling).