PATTERN Cited by 1 source

Multi-candidate generation

Intent

Generate N candidate outputs per input (instead of one), then let a downstream selector — typically a judge LLM, a rubric, or a user — pick the best. Trades N× generation cost for a higher probability that at least one output is acceptable, exploiting the variance of sampling-based generation.

The pattern is a subroutine of larger architectures, most commonly drafter-evaluator refinement loops (where N candidates feed a judge) and best-of-N sampling in agent harnesses.

Mechanism

input ──► generator ──► candidate_1 ─┐
                    ──► candidate_2 ─┤
                    ...              ├─► selector ──► chosen output
                    ──► candidate_N ─┘
  • Generator is temperature > 0 (or explicit n=N / multi-sample API) so candidates are genuinely distinct.
  • N is small — Lyft uses N=3; in LLM practice N is typically 2–10. Beyond that, returns diminish rapidly and cost grows linearly.
  • Selector is external (rubric, judge, human, heuristic); the pattern doesn't prescribe it. What matters is that selection is a separate operation from generation.
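The mechanism above can be sketched in a few lines. Both `generate` and `score` here are hypothetical stand-ins: a real generator is an LLM sampled at temperature > 0 (or one call with n=N), and a real selector is a judge LLM, rubric, human, or heuristic.

```python
import random

# Hypothetical stand-ins for the generator and the selector.
def generate(prompt: str) -> str:
    # A real implementation would call an LLM with temperature > 0.
    return f"{prompt} [variant {random.randint(0, 9999)}]"

def score(candidate: str) -> float:
    # Placeholder rubric score; a real selector might be a judge LLM.
    return random.random()

def best_of_n(prompt: str, n: int = 3) -> str:
    candidates = [generate(prompt) for _ in range(n)]  # N independent samples
    return max(candidates, key=score)                  # selection is a separate step
```

The only structural commitment is the last line: selection is a distinct operation applied after all N candidates exist.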

Why multiple candidates beat one

A single LLM output at temperature 0 converges on the most-likely phrasing per the model's training distribution. For open-ended tasks with no unique best answer (translation, creative writing, summarisation, UI copy), the most-likely phrasing is rarely the phrasing that best fits the specific context — brand voice, register, audience, UI constraint.

Lyft's framing (Source: sources/2026-02-19-lyft-scaling-localization-with-ai): "A single translation often converges on the most likely phrasing, which may not be optimal for Lyft's brand voice or the specific UI context. Multiple candidates increase the probability that at least one captures the right tone, handles edge cases correctly, and uses terminology naturally."

Under an N=3 regime with a judge scoring candidates, the effective quality is roughly max_{i=1..N} quality(c_i) — a diversity-weighted upper envelope rather than the single-sample mean.
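A toy simulation makes the envelope concrete. The Gaussian quality distribution is an assumption for illustration, not anything measured by Lyft:

```python
import random

random.seed(0)

def sample_quality() -> float:
    # Assumed per-candidate quality under temperature > 0 sampling;
    # the mean/std are purely illustrative.
    return random.gauss(0.6, 0.15)

TRIALS = 10_000
single_mean = sum(sample_quality() for _ in range(TRIALS)) / TRIALS
best_of_3 = sum(max(sample_quality() for _ in range(3))
                for _ in range(TRIALS)) / TRIALS
# Taking the max of 3 draws shifts expected quality above the
# single-sample mean, at 3x the generation cost.
```

This assumes a perfect selector; a noisy judge erodes the gap between the envelope and the mean.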

Sizing N

Diminishing returns set in quickly:

N     Cost multiplier   Qualitative gain
1     baseline
2     2×                modest — covers "obvious" variation
3     3×                Lyft's choice; knee of the curve for translation
5     5×                further lift on high-variance tasks (creative)
10+   10×+              rare; reserved for benchmark settings or strict best-of-N

The exact knee is task-dependent and empirical. Lyft's N=3 is "three distinct candidates for every source string" — the post does not publish an ablation across N values.

Relationship to selector

  • Judge LLM / rubric (concepts/llm-as-judge) — selector is itself an LLM scoring each candidate against a rubric; this is the core of the drafter-evaluator architecture.
  • Pass-at-k evaluation (concepts/pass-at-k) — in code-generation benchmarks, "any candidate passes the test" is the selector; same multi-candidate subroutine, different selector.
  • Human selector — the most expensive selector, used in creative tools where taste is the evaluator.
  • Heuristic selector — e.g. log-prob score, length constraints; cheap but weak.
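A heuristic selector of the last kind might look like this sketch. The candidate shape (a `text` plus a model-reported `logprob` field) is an assumption about what the generation API returns:

```python
def heuristic_select(candidates: list[dict], max_chars: int = 40) -> dict:
    # Hard constraint first: drop candidates that overflow the UI slot.
    # Fall back to all candidates if the filter eliminates everything.
    fitting = [c for c in candidates if len(c["text"]) <= max_chars] or candidates
    # Then pick the highest mean log-prob. Cheap but weak: log-prob
    # measures fluency, not fitness for brand voice or context.
    return max(fitting, key=lambda c: c["logprob"])

candidates = [
    {"text": "Request a ride", "logprob": -0.8},
    {"text": "Tap here to request a ride from a driver nearby", "logprob": -0.5},
]
heuristic_select(candidates)  # the longer, higher-logprob candidate overflows
```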

Tradeoffs / gotchas

  • Cost is linear in N. Each candidate is a full generation call. N=3 is 3× cost; there's no shared-prefix discount unless the generation API supports it explicitly.
  • Candidates can be nearly identical. At low temperature or on low-entropy inputs, the N candidates converge; the pattern wastes cost. Higher temperature mitigates but introduces its own quality variance.
  • The selector is the new bottleneck. Bad selector ⇒ multi-candidate doesn't help; good selector ⇒ multi-candidate compounds well. Selector calibration is load-bearing.
  • Not appropriate for deterministic tasks. Factual lookups, code with a unique correct implementation, strict classification — here single-candidate with ground truth beats multi-candidate-with-judge.
  • N and retry budget interact multiplicatively. In a Drafter-Evaluator loop with retries, worst-case cost is N * (retries + 1) generations + (retries + 1) judge calls. Lyft's N=3, retries=3 ⇒ up to 12 generations.
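The worst-case budget from the last bullet, as arithmetic (a sketch of the accounting only, assuming every attempt is rejected until the retry budget is exhausted):

```python
def worst_case_calls(n: int, retries: int) -> tuple[int, int]:
    # Each attempt generates N candidates and makes one judge call;
    # a retry budget of R allows R + 1 attempts in the worst case.
    attempts = retries + 1
    return n * attempts, attempts  # (generator calls, judge calls)

worst_case_calls(3, 3)  # Lyft's setting: up to 12 generations, 4 judge calls
```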

Seen in

  • sources/2026-02-19-lyft-scaling-localization-with-ai (canonical wiki instance). Lyft's AI localization pipeline generates 3 distinct translation candidates per source string, then the Evaluator picks the best or rejects all. The "why three?" argument in the post is the clearest wiki articulation of the multi-candidate rationale: "multiple candidates increase the probability that at least one captures the right tone."