Lyft AI Localization Pipeline

Overview

The Lyft AI Localization Pipeline is Lyft's LLM-based machine-translation system for localizing UI strings across the Lyft app's supported languages and regions. It replaces the naive single-call approach (English in, translation out) with a two-agent iterative loop: a Drafter generates multiple candidate translations per source string, and an Evaluator scores each candidate on a project-specific rubric and selects the best; if all candidates fail, the Evaluator's critique is fed back to the Drafter for another attempt, up to three in total (Source: sources/2026-02-19-lyft-scaling-localization-with-ai).

The pipeline is the canonical text-translation instance on the wiki of the generator-evaluator-refine loop primitive. Its image-generation sibling is Instacart PIXEL; its structured-extraction sibling is Instacart PARSE; its agent-loop sibling is Google DS-STAR.

Architecture

source string + language + country + glossary + placeholders
                 │
                 ▼
   ┌──────────────┐
   │   DRAFTER    │  fast non-reasoning model
   │   (N=3)      │  → List[TranslationCandidate]
   └──────┬───────┘
          │
          ▼
   ┌──────────────┐
   │  EVALUATOR   │  reasoning-focused model
   │ 4-dim rubric │  → List[CandidateEvaluation] + best_index
   └───┬──────┬───┘
       │      │
   any pass  all fail ─► feed critique back to Drafter
       │                 (up to 3 attempts)
       ▼
  ship best candidate

Drafter

  • Role: generate N=3 distinct candidate translations per source string.
  • Model tier: fast, non-reasoning. The post's rationale: "translation is primarily a generative task where standard models already perform very well," and the lower cost/latency enables iteration.
  • Prompt skeleton (from the post):
    You are a professional translator for Lyft.
    Translate into {language} for {country}.
    Give {num_translations} translations of the following text.
    
    GLOSSARY: {glossary}
    PLACEHOLDERS (preserve exactly): {placeholders}
    
    Text: {source_text}
    
  • Output shape: Pydantic DrafterOutput(candidates: list[TranslationCandidate]).
  • Why three? "A single translation often converges on the most likely phrasing, which may not be optimal for Lyft's brand voice or the specific UI context. Multiple candidates increase the probability that at least one captures the right tone, handles edge cases correctly, and uses terminology naturally."

Evaluator

  • Role: strict quality gate — grade each candidate + pick best or reject all.
  • Model tier: reasoning-focused — the work is "analytical comparison: checking source versus target for semantic drift, verifying terminology compliance, catching subtle tone mismatches."
  • Rubric (4 dimensions):
      Dimension               Question
      ─────────────────────   ──────────────────────────────────────────────
      Accuracy & Clarity      Preserves the full meaning; unambiguous?
      Fluency & Adaptation    Reads naturally to a native speaker; culturally appropriate for the target region?
      Brand Alignment         Uses official Lyft terminology; proper nouns, airport codes, and brand names preserved in English?
      Technical Correctness   Free of spelling/grammar errors; Lyft terms and phrases applied correctly?
  • Grade per candidate: pass | revise + explanation.
  • Output shape: EvaluatorOutput(evaluations, best_candidate_index).
  • All-fail path: feed the per-candidate critique text back to the Drafter to produce a revised N=3 candidate set; repeat up to three attempts.
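The evaluator side of the contract can be sketched the same way. `evaluations` and `best_candidate_index` are named in the post; the per-evaluation fields and the `None`-on-all-fail convention are assumptions:

```python
from typing import Literal, Optional

from pydantic import BaseModel


class CandidateEvaluation(BaseModel):
    grade: Literal["pass", "revise"]  # per-candidate grade from the post
    explanation: str                  # critique fed back to the Drafter on "revise"


class EvaluatorOutput(BaseModel):
    evaluations: list[CandidateEvaluation]
    best_candidate_index: Optional[int] = None  # assumed None when all fail


# A model verdict parses into the typed shape:
verdict = EvaluatorOutput.model_validate_json(
    '{"evaluations": [{"grade": "pass", "explanation": "terminology OK"}],'
    ' "best_candidate_index": 0}'
)
```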

Retry / self-correction

  • Cap: three attempts per source string.
  • Rationale (from the post): "Iterative refinement yields the largest gains in the first 1–2 cycles, so the three-attempt limit balances quality improvement against latency and cost."
  • Canonical wiki text-translation instance of concepts/refinement-round-budget (DS-STAR's 10-round agent-loop cap is the numerical cousin at a different layer).
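The bounded loop can be sketched as plain orchestration code. `draft` and `evaluate` are hypothetical stand-ins for the two LLM calls, and the terminal all-attempts-fail policy shown is an assumption the post leaves open:

```python
MAX_ATTEMPTS = 3  # cap from the post


def translate(source, draft, evaluate):
    """Run the Drafter/Evaluator loop for one source string."""
    critique = None
    fallback = None
    for _ in range(MAX_ATTEMPTS):
        candidates = draft(source, critique)            # N=3 candidate strings
        evaluations, best_index = evaluate(source, candidates)
        if best_index is not None:                      # at least one passed
            return candidates[best_index]
        critique = evaluations                          # all failed: feed critique back
        fallback = candidates[0]
    # The post does not specify the all-attempts-fail terminal path;
    # returning the last attempt's first candidate is just one possible policy.
    return fallback


# Toy stubs standing in for the LLM calls: the second candidate always passes.
def draft(source, critique):
    return [f"draft of {source!r}", f"better draft of {source!r}", f"odd draft of {source!r}"]


def evaluate(source, candidates):
    return (["revise", "pass", "revise"], 1)


best = translate("Request a ride", draft, evaluate)
```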

Structured output via Pydantic

Every inter-agent message (drafter→evaluator, evaluator→drafter, final output) is a Pydantic-typed object, not free text. "This ensures type safety, reliable parsing, and clear contracts between Drafter and Evaluator."

See concepts/pydantic-structured-llm-output for the general pattern; see concepts/structured-output-reliability for the wider reliability argument (Dropbox Dash framed malformed output as fully incorrect).
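A minimal illustration of why the typed contract matters: a well-formed response parses into a typed object, while a malformed one raises a validation error instead of flowing downstream. Field names beyond the `candidates` list named in the post are assumptions:

```python
from pydantic import BaseModel, ValidationError


class TranslationCandidate(BaseModel):
    text: str  # assumed field name


class DrafterOutput(BaseModel):
    candidates: list[TranslationCandidate]


# Well-formed model output parses into a typed object, not free text.
parsed = DrafterOutput.model_validate_json(
    '{"candidates": [{"text": "Solicita un viaje"}]}'
)

# Malformed output fails loudly rather than being silently shipped.
try:
    DrafterOutput.model_validate_json('{"candidates": "oops"}')
    malformed_rejected = False
except ValidationError:
    malformed_rejected = True
```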

Design rationale

The post enumerates four reasons the Drafter and Evaluator must be separate:

  1. Easier Evaluation. "Spotting errors is simpler than perfect generation, so the Evaluator doesn't need to be a flawless translator."
  2. Context Preservation. "The original translator retains the reasoning for its choices when refining based on feedback."
  3. Bias Avoidance. "Separating roles prevents the self-approval bias of a single model translating and evaluating its own work." Canonical articulation of concepts/self-approval-bias on the wiki.
  4. Flexibility / Cost. Different model tiers for different jobs — fast drafter, capable evaluator.

All four are architecture-determining; (3) is the one most commonly missing from single-model MT-via-LLM systems.

Relationship to adjacent wiki primitives

All the siblings named in the overview instantiate the generator-evaluator-refine loop at different layers: Instacart PIXEL (image generation), Instacart PARSE (structured extraction), and Google DS-STAR (agent loop). This pipeline is the text-translation instance.

Gaps / unknowns

See caveats on the source page. Notably: no model names, no convergence statistics, no quality numbers, no handling for all-three-attempts-fail terminal path, no glossary-management operational detail.
