
CONCEPT Cited by 1 source

Machine translation with LLMs

Definition

Machine translation with LLMs (MT-via-LLM) is the use of general-purpose large language models as the translation engine for production localization systems, replacing purpose-built statistical or neural MT engines (Google Translate, DeepL, OpenNMT, etc.). The shift is not "swap the API": it changes the system shape, because LLM translation quality and reliability are controlled through prompting, iteration, and evaluation loops rather than through model-internal choices.

Why one-shot MT-via-LLM fails at production scale

Lyft's 2026-02 post enumerates three failure modes of the naive "send English string, receive translated string" API call (Source: sources/2026-02-19-lyft-scaling-localization-with-ai):

  1. No nuance. "Single-shot translations are often 'correct enough' but rarely optimal for brand voice or regional idioms." Classical MT systems have the same problem, but LLMs are usually picked because the team wants brand-voice control — so single-shot defeats the choice.
  2. No quality signal. "Without evaluation, there's no way to know if a translation is acceptable before shipping it to users." Classical MT APIs return a confidence score; LLM APIs typically don't — evaluation must be added as a separate architectural layer.
  3. No recovery path. "When a translation fails validation, the system has no mechanism to try again with corrective feedback." Without a feedback loop, bad translations stay bad.

The shape the architecture takes

Once one-shot is rejected, the shipped architecture becomes an iterative generator-critique-refine loop — the same shape as image-generation quality gates (patterns/vlm-evaluator-quality-gate) and agent-plan refinement (concepts/iterative-plan-refinement). The text-MT instance is named patterns/drafter-evaluator-refinement-loop on the wiki; Lyft's systems/lyft-ai-localization-pipeline is the canonical implementation.

Operational shape:

  • Drafter generates N candidate translations per source string (Lyft: N=3).
  • Evaluator grades candidates on a project rubric (accuracy/clarity, fluency/adaptation, brand alignment, technical correctness).
  • On all-fail, loop with the evaluator's corrective feedback, up to a bounded attempt count (Lyft: 3).
  • Prompt arguments include glossary (brand terms) and placeholder list ({vehicle_type} etc) to preserve terminology + i18n message variables.
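The operational shape above can be sketched as a minimal loop. `llm_draft` and `llm_judge` are hypothetical stand-ins for real model calls (stubbed here so the sketch runs); `N_CANDIDATES` and `MAX_ATTEMPTS` mirror Lyft's reported values.

```python
from dataclasses import dataclass

N_CANDIDATES = 3  # candidate translations per source string (Lyft: 3)
MAX_ATTEMPTS = 3  # bounded retries on all-fail (Lyft: 3)

@dataclass
class Verdict:
    passed: bool
    feedback: str  # critique fed back into the next drafting attempt

def llm_draft(source: str, glossary: dict, placeholders: list, feedback: str = "") -> list:
    """Hypothetical drafter call: returns N candidate translations.
    A real implementation would prompt with glossary, placeholder list,
    and any evaluator feedback from the previous attempt."""
    return [f"[{i}] {source}" for i in range(N_CANDIDATES)]  # stub

def llm_judge(source: str, candidate: str) -> Verdict:
    """Hypothetical evaluator call: grades one candidate on the rubric
    (accuracy/clarity, fluency/adaptation, brand alignment, correctness)."""
    return Verdict(passed=True, feedback="")  # stub

def translate(source: str, glossary: dict, placeholders: list):
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        for candidate in llm_draft(source, glossary, placeholders, feedback):
            verdict = llm_judge(source, candidate)
            if verdict.passed:
                return candidate
            feedback = verdict.feedback  # corrective signal for the retry
    return None  # attempts exhausted: route to human review or fallback
```

Note the structural point: because iteration happens offline per string, the loop's extra latency is paid at localization time, not per user request.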

Tradeoffs vs classical MT

| Axis | Classical MT (Google/DeepL/OpenNMT) | MT-via-LLM |
|---|---|---|
| Latency (single call) | ~50-100 ms | ~1-10 s |
| Cost per string | fraction of a cent | cents (esp. with retries) |
| Brand voice control | external glossary, limited | inline in prompt, full |
| Quality evaluation | API confidence score | must be engineered (judge LLM) |
| Recovery on bad output | re-translate with different API | critique-and-refine loop |
| Rare-language coverage | training-data limited | LLM-capability limited |
| Structural guarantees | no `{placeholder}` awareness | must be enforced via prompt + validator |
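The "enforced via prompt + validator" cell deserves a concrete shape: a deterministic check that i18n message variables survive translation verbatim. A minimal sketch (the regex assumes `{snake_case}`-style placeholders; real formats like ICU MessageFormat need a fuller parser):

```python
import re

# Matches simple i18n variables like {vehicle_type}; illustrative, not exhaustive.
PLACEHOLDER_RE = re.compile(r"\{[a-zA-Z_][a-zA-Z0-9_]*\}")

def placeholders_preserved(source: str, translation: str) -> bool:
    """True if every message variable in the source appears verbatim in the
    translation. Sorted comparison: word order legitimately differs across
    languages, but the multiset of placeholders must not."""
    return sorted(PLACEHOLDER_RE.findall(source)) == sorted(
        PLACEHOLDER_RE.findall(translation)
    )
```

A validator like this sits alongside the LLM judge: it is cheap, deterministic, and catches a failure class (dropped or mangled variables) that prose-level evaluation can miss.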

For Lyft's UX-string case, the structural + brand-voice controls dominate — LLM wins despite the latency/cost premium, because iteration is done offline per string, not per user request.

What LLM-MT inherits from general LLM architecture

  • concepts/self-approval-bias — single-model LLM-MT where the same model generates and evaluates is biased toward approving its own output. Lyft's post is the canonical articulation: use different models/agents for the two roles.
  • concepts/structured-output-reliability — LLM-MT outputs that go into a programmatic consumer (translation store, validation pipeline, i18n tooling) are subject to the same "malformed = fully incorrect" discipline as judge outputs.
  • concepts/pydantic-structured-llm-output — in Python LLM-MT pipelines, Pydantic schemas are the contract surface between generator and evaluator agents.
  • concepts/drafter-expert-split — Lyft's fast-drafter / capable-evaluator pair is the same two-model primitive, applied at the translation-task layer rather than the inference layer where speculative decoding lives.
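The Pydantic-as-contract point can be made concrete with a minimal sketch. Field names here are illustrative, not Lyft's actual schema; the point is that a malformed evaluator response fails validation loudly instead of flowing downstream, the "malformed = fully incorrect" discipline.

```python
from pydantic import BaseModel, Field

class Candidate(BaseModel):
    """Drafter output: one candidate translation (illustrative schema)."""
    translation: str
    target_locale: str

class Evaluation(BaseModel):
    """Evaluator output: rubric scores plus a pass/fail verdict."""
    accuracy: int = Field(ge=1, le=5)
    brand_alignment: int = Field(ge=1, le=5)
    passed: bool
    feedback: str = ""  # corrective critique for the refine step

# Parsing the judge's raw JSON through the schema rejects malformed or
# out-of-range output before it reaches the translation store.
raw = '{"accuracy": 5, "brand_alignment": 4, "passed": true}'
ev = Evaluation.model_validate_json(raw)
```

Both agents speak through these models, so the generator/evaluator boundary is a typed interface rather than free-form text.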

Seen in

  • sources/2026-02-19-lyft-scaling-localization-with-ai — canonical wiki instance; Lyft names the three failure modes of one-shot LLM-MT and ships the iterative architecture as the response. Industry-wide this shape is broadly adopted (internal systems at many large consumer products) but Lyft's post is the first on-wiki architectural write-up.