

LLM-only code migration pipeline

Pattern

For bulk code-migration tasks where two APIs, libraries, or frameworks differ in ways a human can enumerate, but authoring a codemod for every edge case would be prohibitively expensive, encode the differences as a frozen prompt (interface + mapping + examples) and transform each file with a single LLM call — no AST pre-pass, no runtime judge-LLM loop.
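The single-call shape can be sketched as follows. `FROZEN_PROMPT` and `call_model` are illustrative placeholders, not Zalando's published toolkit; the prompt sections just mirror the interface + mapping + examples composition:

```python
# Sketch of an LLM-only migration call: the frozen prompt (interface +
# mapping + examples) is the static, cacheable prefix; only the file
# content varies per request. FROZEN_PROMPT's sections are placeholders.

FROZEN_PROMPT = """\
INTERFACE: return the migrated file, nothing else.
MAPPING: OldLib.Button -> NewLib.Button, OldLib.Headline -> NewLib.Heading, ...
EXAMPLES: <before/after component pairs go here>
"""

def build_messages(file_content: str) -> list[dict]:
    """One request per file: static prefix first (cacheable), file last."""
    return [
        {"role": "system", "content": FROZEN_PROMPT},
        {"role": "user", "content": file_content},
    ]

def migrate_file(file_content: str, call_model) -> str:
    """Single LLM call, no AST pre-pass, no judge-LLM loop.
    `call_model` stands in for e.g. a chat-completion call at temperature=0."""
    return call_model(build_messages(file_content))
```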

Forces

  • Codemod exhaustion. Per-edge-case AST rules are costly to author and brittle to maintain. If the source and target APIs have dozens of parallel concepts each with minor naming differences, codemod surface area explodes.
  • LLM contextual intelligence. Modern LLMs (GPT-4o in 2024) can handle syntactic variations a strict codemod can't (e.g. `const Header = Typography.Headline;` — alias resolution across bindings). See concepts/encoded-domain-expertise.
  • Limited domain. For a specific library-to-library migration (not a general framework rewrite), the transformation space is bounded — the mapping fits in a prompt.
  • Visual-equivalence mapping can't be derived from source code. AST codemods have the same problem; they solve it by encoding the mapping into explicit rules. LLM-only pipelines encode the same mapping into the prompt's mapping layer; the authoring cost is roughly equivalent, but the application layer (LLM inference vs codemod execution) differs dramatically in coverage of long-tail edge cases.

Mechanism

  1. Offline prompt development. Iterate on sample components through offline prompt experiments until accuracy on the sample set is acceptable.
  2. Freeze the prompt. The converged structure is Interface + Mapping + Examples (see concepts/prompt-interface-mapping-examples-composition) plus a system prompt fixing the model's role and output format.
  3. Partition components into logical groups sized to keep each group's prompt in the accuracy sweet spot (Zalando: 40–50K tokens; see concepts/logical-component-grouping-for-context-budget).
  4. For each file × each relevant group: send the static prefix (cacheable) + the dynamic file content; parse the output from the opaque fence (see concepts/opaque-output-format-fencing); write the file.
  5. Temperature=0 for reproducibility (see concepts/temperature-zero-for-deterministic-codegen).
  6. Prompt-regression tests in CI (see concepts/llm-generated-prompt-regression-test) gate prompt / model changes.
  7. Human code review + visual testing catches the residual failures. "Code reviews and thorough visual testing would be needed for catching subtle issues that LLMs might introduce."
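Steps 3–5 can be sketched as a per-file loop. The fence marker, the `relevant` predicate, and the `call_model` signature are assumptions for illustration; the real toolkit's internals aren't public:

```python
import re
from typing import Callable

# Hypothetical opaque fence marker (the actual marker is not published).
FENCE = "%%MIGRATED_FILE%%"
FENCE_RE = re.compile(re.escape(FENCE) + r"\n(.*?)\n" + re.escape(FENCE),
                      re.DOTALL)

def parse_fenced_output(raw: str) -> str:
    """Extract the migrated file from the opaque fence; fail loudly if absent."""
    match = FENCE_RE.search(raw)
    if match is None:
        raise ValueError("model output missing opaque fence")
    return match.group(1)

def run_pipeline(files: dict[str, str],
                 groups: dict[str, str],          # group name -> static prefix
                 relevant: Callable[[str, str], bool],
                 call_model: Callable[[str, str], str]) -> dict[str, str]:
    """For each file x each relevant group: send the static prefix (cacheable)
    plus the dynamic file content, parse the fenced output, carry it forward.
    `call_model` stands in for a temperature=0 chat completion."""
    out = {}
    for path, content in files.items():
        for group, prefix in groups.items():
            if not relevant(path, group):
                continue
            raw = call_model(prefix, content)
            content = parse_fenced_output(raw)
        out[path] = content
    return out
```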

Contrast with hybrid and whole-rewrite patterns

| Pattern | AST layer? | LLM role | Scale |
|---|---|---|---|
| LLM-only (this pattern) | No | Transform + resolve edge cases | Bulk file-level migration |
| AST+LLM hybrid | Yes | Residue cleanup + hallucination control | Bulk file-level migration |
| AI-driven framework rewrite | No | Full implementation from spec | Whole-codebase rewrite |

Zalando chose LLM-only over AST+LLM not because AST-based codemods were evaluated and rejected — the article doesn't mention a deep codemod evaluation — but because their iteration-1 experiment pointed at structural prompt discipline (pre-compute interface + mapping) as cheaper than writing and maintaining an AST-based rule set over two divergent libraries. The load-bearing decision is that the mapping layer (authored by humans) is the hard part, and it's equally hard whether you apply it via LLM or AST. LLM inference covers the long tail better; AST execution is deterministic and faster.

Consequences

Positive:

  • Low codebase footprint. No AST toolchain to maintain. Zalando's tool is a Python script wrapping one library.
  • Long-tail edge case coverage. LLM handles patterns codemods would miss (alias resolution, styling variance, import normalisation).
  • Prompt is the config. Updating for new library versions is a prompt edit, not a rule-code change.
  • Cheap at small scale. ~$40/repo for a 15-app Zalando migration.

Negative:

  • ~90% accuracy, not 100%. The residual 10% lands in manual review. For a 15-app migration this is tractable; for an open-source library migration with 10,000 downstream users, the per-user tail becomes a support burden.
  • Visual-equivalence information still has to come from humans. No free lunch — the pattern just moves the work from AST-rule-authoring to mapping-verification.
  • "Moody" LLM behaviour persists. Temperature=0 helps but doesn't eliminate provider-side variance.
  • Per-file cost scales linearly. Unlike codemods (once written, near-free to run), every file costs GPT-4o tokens. Prompt caching helps significantly.
  • Quality ceiling is model-dependent. The 90% is a GPT-4o-in-2024 number. Future models likely raise it; older models would do worse.
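The prompt-regression gate from mechanism step 6 is one mitigation for the variance and model-dependence above: pin the prompt and model, re-run sample files, and diff against checked-in golden outputs. A minimal sketch, assuming a golden-file comparison (`migrate_file` is a hypothetical wrapper around the temperature=0 call):

```python
def check_prompt_regression(migrate_file,
                            samples: dict[str, str],
                            golden: dict[str, str]) -> list[str]:
    """Re-run the frozen prompt over pinned sample sources and return the
    names of any samples whose output drifted from the golden version.
    An empty list means the prompt/model change passes the gate."""
    return [
        name
        for name, source in sorted(samples.items())
        if migrate_file(source) != golden.get(name)
    ]
```

In CI this runs on every prompt or model change; a non-empty result blocks the merge until the golden outputs are reviewed and re-approved.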

Production exemplar

Zalando's Component Migration Toolkit is the canonical wiki instance — applied across 15 B2B applications at the Partner Tech department, September 2024 onward.
