ZALANDO 2025-02-19

Zalando — LLM-powered migration of UI component libraries

Summary

Zalando's Partner Tech department (B2B applications for retail partners) had accumulated two distinct in-house UI component libraries across 15 sophisticated B2B applications. This fragmentation caused inconsistent partner UX, duplicated design effort, and maintenance burden. In September 2024 they built a Python-based LLM-powered migration tool on the llm library's conversation API, using GPT-4o as the transformation engine, and completed the migration across the application fleet. The post canonicalises the iterative prompt-engineering methodology (five experiments on sample components) that converged on the final prompt shape — interface + mapping + examples — plus the production tool's engineering details (token-limit recovery, temperature zero, prompt-caching layout, logical component grouping, automated prompt-regression tests) and production numbers (~90% accuracy on medium-complexity components, ~$40 per repository via GPT-4o pricing, 30–200 seconds per file).

It is the wiki's canonical instance of LLM-only code migration (no AST pre-pass) at production scale — a contrast-pair partner to sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library's AST+LLM hybrid and to the Cloudflare-vinext AI-driven framework rewrite — and the canonical instance of offline prompt-engineering methodology as a first-class engineering activity distinct from runtime iterative prompt refinement loops.

Key takeaways

  1. Five-iteration methodology discovery, not a five-shot runtime loop. Zalando's iterations were offline experiments on sample components during an internal hackathon — each experiment evaluated a prompt strategy (source code only → interface → interface+mapping → manually-verified mapping → +examples) and the next iteration was human-authored based on the previous failure. This is distinct from iterative prompt refinement (Instacart PIXEL / Lyft AI-localization), where a judge LLM feeds failure signal back into a generator LLM at inference time. Zalando's loop is methodology discovery; the runtime loop is a single-shot migration with a frozen prompt. Both are iterative, but at different timescales.

  2. Interface + Mapping + Examples is the load-bearing prompt composition. The three-layer structure converged through the experiments:

     • Interface: LLM-generated, typed attribute list for each component in both libraries ("type: 'filled' | 'outlined' | 'link'").
     • Mapping: explicit transformation rules per attribute ("convert variant=primary or variant=default to type='filled'"). Auto-generated from the interface, then manually verified — the visual-equivalence step that source code alone can't supply (see takeaway 3).
     • Examples: input/output code samples with migration notes explaining the non-obvious mappings.

     Dropping any layer degrades accuracy — interface alone "lacked essential information" (iteration 2, low accuracy), interface+mapping hit "medium accuracy" with wrong auto-inferred mappings (iteration 3), manual verification of mappings improved basic components (iteration 4), but complex components needed the examples layer (iteration 5, "high accuracy"). "Through this series of iterative experiments, we were able to finalize our approach."

  3. Source code cannot reveal design intent; human visual verification is the load-bearing step. The iteration-3 failure is named explicitly: "for the button component, LLM created direct size mappings (converting 'medium' sized button to 'medium'), when in reality, a 'medium' button in the original library was visually equivalent to a 'large' button in the new library." The LLM could see the code but could not see the render. Three reasons given: "(1) Source code cannot reveal all information, e.g. design intent or visual relationships; (2) The LLM couldn't visualize how components are rendered; (3) Different libraries implement similar concepts (like 'medium' size) differently." Fix: pair programmers + designers verify the auto-generated mapping against rendered outputs before freezing it into the prompt.

  4. Token-limit recovery via conversation API continue. Files exceeding the 4K token output limit were truncated mid-transformation. Zalando resolved this by using the conversation API and "passing 'continue' as a prompt whenever the content was cut off. This allowed the LLM to pick up where it left off and complete the transformation. As per our tests, a simple 'continue' prompt proved more reliable than more complex prompts to continue the transformation." Counter-intuitive finding: the minimal nudge outperformed elaborate re-instruction.

  5. Temperature=0 for deterministic code transformation. "Initially, we noticed varying outputs for the same input, making testing and validation challenging. Changing LLM settings, like setting the temperature parameter to 0 made the LLM's output to be more deterministic and reproducible." Reproducibility is the load-bearing property at migration-toolkit altitude — without it, prompt-validation tests flake, regression detection is impossible, and the developer loop becomes adversarial.

  6. Output-format fencing via <updatedContent> tags. "You MUST return just the transformed file inside the <updatedContent> tag like: <updatedContent>transformed-file</updatedContent> without any additional data." The system prompt's last line is the extraction contract; downstream parser strips the fence and writes the file. Same shape as Slack's <code></code> fencing in the Enzyme→RTL codemod — an industry-convergent primitive for machine-extractable LLM output in code-migration pipelines.

  7. Prompt caching is a first-class architectural concern. Zalando structured the prompt to maximise prompt cache hits: "we set up a structured prompt format that maximized cache hits. The prompt was organized to have the static part like transformation examples at top and the dynamic part (the file content) at the end, ensuring caching can be leveraged while transforming different files." Static prefix (interface + mapping + examples per component group) is byte-stable across files; dynamic suffix (file content) varies. Cache hit on the prefix for the entire group of files amortises prefill cost — canonical instance of concepts/prompt-cache-consistency at the code-migration altitude.

  8. Logical component grouping keeps context tokens in the accuracy-sweet-spot. "We observed as the input prompt size grew, the transformation accuracy declined. To maintain high quality, we organized components into logical groups (like 'form', 'core', etc.), keeping context tokens between 40-50K per group of components. This grouping strategy helped maintain the LLM's focus and improved transformation accuracy." The toolkit is run once per group per file, not once per entire library. Context-budget-as-accuracy-lever — sibling to context engineering's budget-allocation framing at the agent altitude, applied here at a single-shot transformation altitude.

  9. LLM-generated examples double as automated prompt-regression tests. "During development, we discovered that small adjustments to transformation instructions could lead to substantial changes in results. This highlighted a need to have prompt validation tests in place and led us to implement automated testing using LLM-generated examples. These examples served as validation tools and regression tests, helping us catch unexpected changes during the migration process." The same example library used in the prompt (takeaway 2) is replayed through the pipeline in CI to catch prompt drift. Explicit, named recognition that prompts are code and need tests — a concept Yelp surfaces in query understanding and Slack's Enzyme codemod gestures at but does not canonicalise.

  10. Production numbers. ~90% accuracy overall (higher on low/medium complexity, lower on components requiring substantial restructuring). ~$40 per code repository via GPT-4o pricing, using rough averages of 45K prompt tokens and 2K output tokens per file, ~10 component groups with ~3 components each, ~30 files per component group. 30–200 seconds per file processing time — not blocking for background batch runs but "made conducting quick, small-scale experiments more challenging." Prompt caching can further reduce cost. Scope: 15 sophisticated B2B applications.

  11. LLM strengths vs codemods: the contextual-intelligence argument. "LLMs have a good understanding of the different elements of code and their relationships. This was very useful in handling different edge cases encountered during the migration. This is a powerful capability, compared to traditional alternatives like codemods, where we need to explicitly code every edge case." Worked example: import {Typography} from …; const Header = Typography.Headline; <Header></Header> — the LLM correlates Typography.Headline and Header as the same thing; a codemod needs an explicit alias-resolver rule for that pattern. LLMs "were able to fill in the gaps in instructions based on provided examples and context" including correct default values when instructions were silent.

  12. LLM limits: no visual understanding, "moody" outputs, time cost. (a) "LLMs are unable to verify visual implications of the changes when migrating between design systems with different fundamental units. In our case, the source and target libraries had differences like different spacing scales and grid systems (12 vs 24 columns). This limitation meant that while a page could be syntactically migrated correctly, the layout may appear broken upon deployment." The 12→24 grid-column ratio requires a human to decide round up (breaking two-line layouts) vs round down (extra whitespace) — neither is universally correct. (b) "Moody behaviour": "LLM tools occasionally produced inconsistent outputs. These issues appeared without any clear reason, sometimes simply by rerunning the same prompt on the same file at a different time." (c) Processing time makes quick iteration hard.
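The continue-recovery and fence-extraction mechanics above reduce to a small per-file driver loop. A minimal sketch, assuming a `conversation` callable stands in for the llm library's conversation API (all names here are illustrative, not Zalando's code):

```python
import re

# Downstream extraction contract: the transformed file arrives fenced
# in <updatedContent> tags, per the system prompt's last line.
FENCE = re.compile(r"<updatedContent>(.*?)</updatedContent>", re.DOTALL)

def transform_file(conversation, file_content, max_rounds=10):
    """Drive one file through the migration prompt, recovering from
    output-token truncation with a bare "continue" nudge.

    `conversation` is any callable (prompt: str) -> (text: str,
    truncated: bool); temperature=0 lives inside that callable,
    not in this sketch.
    """
    text, truncated = conversation(f"<file>\n{file_content}\n</file>")
    output = text
    rounds = 0
    while truncated and rounds < max_rounds:
        # Per the post, a simple "continue" proved more reliable than
        # elaborate re-instruction when the 4K output limit was hit.
        text, truncated = conversation("continue")
        output += text
        rounds += 1
    match = FENCE.search(output)
    if match is None:
        raise ValueError("model output missing <updatedContent> fence")
    return match.group(1)
```

A stub conversation that truncates once shows the stitching: the second chunk is appended to the first before the fence is stripped, so a fence split across responses still extracts cleanly.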

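The 40–50K context-token sweet spot can be checked mechanically before running a group. A sketch under stated assumptions — the post groups by domain ("form", "core"), and the chars-per-token estimate is a common rule of thumb, not the tokenizer Zalando used:

```python
def within_context_budget(texts, low=40_000, high=50_000, chars_per_token=4):
    """Rough check that a component group's prompt material lands in the
    40-50K-token range the post reports as the accuracy sweet spot.

    Token count is approximated as total characters / chars_per_token,
    which is an illustrative heuristic, not an exact tokenizer.
    """
    tokens = sum(len(t) for t in texts) // chars_per_token
    return low <= tokens <= high
```

A group whose interface + mapping + examples material estimates to ~45K tokens passes; a tiny or oversized group fails and should be re-partitioned.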
Operational numbers

Quantity                             | Value         | Notes
-------------------------------------|---------------|------------------------------------
Applications in scope                | 15            | B2B partner-tech apps
Component groups                     | ~10           | logical groupings (form, core, …)
Components per group (avg)           | ~3            |
Files transformed per group (avg)    | ~30           |
Prompt tokens per file (avg)         | ~45K          | pre-cache, per group
Output tokens per file (avg)         | ~2K           | after fencing extraction
Context-budget sweet spot per group  | 40–50K tokens | above which accuracy declines
Output token limit (per request)     | 4K            | recovered via "continue" prompt
Temperature                          | 0             | for determinism
Per-file processing time             | 30–200 s      |
Per-repository cost (rough)          | < $40         | before prompt caching
Accuracy (overall, all components)   | ~90%          | headline number
Accuracy (low / medium complexity)   | > 90%         | "even higher"
Backend                              | GPT-4o        | September 2024, post-GPT-4o launch
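The sub-$40 figure is consistent with the per-file averages above. A back-of-envelope check, assuming GPT-4o's 2024 list pricing of $2.50 per 1M input tokens and $10.00 per 1M output tokens (the rates are my assumption — the post does not state them):

```python
# Rough averages from the post: ~10 groups x ~30 files = ~300 files per
# repository, ~45K prompt tokens and ~2K output tokens per file.
files = 10 * 30
input_cost = files * 45_000 * 2.50 / 1_000_000    # prompt tokens, pre-cache
output_cost = files * 2_000 * 10.00 / 1_000_000   # fenced output tokens
total = input_cost + output_cost                  # just under the reported $40
```

Prompt caching would discount the repeated static prefix within each group, pulling the input-side cost below this estimate.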

Architecture (as described)

 ┌──────────────────────────────────────────────────────────┐
 │ Migration Toolkit (Python)                                │
 │                                                          │
 │  per component group (form / core / …):                  │
 │                                                          │
 │   ┌─────────────── STATIC PREFIX (cacheable) ─────────┐  │
 │   │ system prompt (role + output format)              │  │
 │   │ for each component in the group:                  │  │
 │   │   • interface_details                             │  │
 │   │   • mapping_instruction                           │  │
 │   │   • examples                                      │  │
 │   └───────────────────────────────────────────────────┘  │
 │                                                          │
 │   ┌──────────── DYNAMIC SUFFIX (per file) ────────────┐  │
 │   │ <file>                                            │  │
 │   │  {file_content}                                   │  │
 │   │ </file>                                           │  │
 │   └───────────────────────────────────────────────────┘  │
 │              │                                           │
 │              ▼                                           │
 │   OpenAI Chat Completions API (GPT-4o, temperature=0)    │
 │              │                                           │
 │              ▼                                           │
 │   response?                                              │
 │     ├─ complete → strip <updatedContent> → write file    │
 │     └─ truncated → conversation.send("continue") → loop  │
 │                                                          │
 │   automated prompt-regression tests: LLM-generated       │
 │   example library replayed in CI                         │
 └──────────────────────────────────────────────────────────┘
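The static-prefix / dynamic-suffix split in the diagram can be sketched as a two-step prompt builder: per-group material is assembled once and stays byte-stable, so only the suffix varies between files. The layout follows the post; the helper names and dict keys are illustrative:

```python
def build_group_prefix(system_prompt, components):
    """Assemble the cacheable static prefix for one component group.

    `components` is a list of dicts carrying the three prompt layers
    per component: interface_details, mapping_instruction, examples.
    """
    parts = [system_prompt]
    for c in components:
        parts += [c["interface_details"], c["mapping_instruction"], c["examples"]]
    return "\n\n".join(parts)

def build_prompt(group_prefix, file_content):
    # Static part first, file content last, so the provider's prompt
    # cache can hit on the identical prefix for every file in the group.
    return f"{group_prefix}\n\n<file>\n{file_content}\n</file>"
```

Because every file in a group shares the same `group_prefix` bytes, the prefill for that prefix is paid once and amortised across the group.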

Tooling substrate: continue.dev for streamlining "the workflow of attaching source codes and generating prompt context"; the llm Python library's conversation API as the OpenAI wrapper.
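The prompt-regression idea — replaying the LLM-generated example library through the pipeline to catch prompt drift — reduces to a diff harness. A minimal sketch; the (input, expected) pair format is my assumption about how the examples are stored:

```python
def run_prompt_regression(examples, transform):
    """Replay stored examples through the pipeline and report drift.

    `examples` is a list of (input_code, expected_output) pairs;
    `transform` is the migration function under test. Temperature=0
    makes outputs reproducible enough for exact comparison.
    """
    failures = []
    for source, expected in examples:
        actual = transform(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures
```

Run in CI after any prompt edit: an empty failure list means the prompt change left known-good transformations intact; a non-empty list flags the "small adjustments, substantial changes" hazard the post describes.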

Caveats / absences

  • No disclosed false-positive / regression rate on the unhappy-path 10%. Human code review catches it; operational cost of the review step isn't disclosed.
  • No disclosed migration duration — only cost and per-file time. Does not say how many weeks / months the 15-app migration took end-to-end.
  • Project-specific challenges required manual work: "differences in design philosophies of the two UI component libraries, difficulties in migrating test suites due to inconsistent practices, gaps in feature availability between the libraries, and variations in codebases and styling practices across applications. These challenges often required significant manual work and refactoring, as LLMs could not handle such complex transformations accurately." The 90% accuracy is within the transformable scope, not over the total migration work.
  • Visual regression testing is named as necessary but not described. "For example, code reviews and thorough visual testing would be needed for catching subtle issues that LLMs might introduce." No screenshot-diff / Playwright / Chromatic pipeline disclosed.
  • No AST layer. Zalando chose LLM-only; the Slack Enzyme→RTL project chose AST+LLM hybrid. Zalando does not directly compare, but iteration 1 (source-code-only) failing with "multiple complex intermediary steps" is an implicit acknowledgement of the AST-shaped problem — they solved it by pre-computing the interface + mapping offline (takeaway 2) rather than with a codemod.
  • GPT-4o is the only backend. Article timeframe is September 2024; no disclosure of whether later models (GPT-4.1, o-series, Claude 3.5) were tested, and the 90% / $40 numbers are specifically GPT-4o in 2024.
  • "Moody" LLM behaviour. Temperature=0 helps but does not eliminate run-to-run variance. No quantification of residual variance.
