
Zalando Component Migration Toolkit

Definition

Zalando's Component Migration Toolkit is a Python-based internal tool that the Zalando Partner Tech department built in September 2024 to migrate 15 B2B applications from one in-house UI component library to another. It wraps the Datasette llm Python library's conversation API around GPT-4o, using a structured system prompt that enforces the <updatedContent> output-fencing contract, temperature=0 for determinism, and a static-prefix-plus-dynamic-suffix prompt layout engineered for prompt-cache hits.

Architecture (as disclosed)

  • Input: a source directory path. The tool walks files under that root and, for each file, invokes the migration request.
  • Static prefix (cacheable): the system prompt plus, for each component in the current logical group (form / core / …), the component interface, transformation mapping, and migration examples.
  • Dynamic suffix: the <file>{file_content}</file> block specific to the file being transformed.
  • Model: GPT-4o at temperature=0.
  • Output contract: the LLM must return the transformed file inside <updatedContent>…</updatedContent> with no other text; a downstream parser strips the fence and writes the file.
  • Truncation recovery: if the response hits the 4K output-token limit, the toolkit sends the literal prompt "continue" on the same conversation and concatenates the completion. "A simple 'continue' prompt proved more reliable than more complex prompts to continue the transformation" (Source: sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries).
  • Context budget discipline: components are partitioned into logical groups ("form, core, etc.") sized so that one group's prompt stays at 40–50K context tokens, empirically the accuracy sweet spot. The tool is run once per group per file, not once per entire library.
  • Prompt-regression tests: an LLM-generated example library (the same examples that ride in the prompt) is replayed through the pipeline in CI; any divergence from the golden output signals prompt drift.
  • Authoring workflow: prompts were assembled via continue.dev — its IDE integration automates attaching source files to a prompt context, which "improved our workflow" over manual copy-paste.

Operational envelope

  • Processing time: 30–200 s per file.
  • Cost: roughly < $40 per repository under GPT-4o pricing, assuming an average of 45K prompt / 2K output tokens per file and ~10 component groups × ~3 components × ~30 files per group. Prompt caching reduces this further.
  • Accuracy: ~90% overall across low/medium/high complexity; "even higher accuracy for components of low to medium complexity."
  • Scope: 15 sophisticated B2B applications across the Partner Tech department.
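The <$40 figure is reproducible from the stated averages. A back-of-envelope check, assuming gpt-4o-2024-08-06 list pricing of $2.50 per 1M input tokens and $10.00 per 1M output tokens (the pricing tier is an assumption, not stated in the source) and ignoring caching discounts:

```python
# Illustrative cost check for one repository; pricing rates are assumed.
INPUT_PER_M = 2.50    # USD per 1M prompt tokens (assumed gpt-4o-2024-08-06 rate)
OUTPUT_PER_M = 10.00  # USD per 1M output tokens (assumed)

groups = 10            # ~10 logical component groups
files_per_group = 30   # ~30 files processed per group
prompt_tokens = 45_000 # average prompt tokens per per-file call
output_tokens = 2_000  # average output tokens per per-file call

calls = groups * files_per_group                          # 300 per-file requests
input_cost = calls * prompt_tokens * INPUT_PER_M / 1e6    # ≈ $33.75
output_cost = calls * output_tokens * OUTPUT_PER_M / 1e6  # ≈ $6.00
total = input_cost + output_cost                          # ≈ $39.75, under the $40 ceiling
```

Prompt caching would discount the repeated static prefix, pulling the real spend below this estimate.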

Why LLM-only, not AST+LLM

Zalando's iteration-1 experiment handed raw source code to the LLM and failed: it asked the model to perform "multiple complex intermediary steps" in a single pass. Rather than inserting an AST codemod layer (the Slack Enzyme→RTL AST+LLM hybrid route), Zalando pre-computed the interface and mapping offline through four more experiment rounds and froze them into the prompt. The human effort (mapping verification by pair programmers and designers) that would have gone into AST rule authoring went into visual-equivalence verification instead. See concepts/visual-equivalence-mapping for why: the LLM "couldn't visualize how components are rendered", and design-intent information is not recoverable from source code alone.

The toolkit is therefore the canonical wiki instance of the LLM-only code-migration pipeline pattern — an alternative to the AST+LLM hybrid for organisations where the source-target delta is small enough to encode as text (interface + mapping + examples) but large enough that codemods are expensive to author per edge case.

Tradeoffs / limits

  • No visual-equivalence verification. The toolkit migrates the code but cannot verify the rendered output. Grid-column ratio mismatches (12 vs 24) produce syntactically correct but visually broken pages; human review catches these.
  • "Moody" residual variance. Temperature=0 gives reproducibility for a fixed input, but "LLM tools occasionally produced inconsistent outputs. These issues appeared without any clear reason, sometimes simply by rerunning the same prompt on the same file at a different time" — provider-side non-determinism that the toolkit can't control.
  • Processing time discourages rapid iteration. 30–200 s per file makes "conducting quick, small-scale experiments more challenging", which is why the LLM-generated example library is replayed in CI rather than re-run per prompt change.
  • GPT-4o is the only disclosed backend (September 2024 timeframe). No multi-provider abstraction described.
  • Test migration is partially outside scope. "Difficulties in migrating test suites due to inconsistent practices" was named as a project-specific challenge requiring manual work.
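The CI replay of the LLM-generated example library, mentioned above and under prompt-regression tests, amounts to a golden-file comparison. A minimal sketch, assuming an illustrative layout of one directory per example holding input.tsx and golden.tsx, with run_pipeline standing in for the real LLM round-trip:

```python
from pathlib import Path
from typing import Callable


def find_drift(run_pipeline: Callable[[str], str], examples_dir: Path) -> list[str]:
    """Replay every prompt example through the pipeline and return the names
    of examples whose output diverges from the stored golden file — a
    non-empty result signals prompt drift."""
    drifted = []
    for case in sorted(p for p in examples_dir.iterdir() if p.is_dir()):
        migrated = run_pipeline((case / "input.tsx").read_text())
        if migrated != (case / "golden.tsx").read_text():
            drifted.append(case.name)
    return drifted
```

In CI this would run with the real migration call and fail the build on any non-empty result; re-running it only on prompt changes sidesteps the 30–200 s per-file latency during day-to-day iteration.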
