ZALANDO 2024-09-17

Zalando — Content Creation Copilot: AI-assisted product onboarding

Summary

Zalando (2024-09-17) documents the architecture and early production results of its Content Creation Copilot — an internal system that auto-generates product-attribute suggestions during the article-onboarding workflow, reducing the ~25% of the content-production timeline that was previously spent on manual copywriting + four-eyes QA. The copilot is a thin orchestration layer between Zalando's existing Content Creation Tool (the copywriter-facing UI), Article Masterdata (the system-of-record for Zalando's internal attribute codes and per-article-type attribute sets), a Prompt Generator service (which materialises natural-language prompts from the attribute schema), and a multi-modal LLM backend (OpenAI GPT-4 Turbo at launch, migrated to GPT-4o as soon as that model shipped). Suggestions appear in the Content Creation Tool as pre-selected attribute values marked with a purple dot — an explicit design choice to (a) shift the human's workload from enrichment-then-QA to QA-only and (b) keep the final decision auditable with a visible AI-provenance indicator. The post discloses 75% suggestion accuracy and ~50,000 attributes enriched per week across 25 markets. Three architecturally interesting decisions: (1) a translation layer that converts Zalando's opaque attribute codes (assortment_type_7312) into human-readable English (Petite) on the way into GPT and back again into codes on the way out, so the model sees language, not identifiers; (2) a category-to-attribute relevance mapping that suppresses suggestions for attributes that shouldn't be filled for certain product types (GPT accuracy on those was empirically poor); (3) an aggregator service framing that explicitly plans for multiple suggestion sources (OpenAI + partner data + brand dumps + fine-tuned models) behind one copilot contract, preserving the option to swap or cascade providers without changing the Content Creation Tool.

Key takeaways

  1. Manual enrichment was a measured 25% tax on the content-production timeline. Zalando profiled the onboarding workflow and found "the manual process contributed to approximately 25% of the overall content production timeline" — specifically the copywriter-enriches-then-four-eyes-QAs step. That number is what justified building a copilot rather than iterating on the existing UI: the problem was not UX friction, it was raw minutes per SKU across a 25-market catalog.

  2. The copilot is a four-service composition, not a model wrapper. The architecture explicitly names four components: Content Creation Tool (upload images, receive + auto-select suggestions), Article Masterdata (attribute codes and attribute-set definitions per article type), Prompt Generator (materialises prompts from Masterdata and image URLs), and OpenAI-GPT (the LLM backend). The Prompt Generator is the load-bearing component: it is the thing that turns a schema + an image into a model call and the thing that turns a model response back into a structured attribute suggestion. (Source: this article.)

  3. Opaque attribute codes force a translation layer in both directions. Zalando stores attributes as codes like assortment_type_7312, assortment_type_7841 — identifiers that mean nothing to a language model. The prompt generator translates codes to English on the way in (assortment_type_7312 → Petite, assortment_type_7841 → Tall), asks GPT to pick from the English values, then translates the English response back to codes before writing to the Content Creation Tool. "We built a translation layer that converts OpenAI output into information directly usable by Zalando and discards the part that is not relevant." This is a generalisable requirement whenever an LLM is wedged into a system whose internal vocabulary is opaque identifiers, not domain language. See concepts/opaque-attribute-code-translation-layer.
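
The two-way translation can be sketched minimally as a pair of lookups, assuming a flat one-to-one code↔label mapping (the two codes mirror the post's examples; the helper names are illustrative, not Zalando's API):

```python
# Forward/reverse translation between opaque attribute codes and the
# English labels the model actually sees. One-to-one mapping assumed.
CODE_TO_LABEL = {
    "assortment_type_7312": "Petite",
    "assortment_type_7841": "Tall",
}
LABEL_TO_CODE = {label: code for code, label in CODE_TO_LABEL.items()}

def to_prompt_values(codes):
    """Forward path: opaque codes -> English labels for the prompt."""
    return [CODE_TO_LABEL[c] for c in codes if c in CODE_TO_LABEL]

def from_model_output(labels):
    """Reverse path: model output -> codes; unmappable output is discarded,
    per the post's "discards the part that is not relevant"."""
    return [LABEL_TO_CODE[l] for l in labels if l in LABEL_TO_CODE]
```

The discard-on-reverse step doubles as a cheap output validator: anything GPT invents outside the known label vocabulary simply never reaches the Content Creation Tool.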

  4. Purple-dot pre-selection is a deliberate UX disclosure pattern, not a cosmetic choice. Suggested attributes are "pre-selected and marked with a purple dot to make users aware that these attributes were auto-suggested". Two properties: pre-selected means the default path is accept-the-AI — the human's work shifts from entering values to reviewing them (which was the existing QA altitude anyway, so the workflow doesn't demand new muscle); visually distinguished means the AI origin is never hidden, preserving the four-eyes principle and the auditability story. See patterns/pre-select-ai-suggestions-with-visual-disclosure and concepts/ai-provenance-ui-indicator.
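
The two properties — accept-by-default and visible provenance — amount to two flags on the suggestion payload. A hypothetical sketch (field names are assumptions, not Zalando's schema):

```python
from dataclasses import dataclass

@dataclass
class AttributeSuggestion:
    """Suggestion record as the UI might receive it: `preselected` drives
    the accept-the-AI default, `ai_suggested` drives the purple-dot marker,
    so AI origin survives all the way to the human reviewer."""
    attribute_code: str
    value_code: str
    preselected: bool = True
    ai_suggested: bool = True
```

Keeping provenance on the record (rather than inferring it in the UI) is what lets the four-eyes audit trail survive backend swaps.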

  5. Some attributes shouldn't be suggested at all; a category-to-attribute relevance mapping suppresses them. "Some attributes shouldn't be filled for certain types of articles according to the internal guidelines, and the accuracy of predicted suggestions for these attributes was often poor. To address this, we introduced a mapping layer between product categories and the relevant information that should be shown to the customer." This is a quality-through-scope-reduction lever: rather than trying to make the LLM smarter on attributes where it struggles, remove those attributes from the prompt for irrelevant categories entirely. See concepts/category-attribute-relevance-mapping.
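
The mapping layer reduces to a pre-prompt filter. A sketch with hypothetical category and attribute names (Zalando's real taxonomy is not disclosed):

```python
# Which attributes are relevant per product category; anything not listed
# is never put in front of the model for that category.
RELEVANT_ATTRIBUTES = {
    "dresses": {"neckline", "sleeve_length", "fit"},
    "sneakers": {"closure", "sole_material"},
}

def attributes_to_prompt(category, attribute_set):
    """Drop attributes irrelevant to this category BEFORE the prompt is
    built: the model is never asked, so it can never answer poorly."""
    relevant = RELEVANT_ATTRIBUTES.get(category, set())
    return [a for a in attribute_set if a in relevant]
```

Note the filter also directly cuts token spend, which is why the same mapping reappears in takeaway 9 as a cost remediation.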

  6. Model swap (GPT-4 Turbo → GPT-4o) was a cost + latency + accuracy net-positive the aggregator framing made cheap. "As GPT-4o was announced relatively early in the copilot's development, we initially performed a human inspection, comparing the accuracy of different sources for sample articles. The new model not only provided better results but also delivered faster response times and proved to be more cost-effective." The same aggregator contract (one copilot API, many backends) that enabled this swap is also the mechanism for future integrations with brand data dumps, partner contributions, and fine-tuned models. See patterns/model-agnostic-suggestion-aggregator.
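
The aggregator contract can be sketched as one interface with interchangeable backends; all class and method names below are assumptions about the shape of such a contract, not Zalando's actual API:

```python
from typing import Protocol

class SuggestionSource(Protocol):
    """One copilot contract, many backends (OpenAI, partner data, brand
    dumps, fine-tuned models)."""
    def suggest(self, image_url: str, attributes: list[str]) -> dict[str, str]: ...

class StubSource:
    """Stand-in for any backend; returns whatever values it knows."""
    def __init__(self, known: dict[str, str]):
        self.known = known
    def suggest(self, image_url, attributes):
        return {a: self.known[a] for a in attributes if a in self.known}

class Aggregator:
    """Cascade: earlier sources win, later sources fill the gaps.
    Swapping GPT-4 Turbo for GPT-4o is replacing one entry in `sources`;
    the Content Creation Tool never sees the difference."""
    def __init__(self, sources: list[SuggestionSource]):
        self.sources = sources
    def suggest(self, image_url, attributes):
        merged: dict[str, str] = {}
        for source in self.sources:
            for attr, value in source.suggest(image_url, attributes).items():
                merged.setdefault(attr, value)  # first answer wins
        return merged
```

Whether Zalando cascades or simply routes to one backend at a time is not disclosed; the point is that the swap cost is confined to the source list.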

  7. Input-image choice is a real lever on output quality. "Product-only front images delivering the best results, followed closely by front images featuring the products being worn by the model." Empirical image-quality ranking drives a selection policy inside the Prompt Generator — not every available catalog image is equally informative, and choosing the wrong one costs both accuracy and tokens. See concepts/input-image-selection-tradeoff.
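
A selection policy over image types is just a preference-ordered pick. A sketch following the post's empirical ranking (the type labels themselves are hypothetical names for catalog image categories):

```python
# Empirical preference order from the post: product-only front first,
# model-worn front a close second.
PREFERENCE = ["product_only_front", "model_worn_front"]

def pick_image(images):
    """images: list of (image_type, url) pairs. Return the URL of the most
    informative available image; fall back to the first image if no
    preferred type is present."""
    ranked = sorted(
        (img for img in images if img[0] in PREFERENCE),
        key=lambda img: PREFERENCE.index(img[0]),
    )
    if ranked:
        return ranked[0][1]
    return images[0][1] if images else None
```

Sending one well-chosen image instead of every catalog shot saves vision tokens on every call, which compounds at 50k attributes/week.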

  8. Balanced evaluation datasets overstate error on long-tail fashion vocabulary. "GPT-4o model tends to suggest general attributes like 'V-necks' or 'round necks' for 'necklines' correctly, but can be less precise when it comes to more fashion-specific ones, like 'deep scoop necks'. This issue is more noticeable when using balanced datasets (where there's an equal number of samples per attribute) compared to unbalanced ones (where the sample proportions reflect real-world trends)." The copilot's production accuracy on the real article distribution is higher than a balanced-eval number would suggest — a useful reminder that eval-set design changes the headline number on model quality, independent of the model itself.
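
The gap between the two headline numbers is pure weighting. A toy illustration with hypothetical per-value accuracies and shares for a single "neckline" attribute (none of these figures are from the post):

```python
# Per-value model accuracy and real-world frequency share (toy numbers).
per_value = {
    "v_neck":          {"accuracy": 0.90, "share": 0.60},
    "round_neck":      {"accuracy": 0.88, "share": 0.35},
    "deep_scoop_neck": {"accuracy": 0.40, "share": 0.05},
}

# Balanced eval: every value weighted equally, so the rare hard value
# drags the average down. Production eval: weighted by real frequency.
balanced = sum(v["accuracy"] for v in per_value.values()) / len(per_value)
production = sum(v["accuracy"] * v["share"] for v in per_value.values())
```

With these toy numbers, balanced accuracy is ~0.73 while production-weighted accuracy is ~0.87 — the same model, two very different headline numbers.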

  9. Cost pressure came from the long-tail, solved by scope reduction + model swap. "Reducing the infrastructure costs of suggestions generation, which were higher than expected. First, we stopped generating suggestions for some unsupported attribute sets. Second, we migrated to GPT-4o model, which significantly lowered costs." Neither remediation required new infra; both fall out of the aggregator + category-mapping design. Multi-attribute prompt batching (send N attributes in one call, covered in the Instacart wiki canonical patterns/multi-attribute-multi-product-prompt-batching) is a natural next cost lever the post doesn't yet name.
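
Multi-attribute batching — the next cost lever the post doesn't name — would fold N attribute questions into one call. A sketch under stated assumptions (prompt wording and attribute schema are hypothetical, not Zalando's template):

```python
import json

def batched_prompt(attributes):
    """One prompt covering N attributes instead of N separate calls."""
    lines = [
        f"- {a['name']}: choose one of {', '.join(a['values'])}"
        for a in attributes
    ]
    return (
        "Look at the product image and answer with a JSON object "
        "containing one key per attribute:\n" + "\n".join(lines)
    )

def parse_batched_response(text, attributes):
    """Parse the model's JSON and keep only values from the allowed sets,
    discarding hallucinated keys or out-of-vocabulary values."""
    allowed = {a["name"]: set(a["values"]) for a in attributes}
    raw = json.loads(text)
    return {k: v for k, v in raw.items() if v in allowed.get(k, set())}
```

The per-key validation on the way out plays the same role as the translation layer's discard step: batching must not weaken the output contract.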

Operational numbers

| Metric | Value | Source quote |
|---|---|---|
| Manual-enrichment share of content pipeline | ~25% | "approximately 25% of the overall content production timeline" |
| Production accuracy | ~75% | "We've achieved an accuracy rate of approximately 75%" |
| Attributes enriched / week | ~50,000 | "enriching around 50,000 attributes on average per week" |
| Markets served | 25 | "Zalando operates in 25 markets with different languages" |
| Backend model at launch | OpenAI GPT-4 Turbo | "we decided to use the OpenAI GPT-4 Turbo model" |
| Backend model after swap | OpenAI GPT-4o | "we migrated to GPT-4o model, which significantly lowered costs" |
| Best image type for suggestions | product-only front | "product-only front images delivering the best results" |
| Second-best image type | model-worn front | "followed closely by front images featuring the products being worn by the model" |
| Weak attribute class | fine-grained fashion (e.g. deep scoop neck) | "less precise when it comes to more fashion-specific ones" |

Architecture

Four-service decomposition for the end-to-end copilot flow:

 Copywriter
     │ (upload images)
┌─────────────────────┐
│ Content Creation    │  ◀── pre-selects suggested values
│ Tool (UI)           │       with purple-dot marker
└──────────┬──────────┘
           │  image URLs
┌─────────────────────┐   attribute codes + attribute sets
│ Prompt Generator    │◀────────────────────────── Article Masterdata
│  - code→English     │
│  - category filter  │
│  - image selection  │
└──────────┬──────────┘
           │ prompt + image URLs
┌─────────────────────┐
│ OpenAI GPT-4 Turbo  │
│   / GPT-4o          │
└──────────┬──────────┘
           │ English-labeled suggestions
┌─────────────────────┐
│ Prompt Generator    │  English → code translation
│  (reverse path)     │  + discard irrelevant output
└──────────┬──────────┘
           │ attribute codes
  Content Creation Tool

The post describes this as a "simplified current workflow" — the aggregator framing implies additional backends (partner data, brand dumps, fine-tuned models) will be added behind the same Prompt Generator → suggestion contract, with the Content Creation Tool remaining blind to which backend produced which suggestion (beyond a uniform "AI-suggested" purple-dot marker).

Caveats

  • Post is an early-results disclosure, not a postmortem or long-running production retrospective. Accuracy (75%), throughput (~50k attrs/week), and the GPT-4o cost win are all launch-phase numbers; drift, per-market variance, and category-level accuracy distribution are not reported.
  • Aggregator is framing, not yet production-complete. Only OpenAI (Turbo, then 4o) is named as an active backend. Brand data dumps, partner contributions, and fine-tuned models are named as future integrations the aggregator is designed to absorb.
  • No per-attribute confidence score disclosed. Unlike Instacart's PARSE (sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms) which emits a self-verification confidence primitive that routes HITL queues, Zalando's copilot relies on the human reviewer to be the sole quality gate at every SKU. No low-confidence-to-human-review routing is disclosed; every suggestion is pre-selected by default and the human decides.
  • No eval / drift-monitoring pipeline disclosed. The post mentions the balanced-vs-unbalanced eval caveat conceptually but does not describe a production random-sampling + LLM-as-judge pipeline analogous to patterns/human-in-the-loop-quality-sampling. Whether one exists internally is unclear.
  • Model choice disclosed is provider-level; fine-tune vs. prompt-only not explicit. The post mentions "fine-tuned models" as a future aggregator backend but does not say whether the current GPT-4o usage involves fine-tuning vs. zero-shot prompting with few-shot examples in the system prompt. Wording ("crafting the prompt") suggests prompt-only.
  • No multi-attribute batching disclosed. Whether a single prompt extracts N attributes per product or one prompt per (product, attribute) pair is not stated. At 50k attrs/week and 25 markets, batching is a plausible next cost win already canonicalised in the Instacart wiki instance.
