Opaque attribute code translation layer¶
Definition¶
An opaque attribute code translation layer is a bidirectional shim that translates internal identifier codes into human-readable natural language on the way into an LLM, and translates the LLM's natural-language response back into identifier codes on the way out. It is the vocabulary bridge between a system whose internal representation is opaque (numeric IDs, stable codes, i18n-keyed enums) and an LLM whose representation is English (or another natural language).
The layer is purely inbound + outbound string rewriting. It does not change the prompt's semantics, the model's reasoning, or the catalog's storage — it just makes each end of the conversation legible to the other.
Why the naive alternative fails¶
The naive option — pass the codes directly to the LLM — fails in three ways:
- The LLM has no training prior over internal codes. A code like `assortment_type_7312` carries no semantic signal. Even a frontier model can't guess that `7312` means `Petite`.
- Hallucination risk is high. Faced with opaque codes, the model may invent plausible-looking codes (`assortment_type_7355`) that don't exist in the schema, producing output that passes shape-level validation but refers to no real attribute value.
- Prompt engineering loses leverage. You can't "explain the difference between Petite and Tall" in a prompt if the values in the prompt are `7312` and `7841` — the explanation and the value can't be cross-referenced.
Mechanism (bidirectional)¶
Inbound (code → English)¶
At prompt-construction time, resolve each attribute code to its canonical English label from a metadata source (in Zalando's case, Article Masterdata):
raw codes: [assortment_type_7312, assortment_type_7841]
│
▼ (metadata lookup)
prompt: "The assortment_type can be Petite or Tall."
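The inbound step can be sketched as a dictionary lookup at prompt-construction time. This is a minimal illustration, not Zalando's implementation: the in-memory table stands in for the metadata source (Article Masterdata), and the function name is hypothetical.

```python
# Hypothetical inbound (code -> English) translation. In production the
# table would be materialised from a metadata source, not hard-coded.
CODE_TO_LABEL = {
    "assortment_type_7312": "Petite",
    "assortment_type_7841": "Tall",
}

def render_prompt_fragment(attribute: str, codes: list[str]) -> str:
    """Resolve each code to its English label and phrase the allowed options."""
    # A KeyError here is deliberate: an unknown code means the translation
    # table is stale, which should fail loudly rather than silently.
    labels = [CODE_TO_LABEL[code] for code in codes]
    return f"The {attribute} can be {' or '.join(labels)}."

print(render_prompt_fragment(
    "assortment_type",
    ["assortment_type_7312", "assortment_type_7841"],
))
# -> The assortment_type can be Petite or Tall.
```

Failing loudly on an unknown code is one defensible choice here; the stale-map tradeoff below explains why silent fallbacks are dangerous.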
Outbound (English → code)¶
After the LLM responds, map the English values back to their canonical codes before storing / displaying:
LLM response: {"assortment_type": "Petite"}
│
▼ (reverse lookup)
storage payload: {"assortment_type": "assortment_type_7312"}
The outbound layer also discards irrelevant output: hallucinated fields not in the schema, English labels that don't map back to any code, malformed structure.
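The outbound step, including the discard behaviour, can be sketched the same way. Again a hedged illustration with assumed names: unknown fields and labels that map to no code are silently dropped from the storage payload.

```python
# Hypothetical outbound (English -> code) translation with discarding.
LABEL_TO_CODE = {
    "Petite": "assortment_type_7312",
    "Tall": "assortment_type_7841",
}
SCHEMA_FIELDS = {"assortment_type"}  # fields the catalog schema actually knows

def to_storage_payload(llm_response: dict) -> dict:
    """Map English values back to codes; drop anything that doesn't map."""
    payload = {}
    for field, label in llm_response.items():
        if field not in SCHEMA_FIELDS:
            continue  # hallucinated field not in the schema: discard
        code = LABEL_TO_CODE.get(label)
        if code is None:
            continue  # English label maps back to no known code: discard
        payload[field] = code
    return payload

print(to_storage_payload({"assortment_type": "Petite", "made_up_field": "x"}))
# -> {'assortment_type': 'assortment_type_7312'}
```

Malformed structure (non-dict responses) would be rejected before this step by whatever JSON validation wraps the LLM call.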
Where the layer lives¶
Usually colocated with prompt construction, not with the catalog or the LLM SDK:
- If colocated with the catalog, you couple domain-specific translation logic with business data, which violates separation of concerns and complicates any catalog refactor.
- If colocated with the LLM SDK / client, you couple model-facing concerns with catalog semantics — complicates swapping the backend.
- Colocated with a prompt-materialisation service (Prompt Generator in Zalando's case), the layer sits at the only point that already knows both vocabularies.
Tradeoffs¶
- Translation errors become correctness errors silently. If the code→English map is stale (new value added, old value renamed in an i18n file), the LLM sees the wrong options and returns the wrong answer. The translation layer needs to be treated as production-critical data, not a labeling concern.
- Round-trip loss on near-synonyms. If the LLM returns "petite" (lowercase) or "petite fit" and the catalog only knows "Petite", the reverse mapping needs fuzzy normalisation — a common bug surface.
- The translation payload adds tokens. For an attribute with many allowed values, the prompt grows linearly with vocabulary size. Paid on every call — for a catalog with N attributes × M allowed values per attribute, this is often the largest single chunk of the prompt.
- Portability premium. Keeping the translation Zalando-internal (rather than fine-tuning a model on the codes) preserves the ability to swap LLM backends. Zalando's GPT-4 Turbo → GPT-4o migration validated this premium once.
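The round-trip-loss tradeoff above usually forces a normalisation pass in front of the reverse lookup. A minimal sketch, assuming two illustrative rules (case-folding and stripping a trailing "fit" qualifier); real systems tune these rules per attribute:

```python
# Hypothetical fuzzy normalisation for the English -> code reverse lookup.
LABEL_TO_CODE = {
    "Petite": "assortment_type_7312",
    "Tall": "assortment_type_7841",
}

def normalise(label: str) -> str:
    """Canonicalise an LLM-produced label before lookup (illustrative rules)."""
    cleaned = label.strip().casefold()
    if cleaned.endswith(" fit"):       # "petite fit" -> "petite"
        cleaned = cleaned[: -len(" fit")]
    return cleaned

# Index the canonical labels under the same normalisation.
NORMALISED_LOOKUP = {normalise(label): code for label, code in LABEL_TO_CODE.items()}

def reverse_lookup(label: str):
    """Return the code for a near-synonym label, or None if nothing maps."""
    return NORMALISED_LOOKUP.get(normalise(label))

print(reverse_lookup("petite fit"))  # -> assortment_type_7312
print(reverse_lookup("Slim"))        # -> None
```

The key property is that both sides of the lookup go through the same `normalise` function, so the match is defined by one rule set rather than ad hoc string comparisons scattered through the code.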
Generalisation¶
The pattern applies anywhere an LLM is wedged into a system whose internal vocabulary is not natural language:
- Catalog systems with numeric SKU attribute codes.
- i18n-keyed systems where fields are message-catalog lookups (`nav.home`, `cart.empty`).
- Enum-based domain models where backend names are UpperSnakeCase constants.
- Protocol / schema translation (the LLM sees `"Active"`, the DB sees enum `1`).
Seen in¶
- sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding — canonical wiki instance. Zalando's Prompt Generator implements the layer in both directions for the Content Creation Copilot. Worked example: `assortment_type_7312 ↔ Petite`, `assortment_type_7841 ↔ Tall`. Motivating disclosure: "We built a translation layer that converts OpenAI output into information directly usable by Zalando and discards the part that is not relevant."
Related¶
- systems/zalando-prompt-generator — where the layer lives in Zalando's architecture
- systems/zalando-article-masterdata — the metadata source that powers the code ↔ English lookup table
- systems/zalando-content-creation-copilot — the system that requires the layer
- patterns/llm-attribute-extraction-platform — platform pattern this concept lives inside
- patterns/model-agnostic-suggestion-aggregator — the portability story the translation layer preserves