Opaque attribute code translation layer¶
Definition¶
An opaque attribute code translation layer is a bidirectional shim that translates internal identifier codes into human-readable natural language on the way into an LLM, and translates the LLM's natural-language response back into identifier codes on the way out. It is the vocabulary bridge between a system whose internal representation is opaque (numeric IDs, stable codes, i18n-keyed enums) and an LLM whose representation is English (or another natural language).
The layer is purely inbound + outbound string rewriting. It does not change the prompt's semantics, the model's reasoning, or the catalog's storage — it just makes each end of the conversation legible to the other.
Why the naive alternative fails¶
The naive option — pass the codes directly to the LLM — fails in three ways:
- The LLM has no training prior over internal codes. A code like `assortment_type_7312` carries no semantic signal. Even a frontier model can't guess that `7312` means `Petite`.
- Hallucination risk is high. Faced with opaque codes, the model may invent plausible-looking codes (`assortment_type_7355`) that don't exist in the schema, producing output that passes shape-level validation but refers to no real attribute value.
- Prompt engineering loses leverage. You can't "explain the difference between Petite and Tall" in a prompt if the values in the prompt are `7312` and `7841` — the explanation and the value can't be cross-referenced.
Mechanism (bidirectional)¶
Inbound (code → English)¶
At prompt-construction time, resolve each attribute code to its canonical English label from a metadata source (in Zalando's case, Article Masterdata):
raw codes: [assortment_type_7312, assortment_type_7841]
│
▼ (metadata lookup)
prompt: "The assortment_type can be Petite or Tall."
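The inbound step can be sketched as a dictionary lookup at prompt-construction time. This is a minimal illustration, not Zalando's implementation: the in-memory table stands in for the metadata source (Article Masterdata), and the function name is hypothetical.

```python
# Hypothetical inbound (code -> English) translation. In production the
# table would be materialised from a metadata source, not hard-coded.
CODE_TO_LABEL = {
    "assortment_type_7312": "Petite",
    "assortment_type_7841": "Tall",
}

def render_prompt_fragment(attribute: str, codes: list[str]) -> str:
    """Resolve each code to its English label and phrase the allowed options."""
    # A KeyError here is deliberate: an unknown code means the translation
    # table is stale, which should fail loudly rather than silently.
    labels = [CODE_TO_LABEL[code] for code in codes]
    return f"The {attribute} can be {' or '.join(labels)}."

print(render_prompt_fragment(
    "assortment_type",
    ["assortment_type_7312", "assortment_type_7841"],
))
# -> The assortment_type can be Petite or Tall.
```

Failing loudly on an unknown code is one defensible choice here; the stale-map tradeoff below explains why silent fallbacks are dangerous.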
Outbound (English → code)¶
After the LLM responds, map the English values back to their canonical codes before storing / displaying:
LLM response: {"assortment_type": "Petite"}
│
▼ (reverse lookup)
storage payload: {"assortment_type": "assortment_type_7312"}
The outbound layer also discards irrelevant output: hallucinated fields not in the schema, English labels that don't map back to any code, malformed structure.
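The outbound step, including the discard behaviour, can be sketched the same way. Again a hedged illustration with assumed names: unknown fields and labels that map to no code are silently dropped from the storage payload.

```python
# Hypothetical outbound (English -> code) translation with discarding.
LABEL_TO_CODE = {
    "Petite": "assortment_type_7312",
    "Tall": "assortment_type_7841",
}
SCHEMA_FIELDS = {"assortment_type"}  # fields the catalog schema actually knows

def to_storage_payload(llm_response: dict) -> dict:
    """Map English values back to codes; drop anything that doesn't map."""
    payload = {}
    for field, label in llm_response.items():
        if field not in SCHEMA_FIELDS:
            continue  # hallucinated field not in the schema: discard
        code = LABEL_TO_CODE.get(label)
        if code is None:
            continue  # English label maps back to no known code: discard
        payload[field] = code
    return payload

print(to_storage_payload({"assortment_type": "Petite", "made_up_field": "x"}))
# -> {'assortment_type': 'assortment_type_7312'}
```

Malformed structure (non-dict responses) would be rejected before this step by whatever JSON validation wraps the LLM call.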
Where the layer lives¶
Usually colocated with prompt construction, not with the catalog or the LLM SDK:
- If colocated with the catalog, you couple domain-specific translation logic with business data, which violates separation of concerns and complicates any catalog refactor.
- If colocated with the LLM SDK / client, you couple model-facing concerns with catalog semantics — complicates swapping the backend.
- Colocated with a prompt-materialisation service (Prompt Generator in Zalando's case), the layer sits at the only point that already knows both vocabularies.
Tradeoffs¶
- Translation errors become correctness errors silently. If the code→English map is stale (new value added, old value renamed in an i18n file), the LLM sees the wrong options and returns the wrong answer. The translation layer needs to be treated as production-critical data, not a labeling concern.
- Round-trip loss on near-synonyms. If the LLM returns "petite" (lowercase) or "petite fit" and the catalog only knows "Petite", the reverse mapping needs fuzzy normalisation — a common bug surface.
- The translation payload adds tokens. For an attribute with many allowed values, the prompt grows linearly with vocabulary size. Paid on every call — for a catalog with N attributes × M allowed values per attribute, this is often the largest single chunk of the prompt.
- Portability premium. Keeping the translation Zalando-internal (rather than fine-tuning a model on the codes) preserves the ability to swap LLM backends. Zalando's GPT-4 Turbo → GPT-4o migration validated this premium once.
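The round-trip-loss tradeoff above usually forces a normalisation pass in front of the reverse lookup. A minimal sketch, assuming two illustrative rules (case-folding and stripping a trailing "fit" qualifier); real systems tune these rules per attribute:

```python
# Hypothetical fuzzy normalisation for the English -> code reverse lookup.
LABEL_TO_CODE = {
    "Petite": "assortment_type_7312",
    "Tall": "assortment_type_7841",
}

def normalise(label: str) -> str:
    """Canonicalise an LLM-produced label before lookup (illustrative rules)."""
    cleaned = label.strip().casefold()
    if cleaned.endswith(" fit"):       # "petite fit" -> "petite"
        cleaned = cleaned[: -len(" fit")]
    return cleaned

# Index the canonical labels under the same normalisation.
NORMALISED_LOOKUP = {normalise(label): code for label, code in LABEL_TO_CODE.items()}

def reverse_lookup(label: str):
    """Return the code for a near-synonym label, or None if nothing maps."""
    return NORMALISED_LOOKUP.get(normalise(label))

print(reverse_lookup("petite fit"))  # -> assortment_type_7312
print(reverse_lookup("Slim"))        # -> None
```

The key property is that both sides of the lookup go through the same `normalise` function, so the match is defined by one rule set rather than ad hoc string comparisons scattered through the code.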
Generalisation¶
The pattern applies anywhere an LLM is wedged into a system whose internal vocabulary is not natural language:
- Catalog systems with numeric SKU attribute codes.
- i18n-keyed systems where fields are message-catalog lookups (`nav.home`, `cart.empty`).
- Enum-based domain models where backend names are UpperSnakeCase constants.
- Protocol / schema translation (the LLM sees `"Active"`, the DB sees enum `1`).
Seen in¶
- sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding — canonical wiki instance. Zalando's Prompt Generator implements the layer in both directions for the Content Creation Copilot. Worked example: `assortment_type_7312 ↔ Petite`, `assortment_type_7841 ↔ Tall`. Motivating disclosure: "We built a translation layer that converts OpenAI output into information directly usable by Zalando and discards the part that is not relevant."
Related¶
- systems/zalando-prompt-generator — where the layer lives in Zalando's architecture
- systems/zalando-article-masterdata — the metadata source that powers the code ↔ English lookup table
- systems/zalando-content-creation-copilot — the system that requires the layer
- patterns/llm-attribute-extraction-platform — platform pattern this concept lives inside
- patterns/model-agnostic-suggestion-aggregator — the portability story the translation layer preserves