# Input image selection tradeoff

## Definition
When a vision-language model (VLM) pipeline ingests images to extract attributes, the specific image(s) chosen from the available set is a first-order lever on output quality — separate from prompt engineering and separate from model choice. Not every available image is equally informative, and sending the wrong ones costs both accuracy and tokens.
This is distinct from the upstream question "do we use images at all?" (the concepts/multi-modal-attribute-extraction question). This is the "given we're sending images, which ones?" question.
## Why images aren't fungible
A product in an e-commerce catalog typically has multiple images:
- Product-only front (plain background, product centred).
- Product-only back / side / detail.
- Model-worn front / back / lifestyle.
- Lifestyle / campaign shots (product in context, multiple items).
Each image type carries different signal density for attribute extraction:
| Image type | Signal quality for attributes |
|---|---|
| Product-only front | Highest — maximum pixels on the product, minimum distractors |
| Model-worn front | High — near-product coverage plus some wearing context |
| Product-only back/detail | Partial — certain attributes only (rear design, trims) |
| Lifestyle / multi-item | Low — product-of-interest is small fraction of frame, other items confuse the model |
## Tradeoff axes

### Signal vs. token cost
Each image passed to the VLM costs tokens (roughly proportional to resolution × tile count). Sending all available images maximises signal but multiplies cost per call and can blow the context window.
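To make the cost side concrete, here is a sketch of per-image token accounting. The constants (85 base tokens plus 170 per 512-pixel tile) follow OpenAI's published high-detail formula for GPT-4o-class models; other VLM backends bill differently, so treat this as an assumption, not a universal rule:

```python
import math

def image_token_cost(width: int, height: int) -> int:
    """Estimate tokens for one image under GPT-4o-style high-detail tiling.

    Sketch only: the 85-base / 170-per-tile constants are OpenAI's
    published accounting and may not match other backends.
    """
    # 1. Fit within 2048x2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2. Scale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3. Count the 512px tiles covering the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under this accounting a single 1024×1024 image costs 765 tokens, so sending four images instead of one roughly quadruples the image portion of every call.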
### Best single vs. best ensemble

A pipeline can either:
- Pick the single best image per article (lowest cost, simplest path). This is Zalando's disclosed default.
- Pick a ranked top-N (more signal, more cost, and added parsing complexity: which image did the model derive each attribute from?).
### Different attributes want different images
"Heel height" is best extracted from a side view of a shoe; "sleeve length" from a front model-worn shot; "fabric composition" from a detail shot. A pipeline that always sends the same image type has blind spots that no prompt can fix.
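An attribute-dependent policy can be sketched as a per-attribute view-preference map. The attribute keys and view names below are illustrative assumptions, not a disclosed schema:

```python
# Hypothetical attribute -> preferred-view mapping. Each attribute lists
# views in descending signal order, ending in a general-purpose fallback.
PREFERRED_VIEWS = {
    "heel_height": ["product_only_side", "product_only_front"],
    "sleeve_length": ["model_worn_front", "product_only_front"],
    "fabric_composition": ["product_only_detail", "product_only_front"],
}

def views_for(attribute: str, available: set[str]) -> list[str]:
    """Return the preferred views for an attribute that actually exist
    for this SKU, best-first."""
    wanted = PREFERRED_VIEWS.get(attribute, ["product_only_front"])
    return [view for view in wanted if view in available]
```

The per-attribute map is the expensive end of the tradeoff: it can mean different VLM calls for different attribute groups, so in practice it is worth targeting only at attributes where the default image demonstrably fails.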
## Empirical ranking (Zalando)
Zalando's disclosed ranking, by accuracy of downstream attribute suggestions (Source: sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding):
1. Product-only front — "delivering the best results".
2. Model-worn front — "followed closely".
3. Other image types — implicitly lower by process of elimination.
From the post: "We found out some image types performed better than others, with product-only front images delivering the best results, followed closely by front images featuring the products being worn by the model."
## Trade-offs
- Cheaper than multi-image ensembles. Picking one best-signal image is the minimum cost-per-call path; ensembles multiply token cost per attribute.
- Rank rot. The "best" image type can shift as VLMs improve. A model that once couldn't handle lifestyle clutter may handle it fine in the next generation — revisiting the image-selection rule after a backend swap is necessary.
- Attribute blind spots. Always sending a front shot means back-of-garment attributes (zipper type, tag placement) are invisible. At the extreme, image selection becomes attribute-dependent: different images for different attributes, which is expensive but can be targeted to the attributes where it actually matters.
- Availability skew. Not every SKU has a product-only front; some only have model-worn. A fallback ladder (product-only → model-worn → lifestyle as last resort) is needed for full catalog coverage.
## Relation to the category-relevance map

concepts/category-attribute-relevance-mapping decides which attributes to ask about per category. Image selection decides which images to look at per call. The two filters compose: a well-curated category-relevance map plus a smart image-selection policy reduces cost and raises accuracy at the same time.
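A minimal sketch of that composition, with both the category map and the image preference as illustrative placeholder data (neither is Zalando's actual schema):

```python
# Illustrative category-relevance map (see the sibling concept).
CATEGORY_ATTRIBUTES = {
    "shoes": ["heel_height", "toe_shape", "closure"],
    "dresses": ["sleeve_length", "neckline", "pattern"],
}
# Illustrative image-selection preference, best-first.
IMAGE_PREFERENCE = ["product_only_front", "model_worn_front", "lifestyle"]

def build_extraction_call(category: str, available: dict[str, str]) -> dict:
    """Compose both filters into one VLM call spec: the category map
    narrows the attributes asked for, the image policy narrows the
    pixels sent."""
    image_url = next(
        (available[t] for t in IMAGE_PREFERENCE if t in available), None
    )
    return {
        "attributes": CATEGORY_ATTRIBUTES.get(category, []),
        "image": image_url,
    }
```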
## Seen in
- sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding — canonical wiki instance. Zalando's Prompt Generator (systems/zalando-prompt-generator) encodes a ranked image-type preference; product-only front is the preferred input. Disclosed as an engineering challenge: "A further challenge involved identifying the optimal set of images to enhance input quality while balancing cost efficiency."
## Related
- systems/zalando-prompt-generator — where the selection logic lives in Zalando's architecture
- systems/zalando-content-creation-copilot
- systems/gpt-4o — the VLM backend in Zalando's case
- concepts/multi-modal-attribute-extraction — the upstream concept that requires this tradeoff
- concepts/category-attribute-relevance-mapping — the sibling filter on the attribute axis
- patterns/llm-attribute-extraction-platform — platform pattern this concept lives inside