

Input image selection tradeoff

Definition

When a vision-language model (VLM) pipeline ingests images to extract attributes, the specific image(s) chosen from the available set is a first-order lever on output quality — separate from prompt engineering and separate from model choice. Not every available image is equally informative, and sending the wrong ones costs both accuracy and tokens.

This is distinct from the upstream question "do we use images at all?" (the concepts/multi-modal-attribute-extraction question). This is the "given we're sending images, which ones?" question.

Why images aren't fungible

A product in an e-commerce catalog typically has multiple images:

  • Product-only front (plain background, product centred).
  • Product-only back / side / detail.
  • Model-worn front / back / lifestyle.
  • Lifestyle / campaign shots (product in context, multiple items).

Each image type carries different signal density for attribute extraction:

| Image type | Signal quality for attributes |
| --- | --- |
| Product-only front | Highest — maximum pixels on the product, minimum distractors |
| Model-worn front | High — near-product coverage plus some wearing context |
| Product-only back/detail | Partial — certain attributes only (rear design, trims) |
| Lifestyle / multi-item | Low — product-of-interest is a small fraction of the frame, and other items confuse the model |

Tradeoff axes

Signal vs. token cost

Each image passed to the VLM costs tokens (roughly proportional to resolution × tile count). Sending all available images maximises signal but multiplies cost per call and can blow the context window.
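The "resolution × tile count" relationship can be sketched with a minimal cost model. The constants below (base cost, cost per 512-px tile) are illustrative placeholders, not any specific vendor's billing numbers:

```python
import math

def estimate_image_tokens(width: int, height: int,
                          base: int = 85, per_tile: int = 170,
                          tile_px: int = 512) -> int:
    """Rough per-image token cost under a tile-based accounting scheme.

    The constants are illustrative, not a specific provider's pricing;
    real APIs also downscale large images before tiling.
    """
    tiles = math.ceil(width / tile_px) * math.ceil(height / tile_px)
    return base + per_tile * tiles

# One 1024x768 shot covers 2x2 tiles; sending four such shots costs
# roughly 4x the tokens of the single best one.
single = estimate_image_tokens(1024, 768)
print(single, 4 * single)
```

Under this model the ensemble cost grows linearly with image count, which is why a best-single policy is the cost floor.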

Best single vs. best ensemble

A pipeline can either:

  • Pick the single best image per article (lowest cost, simplest path) — Zalando's disclosed default.
  • Pick a ranked top-N (more signal, more cost, and parsing complexity — which image did the model derive each attribute from?).
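Both policies can share one ranked-selection helper — a minimal sketch, with hypothetical view-type labels and the rank order following the empirical ranking discussed later in this note:

```python
# Hypothetical view-type labels; best-first order.
VIEW_RANK = ["product_front", "model_front", "product_back", "lifestyle"]

def select_images(available: set[str], n: int = 1) -> list[str]:
    """Return the n best available view types.

    n=1 is the 'best single' policy; n>1 is a ranked top-N ensemble.
    """
    ranked = [v for v in VIEW_RANK if v in available]
    # Unknown view types go last rather than being dropped.
    ranked += sorted(available - set(VIEW_RANK))
    return ranked[:n]
```

Keeping `n` as a parameter lets the same pipeline flip between the cheap default and an ensemble without restructuring.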

Different attributes want different images

"Heel height" is best extracted from a side view of a shoe; "sleeve length" from a front model-worn shot; "fabric composition" from a detail shot. A pipeline that always sends the same image type has blind spots that no prompt can fix.

Empirical ranking (Zalando)

Zalando's disclosed ranking, by accuracy of downstream attribute suggestions (Source: sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding):

  1. Product-only front — "delivering the best results".
  2. Model-worn front — "followed closely".
  3. Other image types — implicitly lower by process of elimination.

From the post: "We found out some image types performed better than others, with product-only front images delivering the best results, followed closely by front images featuring the products being worn by the model."

Trade-offs

  • Cheaper than multi-image ensembles. Picking one best-signal image is the minimum cost-per-call path; ensembles multiply token cost per attribute.
  • Rank rot. The "best" image type can shift as VLMs improve. A model that once couldn't handle lifestyle clutter may handle it fine in the next generation — revisiting the image-selection rule after a backend swap is necessary.
  • Attribute blind spots. Always sending a front shot means back-of-garment attributes (zipper type, tag placement) are invisible. At the extreme, image selection becomes attribute-dependent: different images for different attributes, which is expensive but can be targeted to the attributes where it actually matters.
  • Availability skew. Not every SKU has a product-only front; some only have model-worn. A fallback ladder (product-only → model-worn → lifestyle as last resort) is needed for full catalog coverage.

Relation to the category-relevance map

concepts/category-attribute-relevance-mapping decides which attributes to ask about per category. Image selection decides which images to look at per call. The two filters compose — a well-curated category-relevance map plus a smart image-selection policy reduce cost and raise accuracy at the same time.
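The composition of the two filters can be sketched directly — both lookup tables below are hypothetical stand-ins for the real curated maps:

```python
# Hypothetical category-relevance map: which attributes matter per category.
CATEGORY_ATTRIBUTES = {
    "shoes": ["heel_height", "fabric_composition"],
    "dresses": ["sleeve_length", "fabric_composition"],
}

# Hypothetical attribute -> preferred-view table.
PREFERRED_VIEWS = {
    "heel_height": ["product_side", "product_front"],
    "sleeve_length": ["model_front", "product_front"],
    "fabric_composition": ["product_detail", "product_front"],
}

def plan_extraction(category: str, available: set[str]) -> list[tuple[str, str]]:
    """Compose both filters: only category-relevant attributes,
    each paired with the best available view for it."""
    plan = []
    for attr in CATEGORY_ATTRIBUTES.get(category, []):
        for view in PREFERRED_VIEWS.get(attr, ["product_front"]):
            if view in available:
                plan.append((attr, view))
                break
    return plan
```

Attributes outside the category map never generate a call, and each remaining call targets the most informative image available — cost falls and accuracy rises through the same plan.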
