

Input image selection tradeoff

Definition

When a vision-language model (VLM) pipeline ingests images to extract attributes, the specific image(s) chosen from the available set is a first-order lever on output quality — separate from prompt engineering and separate from model choice. Not every available image is equally informative, and sending the wrong ones costs both accuracy and tokens.

This is distinct from the upstream question "do we use images at all?" (the concepts/multi-modal-attribute-extraction question). This is the "given we're sending images, which ones?" question.

Why images aren't fungible

A product in an e-commerce catalog typically has multiple images:

  • Product-only front (plain background, product centred).
  • Product-only back / side / detail.
  • Model-worn front / back / lifestyle.
  • Lifestyle / campaign shots (product in context, multiple items).

Each image type carries different signal density for attribute extraction:

| Image type | Signal quality for attributes |
| --- | --- |
| Product-only front | Highest — maximum pixels on the product, minimum distractors |
| Model-worn front | High — near-product coverage plus some wearing context |
| Product-only back/detail | Partial — certain attributes only (rear design, trims) |
| Lifestyle / multi-item | Low — product-of-interest is a small fraction of the frame, and other items confuse the model |

Tradeoff axes

Signal vs. token cost

Each image passed to the VLM costs tokens (roughly proportional to resolution × tile count). Sending all available images maximises signal but multiplies cost per call and can blow the context window.
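The "resolution × tile count" relationship can be sketched with a minimal cost model. The constants below (base cost, cost per 512-px tile) are illustrative placeholders, not any specific vendor's billing numbers:

```python
import math

def estimate_image_tokens(width: int, height: int,
                          base: int = 85, per_tile: int = 170,
                          tile_px: int = 512) -> int:
    """Rough per-image token cost under a tile-based accounting scheme.

    The constants are illustrative, not a specific provider's pricing;
    real APIs also downscale large images before tiling.
    """
    tiles = math.ceil(width / tile_px) * math.ceil(height / tile_px)
    return base + per_tile * tiles

# One 1024x768 shot covers 2x2 tiles; sending four such shots costs
# roughly 4x the tokens of the single best one.
single = estimate_image_tokens(1024, 768)
print(single, 4 * single)
```

Under this model the ensemble cost grows linearly with image count, which is why a best-single policy is the cost floor.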

Best single vs. best ensemble

A pipeline can either:

  • Pick the single best image per article (lowest cost, simplest path) — Zalando's disclosed default.
  • Pick a ranked top-N (more signal, more cost, and parsing complexity — which image did the model derive each attribute from?).
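Both policies can share one ranked-selection helper — a minimal sketch, with hypothetical view-type labels and the rank order following the empirical ranking discussed later in this note:

```python
# Hypothetical view-type labels; best-first order.
VIEW_RANK = ["product_front", "model_front", "product_back", "lifestyle"]

def select_images(available: set[str], n: int = 1) -> list[str]:
    """Return the n best available view types.

    n=1 is the 'best single' policy; n>1 is a ranked top-N ensemble.
    """
    ranked = [v for v in VIEW_RANK if v in available]
    # Unknown view types go last rather than being dropped.
    ranked += sorted(available - set(VIEW_RANK))
    return ranked[:n]
```

Keeping `n` as a parameter lets the same pipeline flip between the cheap default and an ensemble without restructuring.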

Different attributes want different images

"Heel height" is best extracted from a side view of a shoe; "sleeve length" from a front model-worn shot; "fabric composition" from a detail shot. A pipeline that always sends the same image type has blind spots that no prompt can fix.

Empirical ranking (Zalando)

Zalando's disclosed ranking, by accuracy of downstream attribute suggestions (Source: sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding):

  1. Product-only front — "delivering the best results".
  2. Model-worn front — "followed closely".
  3. Other image types — implicitly lower by process of elimination.

From the post: "We found out some image types performed better than others, with product-only front images delivering the best results, followed closely by front images featuring the products being worn by the model."

Trade-offs

  • Cheaper than multi-image ensembles. Picking one best-signal image is the minimum cost-per-call path; ensembles multiply token cost per attribute.
  • Rank rot. The "best" image type can shift as VLMs improve. A model that once couldn't handle lifestyle clutter may handle it fine in the next generation — revisiting the image-selection rule after a backend swap is necessary.
  • Attribute blind spots. Always sending a front shot means back-of-garment attributes (zipper type, tag placement) are invisible. At the extreme, image selection becomes attribute-dependent: different images for different attributes, which is expensive but can be targeted to the attributes where it actually matters.
  • Availability skew. Not every SKU has a product-only front; some only have model-worn. A fallback ladder (product-only → model-worn → lifestyle as last resort) is needed for full catalog coverage.

Relation to the category-relevance map

concepts/category-attribute-relevance-mapping decides which attributes to ask about per category. Image selection decides which images to look at per call. The two filters compose — a well-curated category-relevance map plus a smart image-selection policy reduce cost and raise accuracy at the same time.
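The composition of the two filters can be sketched directly — both lookup tables below are hypothetical stand-ins for the real curated maps:

```python
# Hypothetical category-relevance map: which attributes matter per category.
CATEGORY_ATTRIBUTES = {
    "shoes": ["heel_height", "fabric_composition"],
    "dresses": ["sleeve_length", "fabric_composition"],
}

# Hypothetical attribute -> preferred-view table.
PREFERRED_VIEWS = {
    "heel_height": ["product_side", "product_front"],
    "sleeve_length": ["model_front", "product_front"],
    "fabric_composition": ["product_detail", "product_front"],
}

def plan_extraction(category: str, available: set[str]) -> list[tuple[str, str]]:
    """Compose both filters: only category-relevant attributes,
    each paired with the best available view for it."""
    plan = []
    for attr in CATEGORY_ATTRIBUTES.get(category, []):
        for view in PREFERRED_VIEWS.get(attr, ["product_front"]):
            if view in available:
                plan.append((attr, view))
                break
    return plan
```

Attributes outside the category map never generate a call, and each remaining call targets the most informative image available — cost falls and accuracy rises through the same plan.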
