CONCEPT Cited by 3 sources

Multi-modal attribute extraction

Definition

Multi-modal attribute extraction is the pattern of using a vision-language model (VLM) — one that natively takes both image and text inputs — to extract a structured attribute value about a product / entity / document, reasoning across both modalities in a single forward pass.

The distinguishing feature is cross-modal reasoning: the model may use the image to find the value when text is missing, use text when the image is missing, OR cross-reference the two to verify or disambiguate.

Why the text-only baseline is insufficient

E-commerce / catalog data is structurally multi-modal:

  • Value only in image. Sheet counts, serving sizes, and "organic" / "non-GMO" badges are often printed on the packaging image and absent from the text description or database field. A text-only pipeline systematically misses these — not because it's a bad model, but because the signal isn't in its input.
  • Value only in text, but implicit. A description of "3 boxes of 124 tissues" never states the total sheet count (372) — it requires arithmetic reasoning over unstructured text. Traditional text ML finds this hard.
  • Cross-reference improves precision. When the text says "Orange Drink, also available in Grape, Strawberry", the image (packaging color, fruit imagery) disambiguates which flavor is the primary one.

The text-only LLM partially solves the reasoning problem (it can multiply 3 × 124); it does not solve the image-only-signal problem.
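The arithmetic in the tissue example is trivial once the two numbers have been identified; the hard part is finding them in free text. As a minimal sketch of why hand-written rules struggle here, the snippet below handles only the literal "N boxes of M" phrasing. The pattern and the function name `total_count_from_text` are illustrative assumptions, not taken from the source.

```python
import re
from typing import Optional

# Hypothetical rule: matches "<N> boxes|packs|rolls of <M>" and nothing else.
_PACK_PATTERN = re.compile(r"(\d+)\s+(?:boxes|packs|rolls)\s+of\s+(\d+)", re.IGNORECASE)


def total_count_from_text(description: str) -> Optional[int]:
    """Return packs * units-per-pack if the rigid pattern matches, else None."""
    match = _PACK_PATTERN.search(description)
    if match is None:
        return None  # the rule abstains on any phrasing it was not written for
    packs, per_pack = int(match.group(1)), int(match.group(2))
    return packs * per_pack


print(total_count_from_text("3 boxes of 124 tissues"))        # 372
print(total_count_from_text("Three boxes, 124 tissues each"))  # None
```

The second call shows the brittleness: the same fact phrased differently yields nothing, which is the gap a text-only LLM closes by reasoning over the description instead of matching it.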

Three extraction paths, picked by input availability

A multi-modal extraction pipeline adaptively uses whichever signal is available per product:

Text available? | Image available? | Path
Yes             | No               | Text-only reasoning
No              | Yes              | Pure image extraction (value printed on packaging)
Yes             | Yes              | Cross-reference + consistency check

The same model handles all three without routing logic — the VLM receives both inputs and the prompt asks "extract attribute X from whichever signal is informative".
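A minimal sketch of what "no routing logic" can look like in practice: a single request builder attaches whichever inputs exist, and the path name is derived only for logging. The request layout (`parts`, `type`, `content`) and both function names are hypothetical and do not correspond to any particular vendor API or to PARSE's actual implementation.

```python
from typing import Optional


def extraction_path(text: Optional[str], image_ref: Optional[str]) -> str:
    """Name the path taken for a product; used for logging, not for routing."""
    has_text = bool(text and text.strip())
    has_image = bool(image_ref)
    if has_text and has_image:
        return "cross-reference"
    if has_image:
        return "image-only"
    if has_text:
        return "text-only"
    return "no-input"


def build_request(attribute: str, text: Optional[str], image_ref: Optional[str]) -> dict:
    """Assemble one VLM request containing every available signal."""
    parts = [{
        "type": "instruction",
        "content": (
            f"Extract the attribute '{attribute}' from whichever signal is "
            "informative. If text and image disagree, report the conflict."
        ),
    }]
    if text and text.strip():
        parts.append({"type": "text", "content": text.strip()})
    if image_ref:
        parts.append({"type": "image", "content": image_ref})
    return {
        "attribute": attribute,
        "path": extraction_path(text, image_ref),
        "parts": parts,
    }


request = build_request("sheet_count", "3 boxes of 124 tissues", "front.jpg")
print(request["path"], len(request["parts"]))  # cross-reference 3
```

The same prompt is sent in all three cases; only the attached parts differ, so adding a new attribute does not require a new branch.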

Empirical lift (Instacart PARSE sheet_count)

  • Legacy SQL rules: poor — can't parse "3 boxes of 124" or read images.
  • Text-only LLM: "significant jump in both recall and precision" over SQL, due to arithmetic + contextual reasoning over unstructured description.
  • Multi-modal LLM on top of text-only: +10% recall, driven by cases where the value is image-only or requires cross-reference.
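One way to read the recall figure is as follows. The source does not say how the metrics were computed, so this sketch assumes a common convention for extraction with abstention: precision is measured over the products where the extractor returned a value, recall over all products that have a ground-truth value. The product IDs and values below are made-up toy data, not Instacart results.

```python
from typing import Dict, Optional, Tuple


def precision_recall(
    predictions: Dict[str, Optional[int]],
    truth: Dict[str, int],
) -> Tuple[float, float]:
    """Precision over non-abstained predictions, recall over all labeled products."""
    predicted = {k: v for k, v in predictions.items() if v is not None and k in truth}
    correct = sum(1 for k, v in predicted.items() if truth[k] == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy ground truth; product "b" has its count printed only on the packaging.
truth = {"a": 372, "b": 80, "c": 200, "d": 100}
text_only = {"a": 372, "b": None, "c": 200, "d": 90}   # abstains on "b"
multi_modal = {"a": 372, "b": 80, "c": 200, "d": 90}   # reads "b" from the image

print(precision_recall(text_only, truth))
print(precision_recall(multi_modal, truth))
```

In the toy data the image-only product turns an abstention into a correct prediction, which raises recall. That is the mechanism the post credits for the multi-modal lift.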

From the post:

"Text-only LLMs already delivered a significant jump in both recall and precision compared to legacy SQL approaches, thanks to their ability to reason through complex or implicit product descriptions. Multi-modal LLMs further increased recall by 10% over text-only models, since they could pull in image-based cues when available — capturing cases where key details appear solely on packaging or where cross-referencing both sources is necessary."

Tradeoffs / gotchas

  • Multi-modal LLMs are more expensive per call: processing image tokens costs more than processing text tokens alone. Don't pay the premium if the value is reliably in text. See concepts/llm-cascade: run text-only first, and go multi-modal only when text confidence is low.
  • Image-token budget is a real constraint. Product catalogs can have many images per SKU; you typically pick one primary image or risk blowing the context window.
  • OCR-style attributes interact with image quality. Low-resolution packaging shots, non-English labels, or obscured numeric fields lose accuracy regardless of the VLM's capability.
  • Hallucination from visual noise. VLMs can "read" numbers / text that aren't present, especially on ambiguous packaging. Pair with a self-verification entailment prompt.
  • Image-embedding as substitute may be cheaper — for some attributes (brand, category) a fine-tuned vision classifier is cheaper and more accurate than a multi-modal LLM. Multi-modal LLM shines when the attribute space is open-ended (e.g. "what's the flavor") and can't be enumerated in advance.
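The first, second and fourth gotchas above can be combined into one control flow: try the text-only model, escalate to the VLM with a single primary image only when text confidence is low, and optionally run a verification pass on the VLM's answer. This is a sketch under stated assumptions, not the source's implementation: the `Extraction` type, the callable signatures, the confidence score and the 0.8 threshold are all hypothetical, and the models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Extraction:
    value: Optional[str]
    confidence: float  # assumed to lie in [0, 1]; how it is estimated is out of scope
    source: str        # "text", "multi-modal", or "rejected"


def cascade_extract(
    text: str,
    images: Sequence[str],
    text_model: Callable[[str], Extraction],
    vlm: Callable[[str, str], Extraction],
    verify: Optional[Callable[[str, str, str], bool]] = None,
    min_confidence: float = 0.8,  # hypothetical threshold; tune on labeled data
) -> Extraction:
    """Text-only first; multi-modal only when text confidence is low."""
    first = text_model(text)
    if first.value is not None and first.confidence >= min_confidence:
        return first  # cheap path succeeded, no image tokens spent
    if not images:
        return first  # nothing to escalate with
    primary = images[0]  # one image only, to bound the image-token budget
    second = vlm(text, primary)
    if second.value is not None and verify is not None:
        # Verification asks whether the inputs actually support the value.
        if not verify(text, primary, second.value):
            return Extraction(None, 0.0, "rejected")
    return second
```

The caller is assumed to have ordered `images` so that the preferred image comes first. Rejecting an unverified value trades recall for precision, so whether to enable `verify` depends on the attribute.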

Seen in

  • sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms: canonical wiki instance. Instacart's PARSE uses multi-modal LLMs as one of its supported extraction algorithms; the sheet_count case study documents +10% recall over the text-only LLM. Two motivating examples: (a) a count of 80 sheets printed only on the packaging image; (b) "3 boxes of 124 tissues", which needs multiplication reasoning over text.
  • sources/2026-02-17-instacart-turning-data-into-velocity-capers-edge-and-cloud-data-flywheel-with-capsight: future-work target on a different Instacart platform. Instacart's Capsight (edge→cloud data flywheel for Caper smart carts) explicitly names full sensor fusion (camera + weight + motion + location) fed into a foundation model as the next step beyond Phase-1 CV-only. Shows the concept generalising beyond catalog attributes (PARSE's domain) to real-world in-store environment understanding (intent detection, multi-item interactions).
  • sources/2024-09-17-zalando-content-creation-copilot-ai-assisted-product-onboarding: fashion-catalog sibling instance at Zalando (systems/zalando-content-creation-copilot). A multi-modal VLM (OpenAI GPT-4 Turbo → GPT-4o) extracts product attributes (neckline, assortment type, colour, fit) from product images during content onboarding. Shares PARSE's core architectural stance (one VLM backend over a schema-driven prompt). Introduces a fashion-specific image-selection policy (concepts/input-image-selection-tradeoff): product-only front images outperform model-worn front images, which in turn outperform other angles, a ranking that the Prompt Generator encodes as a preference order. Empirical weakness disclosed: long-tail fashion vocabulary (specific neckline variants like deep scoop neck), where GPT-4o's general-purpose VLM pre-training produces less precise outputs on balanced eval sets than on production unbalanced distributions; this is an eval-set-design caveat that is model-agnostic.
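The image preference order described in the Zalando entry (product-only front, then model-worn front, then other angles) can be sketched as a simple ranking. The kind labels `product_front` and `model_front`, and the function name, are hypothetical; the source does not show how the Prompt Generator encodes the ranking.

```python
from typing import List, Optional, Tuple

# Lower rank = preferred. Labels are illustrative assumptions.
_PREFERENCE = {"product_front": 0, "model_front": 1}
_OTHER_RANK = 2  # any other angle


def select_primary_image(images: List[Tuple[str, str]]) -> Optional[str]:
    """Pick one image reference from (image_ref, kind) pairs, or None if empty."""
    if not images:
        return None
    # min() keeps the first item among equally ranked candidates.
    best = min(images, key=lambda item: _PREFERENCE.get(item[1], _OTHER_RANK))
    return best[0]


candidates = [("a.jpg", "back"), ("b.jpg", "model_front"), ("c.jpg", "product_front")]
print(select_primary_image(candidates))  # c.jpg
```

Selecting a single image this way also addresses the image-token budget concern, since only the highest-ranked image is sent to the model.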
Last updated · 542 distilled / 1,571 read