
Multi-modal attribute extraction

Definition

Multi-modal attribute extraction is the pattern of using a vision-language model (VLM) — one that natively takes both image and text inputs — to extract a structured attribute value about a product / entity / document, reasoning across both modalities in a single forward pass.

The distinguishing feature is cross-modal reasoning: the model may use the image to find the value when text is missing, use text when the image is missing, OR cross-reference the two to verify or disambiguate.

Why the text-only baseline is insufficient

E-commerce / catalog data is structurally multi-modal:

  • Value only in image. Sheet counts, serving sizes, and "organic" / "non-GMO" badges are often printed on the packaging image and absent from the text description or database field. A text-only pipeline systematically misses these — not because it's a bad model, but because the signal isn't in its input.
  • Value only in text, but implicit. A description of "3 boxes of 124 tissues" never states the total sheet count (372) — it requires arithmetic reasoning over unstructured text. Traditional text ML finds this hard.
  • Cross-reference improves precision. When text says "Orange Drink, also available in Grape, Strawberry", the primary flavor disambiguates from the image (packaging color, fruit imagery).

The text-only LLM partially solves the reasoning problem (it can multiply 3 × 124); it does not solve the image-only-signal problem.
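To see why the implicit-arithmetic case is hard for rule-based pipelines, here is a minimal sketch of the kind of hand-written pattern a rules system would need for just one phrasing. The function name and regex are hypothetical illustrations, not from the source; an LLM performs the same multiplication from free text without a pattern per phrasing.

```python
import re
from typing import Optional

def implicit_total(description: str) -> Optional[int]:
    """Hypothetical rule: parse 'N boxes of M <unit>' and multiply.

    A rules pipeline needs one such brittle pattern per phrasing
    ("3-pack of 124", "124 ct x3", ...); any unseen wording falls through.
    """
    m = re.search(r"(\d+)\s+boxes?\s+of\s+(\d+)", description, re.IGNORECASE)
    if m is None:
        return None  # pattern miss: the value is silently dropped
    count, per_box = int(m.group(1)), int(m.group(2))
    return count * per_box

implicit_total("3 boxes of 124 tissues")  # → 372
implicit_total("ultra soft tissues")      # → None (no total to recover)
```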

Three extraction paths, picked by input availability

A multi-modal extraction pipeline adaptively uses whichever signal is available per product:

Text available?   Image available?   Path
Yes               No                 Text-only reasoning
No                Yes                Pure image extraction (value printed on packaging)
Yes               Yes                Cross-reference + consistency check

The same model handles all three without routing logic — the VLM receives both inputs and the prompt asks "extract attribute X from whichever signal is informative".
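The "no routing logic" point can be made concrete with a request builder that attaches whichever inputs exist. This is a sketch assuming an OpenAI-style multi-part message format; the prompt wording and `build_messages` helper are illustrative, not the exact prompt from the source.

```python
import base64
from typing import Optional

# Hypothetical extraction prompt; the source describes the pattern, not exact wording.
PROMPT = (
    "Extract the attribute '{attribute}' for this product. "
    "Use the text, the image, or both -- whichever signal is informative. "
    "Answer with the value only, or 'unknown'."
)

def build_messages(attribute: str,
                   description: Optional[str],
                   image_bytes: Optional[bytes]) -> list:
    """One request covers all three paths: the VLM, not routing logic,
    decides which modality to rely on."""
    content = [{"type": "text", "text": PROMPT.format(attribute=attribute)}]
    if description:  # text-only and cross-reference paths
        content.append({"type": "text", "text": f"Description: {description}"})
    if image_bytes:  # image-only and cross-reference paths
        b64 = base64.b64encode(image_bytes).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```

The same builder is called for every product; missing inputs simply yield a shorter message, so there is no branch that picks a different model per path.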

Empirical lift (Instacart PARSE sheet_count)

  • Legacy SQL rules: poor — can't parse "3 boxes of 124" or read images.
  • Text-only LLM: "significant jump in both recall and precision" over SQL, due to arithmetic + contextual reasoning over unstructured description.
  • Multi-modal LLM on top of text-only: +10% recall, driven by cases where the value is image-only or requires cross-reference.

From the post:

"Text-only LLMs already delivered a significant jump in both recall and precision compared to legacy SQL approaches, thanks to their ability to reason through complex or implicit product descriptions. Multi-modal LLMs further increased recall by 10% over text-only models, since they could pull in image-based cues when available — capturing cases where key details appear solely on packaging or where cross-referencing both sources is necessary."

Tradeoffs / gotchas

  • Multi-modal LLMs are more expensive per call — processing image tokens costs more than text tokens. Don't pay the premium if the value is reliably in text. See concepts/llm-cascade — text-only first, multi-modal only when text confidence is low.
  • Image-token budget is a real constraint. Product catalogs can have many images per SKU; you typically pick one primary image or risk blowing the context window.
  • OCR-style attributes interact with image quality. Low-resolution packaging shots, non-English labels, or obscured numeric fields lose accuracy regardless of the VLM's capability.
  • Hallucination from visual noise. VLMs can "read" numbers / text that aren't present, especially on ambiguous packaging. Pair with a self-verification entailment prompt.
  • Image-embedding as substitute may be cheaper — for some attributes (brand, category) a fine-tuned vision classifier is cheaper and more accurate than a multi-modal LLM. Multi-modal LLM shines when the attribute space is open-ended (e.g. "what's the flavor") and can't be enumerated in advance.
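The cascade mentioned in the first tradeoff can be sketched as follows. The `Extraction` type, the confidence score, and the 0.8 threshold are assumptions for illustration (in practice the score might come from logprobs or a verifier prompt, and the threshold would be tuned on labeled data).

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Extraction:
    value: Optional[str]
    confidence: float  # assumed 0-1 score, e.g. from logprobs or a verifier

def cascade(text_extract: Callable[[str], Extraction],       # cheap text-only LLM
            mm_extract: Callable[[str, bytes], Extraction],  # pricier multi-modal LLM
            description: str,
            image: Optional[bytes],
            threshold: float = 0.8) -> Extraction:           # hypothetical cut-off
    """Text-only first; pay for image tokens only when text confidence is low."""
    first = text_extract(description)
    if first.confidence >= threshold or image is None:
        return first  # text alone was enough (or there is no image to escalate to)
    return mm_extract(description, image)
```

Because most products resolve on the text pass, the expensive multi-modal call is only made for the low-confidence tail — which is exactly where the +10% recall lift comes from.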
