CONCEPT Cited by 2 sources
Multi-modal attribute extraction¶
Definition¶
Multi-modal attribute extraction is the pattern of using a vision-language model (VLM) — one that natively takes both image and text inputs — to extract a structured attribute value about a product / entity / document, reasoning across both modalities in a single forward pass.
The distinguishing feature is cross-modal reasoning: the model may use the image to find the value when text is missing, use text when the image is missing, OR cross-reference the two to verify or disambiguate.
Why the text-only baseline is insufficient¶
E-commerce / catalog data is structurally multi-modal:
- Value only in image. Sheet counts, serving sizes, and "organic" / "non-GMO" badges are often printed on the packaging image and absent from the text description or database field. A text-only pipeline systematically misses these — not because it's a bad model, but because the signal isn't in its input.
- Value only in text, but implicit. A description of "3 boxes of 124 tissues" never states the total sheet count (372) — it requires arithmetic reasoning over unstructured text. Traditional text ML finds this hard.
- Cross-reference improves precision. When the text says "Orange Drink, also available in Grape, Strawberry", the image (packaging color, fruit imagery) disambiguates which flavor is the primary one.
The text-only LLM partially solves the reasoning problem (it can multiply 3 × 124); it does not solve the image-only-signal problem.
Three extraction paths, picked by input availability¶
A multi-modal extraction pipeline adaptively uses whichever signal is available per product:
| Text available? | Image available? | Path |
|---|---|---|
| Yes | No | Text-only reasoning |
| No | Yes | Pure image extraction (value printed on packaging) |
| Yes | Yes | Cross-reference + consistency check |
The same model handles all three without routing logic — the VLM receives both inputs and the prompt asks "extract attribute X from whichever signal is informative".
Empirical lift (Instacart PARSE sheet_count)¶
- Legacy SQL rules: poor — can't parse "3 boxes of 124" or read images.
- Text-only LLM: "significant jump in both recall and precision" over SQL, due to arithmetic + contextual reasoning over unstructured description.
- Multi-modal LLM on top of text-only: +10% recall, driven by cases where the value is image-only or requires cross-reference.
From the post:

> "Text-only LLMs already delivered a significant jump in both recall and precision compared to legacy SQL approaches, thanks to their ability to reason through complex or implicit product descriptions. Multi-modal LLMs further increased recall by 10% over text-only models, since they could pull in image-based cues when available — capturing cases where key details appear solely on packaging or where cross-referencing both sources is necessary."
Tradeoffs / gotchas¶
- Multi-modal LLMs are more expensive per call — processing image tokens costs more than text tokens. Don't pay the premium if the value is reliably in text. See concepts/llm-cascade — text-only first, multi-modal only when text confidence is low.
- Image-token budget is a real constraint. Product catalogs can have many images per SKU; you typically pick one primary image or risk blowing the context window.
- OCR-style attributes interact with image quality. Low-resolution packaging shots, non-English labels, or obscured numeric fields lose accuracy regardless of the VLM's capability.
- Hallucination from visual noise. VLMs can "read" numbers / text that aren't present, especially on ambiguous packaging. Pair with a self-verification entailment prompt.
- A dedicated vision model may be the cheaper substitute. For closed attribute spaces (brand, category) a fine-tuned vision classifier is often cheaper and more accurate than a multi-modal LLM. The multi-modal LLM shines when the attribute space is open-ended (e.g. "what's the flavor?") and can't be enumerated in advance.
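The first tradeoff above can be made concrete as a routing sketch. `extract_text_only` and `extract_multimodal` are hypothetical stand-ins for the actual LLM calls, and the confidence threshold is an illustrative assumption; the point is the escalation logic, which reserves image-token spend for items the text pass can't settle.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, tune per attribute

def cascade_extract(text, image_url, extract_text_only, extract_multimodal):
    """Text-first cascade: escalate to the multi-modal model only when the
    cheap text pass returns nothing or is unconfident."""
    if text:
        value, confidence = extract_text_only(text)
        if value is not None and confidence >= CONFIDENCE_THRESHOLD:
            return value, "text-only"
    if image_url:
        # Pass the text along too, so the VLM can cross-reference.
        value, _ = extract_multimodal(text, image_url)
        return value, "multi-modal"
    return None, "no-signal"
```

A product whose description states "3 boxes of 124 tissues" resolves in the text pass; one whose sheet count appears only on the packaging falls through to the multi-modal call.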
Seen in¶
- sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — canonical wiki instance. Instacart's PARSE uses multi-modal LLMs as one of its supported extraction algorithms; the sheet_count case study documents +10% recall over the text-only LLM. Two motivating examples: (a) "80 sheets" printed on the packaging image only; (b) "3 boxes of 124 tissues" needing multiplication reasoning over text.
- sources/2026-02-17-instacart-turning-data-into-velocity-capers-edge-and-cloud-data-flywheel-with-capsight — future-work target on a different Instacart platform. Instacart's Capsight (edge→cloud data flywheel for Caper smart carts) explicitly names full sensor fusion — camera + weight + motion + location — fed into a foundation model as the next step beyond Phase-1 CV-only. Shows the concept generalising beyond catalog attributes (PARSE's domain) to real-world in-store environment understanding (intent detection, multi-item interactions).
Related¶
- concepts/vlm-as-image-judge — the image-scoring sibling; multi-modal extraction emits a value, VLM-as-judge emits a score. Both use VLMs for cross-modal reasoning.
- concepts/llm-self-verification — pair with self-verify to catch multi-modal hallucinations.
- concepts/llm-cascade — text → multi-modal cascade reserves the more expensive VLM for inputs text alone can't solve.
- patterns/llm-attribute-extraction-platform — the pattern this concept instantiates.
- systems/instacart-parse — canonical production instance.