# Instacart — Scaling Catalog Attribute Extraction with Multi-modal LLMs (PARSE)
## Summary
Instacart Engineering post (2025-08-01) introducing PARSE
(Product Attribute Recognition System for
E-commerce) — a self-serve, multi-modal LLM platform for
extracting structured product attributes (flavor, size, sheet
count, "organic", "low-sugar") across Instacart's catalog of
millions of SKUs. The post frames the problem as org-level
fragmentation: prior to PARSE, every attribute required its own
SQL rules or bespoke text-only ML model, each with hand-labeled
datasets and a separate training+serving pipeline — slow, expensive,
and blind to information that only appears in product images.
PARSE consolidates attribute extraction into four components:
(1) Platform UI where users declaratively configure an
attribute (name, type, description, prompt template, few-shot
examples, input-data SQL, LLM choice) with full version-control
of configs; (2) ML Extraction endpoint that materialises the
LLM prompt per product, runs the selected extraction algorithm,
and emits both the extracted value AND a self-verification
confidence score (entailment prompt asking the LLM "is this
correct, yes/no?" + logit of the yes token as the probability);
(3) Quality Screening for both development-mode (human + LLM-
as-a-judge evaluation on a sample) and production-mode (periodic
human+LLM sampling for drift detection, plus proactive low-
confidence-score triage to human review); (4) an ingestion hand-
off into the catalog data pipeline. Three reusable pattern-level
ideas surface: multi-modal reasoning closes the gap where text
alone is missing the value (e.g. sheet_count visible on
packaging only); different attributes need different prompt-
tuning effort and different model sizes (cheap LLM = equivalent
quality on "organic" at a 70% cost reduction; 60% accuracy drop on "low-
sugar"); and LLM cost-reduction comes from batching attributes
and products into a single prompt + an extraction cache keyed by
product similarity. Reported numbers: organic-claim prompt
went from 1 week (traditional) to 1 day (PARSE) at 95%
accuracy; complex "low-sugar" attribute dropped to 3 days of
iteration; multi-modal LLMs lifted recall of sheet_count by
10% over text-only LLMs on top of a much larger lift over
legacy SQL; simpler attributes saw 70% cost reduction by
downsizing LLM; harder attributes 60% accuracy drop on cheap
LLM — making model-per-attribute selection load-bearing.
## Key takeaways
- Org-level fragmentation was the problem, not model quality. Prior Instacart attribute pipelines were a patchwork of SQL rules (scalable but shallow — can catch an "organic" keyword but fails when the primary flavor is "Orange" and the description mentions Grape + Strawberry as variants) and bespoke text-only ML models (generalise, but every attribute needs its own labeled dataset + trained model + maintained pipeline). "Achieving high-quality results for each attribute requires significant effort — from collecting and labeling specialized datasets to developing, training, and maintaining separate models and pipelines for every attribute of interest. This leads to a slower, more resource-intensive process as the catalog and attribute set grow. Both approaches also share a key limitation: they operate only on product text, leaving important gaps when attribute information is available solely in product images." PARSE collapses this into one configurable, multi-modal platform — the same consolidate-don't-improve-the-model thesis that underpins its sibling PIXEL. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
- Multi-modal reasoning closes the text-only blind spot. Two archetypal examples from the post: (a) a household dry-sheet product where the text says nothing about sheet count but "80 sheets" is printed on the packaging image — only a multi-modal LLM can extract it; (b) a multi-pack product where the description reads "3 boxes of 124 tissues" and only logical deduction (3 × 124 = 372) gets the total sheet count. Either path requires the model to reason across text+image OR cross-reference them to verify. "Text-only LLMs already delivered a significant jump in both recall and precision compared to legacy SQL approaches… Multi-modal LLMs further increased recall by 10% over text-only models, since they could pull in image-based cues when available — capturing cases where key details appear solely on packaging or where cross-referencing both sources is necessary." (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
- Self-verification via entailment prompt + logit yields a confidence score. PARSE scores extracted values by sending a second prompt that asks the LLM "is this extracted value correct based on the product features and attribute definition — yes or no?" with the constraint that the first generated token be "yes" or "no". The logit of the "yes" token (normalised into a probability) becomes the confidence score. "We query the LLM with a second scoring prompt. The prompt will ask LLM to do an entailment task: asking LLM if the extracted attribute value by the extraction prompt is correct based on the product features and attribute definition… we specifically ask LLM to output 'yes' or 'no' first. Then we can get the logit of the first generated token, and compute the token probability of 'yes' as the confidence score." This is a textbook LLM self-verification implementation — the confidence is not sampled, it is a direct read-off of the output distribution. Cites the literature basis (AutoMix [2]). (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
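The read-off can be sketched as follows, assuming a hypothetical `{token: logprob}` map for the first generated token, the shape most LLM APIs expose when logprobs are requested:

```python
import math

def yes_confidence(first_token_logprobs: dict[str, float]) -> float:
    """Entailment read-off: renormalise the first-token distribution
    over the two admissible tokens and use P('yes') as the confidence."""
    l_yes = first_token_logprobs.get("yes", float("-inf"))
    l_no = first_token_logprobs.get("no", float("-inf"))
    return math.exp(l_yes) / (math.exp(l_yes) + math.exp(l_no))
```

The score is deterministic given the logits, which is the point: no sampling noise, just the model's own output distribution.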
- Low-confidence → human review is the production quality gate. PARSE uses the confidence score for proactive error detection in prod: "this process considers the extracted values of products with a low confidence score as potentially incorrect values, and has them reviewed and corrected by human auditors." This is a generic routing-by-uncertainty primitive — see patterns/low-confidence-to-human-review — but crucially it's not the only human-review path: a periodic sample of all extractions (regardless of confidence) is also sent to human / LLM-as-judge review to detect systematic drift the confidence score itself might miss. Two different review populations → two different failure modes caught. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
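A minimal sketch of the two review populations, with a made-up threshold and sample rate since the post discloses neither:

```python
import random

def route_for_review(extractions, threshold=0.7, sample_rate=0.01, rng=random):
    """Split extractions into the two review populations the post names:
    (1) low-confidence values, routed to human auditors as likely errors;
    (2) a small uniform sample of ALL values, reviewed by humans /
    LLM-as-judge to catch drift the confidence score itself misses."""
    low_conf = [e for e in extractions if e["confidence"] < threshold]
    drift_sample = [e for e in extractions if rng.random() < sample_rate]
    return low_conf, drift_sample
```

Keeping the drift sample independent of confidence is what lets it catch systematic failures the score is blind to.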
- Different attributes ⇒ different prompt-tuning effort AND different LLM size — model-per-attribute is load-bearing. The "organic" claim (simple definition: is there an "organic" label?) hit 95% accuracy on the first prompt and took 1 day to ship (vs. a week with the old SQL approach). "Low-sugar" (complex: < N grams per serving, different rules per category, often implicit in nutrition facts image) needed multiple prompt iterations but still shipped in 3 days via the PARSE UI. Crucially on model size: for simple attributes, a "cheaper but less powerful LLM delivered similar quality to more powerful ones at a 70% cost reduction"; for hard attributes, the same cheap LLM had a 60% accuracy drop. Conclusion: "selecting the right extraction model to balance cost and quality effectively" per attribute is first-class, not a micro-optimisation. This is why PARSE exposes the LLM choice as a per-attribute config — see concepts/llm-cascade for the compound version (cheap LLM first, expensive LLM only if low-confidence). (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
- Self-serve UX turns prompt iteration into a 3-day workflow. The Platform UI makes attribute creation declarative: attribute name + type + description + prompt template + few-shot examples + input-data SQL + LLM choice are configuration, not code. Configurations are versioned (track changes, identify contributors, revert). "With our PARSE platform, this only took us one day of effort, compared to one week previously when using traditional methods… Conversely, difficult attributes such as the 'low sugar' claim have more complex guidelines and require multiple prompt iterations for high-quality extraction. However, with PARSE, the iteration process for these more challenging attributes was still reduced to just three days due to the easy-to-use PARSE UI design." Same self-serve-generative-AI posture as sibling PIXEL: a config UI + defaults-with-overrides, not a code library. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
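A hedged sketch of what such a declarative, versioned attribute config might look like; the field names and defaults are illustrative, not Instacart's actual schema:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AttributeConfig:
    """Illustrative PARSE-style config: everything that defines an
    attribute extraction is data, not code, and edits produce a new
    version rather than mutating in place (so configs can be reverted)."""
    name: str                      # e.g. "low_sugar"
    value_type: str                # e.g. "boolean"
    description: str               # attribute definition shown to the LLM
    prompt_template: str           # with {product_features}-style slots
    few_shot_examples: tuple = ()  # (input, expected_value) pairs
    input_sql: str = ""            # which products / features to pull
    llm: str = "cheap-model"       # per-attribute model choice
    version: int = 1

    def bump(self, **changes) -> "AttributeConfig":
        # every edit is a new immutable version
        return replace(self, version=self.version + 1, **changes)
```

The per-attribute `llm` field is the hook for the model-per-attribute selection the post calls load-bearing.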
- Cost-reduction future work: prompt batching + extraction cache keyed by product similarity. Two named optimisations PARSE plans to explore: (a) multi-attribute OR multi-product batching in a single prompt — avoid re-sending the same product features to extract attribute A then attribute B, OR avoid re-sending the same attribute definition for every product. Both are forms of prompt batching and amortise the shared-context token cost across many outputs. (b) LLM approximation — a cache of prior extraction results keyed by a product-similarity function: if a new product is "similar enough" to one whose attribute was already extracted, return the cached value instead of re-calling the LLM. Explicitly calls out the hard part: "we will need to define a similarity function that is able to help determine if two products have the same attribute values. This will be a challenging problem but there is ongoing work in duplicate product detection that we can take advantage of." Generalises to patterns/llm-extraction-cache-by-similarity. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
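The cache idea can be sketched as follows; the similarity function and threshold are left as injectable parameters precisely because defining them is the open problem the post names:

```python
def cached_extract(product, attribute, cache, similarity, extract_llm,
                   min_sim=0.95):
    """Sketch of the planned LLM-approximation cache (not shipped PARSE
    code; min_sim is a made-up threshold). If a sufficiently similar
    product already has this attribute extracted, reuse its value and
    skip the LLM call entirely."""
    for (cached_product, cached_attr), value in cache.items():
        if cached_attr == attribute and similarity(product, cached_product) >= min_sim:
            return value  # cache hit: no LLM call
    value = extract_llm(product, attribute)
    cache[(product, attribute)] = value
    return value
```

The caveat from the source applies directly here: a false-positive `similarity` serves a wrong value with no LLM call, which is the failure mode to guard against.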
- LLM-as-a-judge is used in both development and production. Development mode: the platform ships an LLM-as-judge auto-evaluator alongside the human-eval interface so that small-sample quality can be estimated quickly between prompt iterations without blocking on humans. Production mode: LLM auto-eval is part of the drift-detection periodic sample (alongside human auditors) to catch quality regressions on newly onboarded products. Matches the LLM-as-a-judge pattern's "regression harness + continuous production-sample eval" dual usage. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)
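A minimal sketch of the shared judge loop serving both modes; the `judge_llm` callable and the sample size are assumptions, not disclosed details:

```python
def judge_sample(extractions, judge_llm, rng, k=50):
    """Estimate extraction accuracy on a small random sample, the same
    primitive used between prompt iterations (dev) and as a periodic
    drift check (prod). `judge_llm` is a hypothetical callable that
    returns True if the extracted value looks correct."""
    sample = rng.sample(extractions, min(k, len(extractions)))
    verdicts = [judge_llm(e) for e in sample]
    return sum(verdicts) / len(verdicts)  # estimated accuracy on the sample
```

In dev this unblocks fast iteration; in prod the same estimate, tracked over time, is the drift signal.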
## Architecture components (named by the post)
| # | Component | Responsibility |
|---|---|---|
| 1 | Platform UI | Declarative, versioned config of attribute name/type/description, extraction prompt template, few-shot examples, input-data SQL, LLM model choice + extraction algorithm |
| 2 | ML Extraction endpoint | Materialise prompt per product, run extraction LLM, run self-verification (entailment + logit) to emit confidence score |
| 3 | Quality Screening | Dev-mode: human-labeling UI + LLM-as-judge auto-eval. Prod-mode: periodic sample for drift detection + low-confidence-score → human review for proactive error correction |
| 4 | Catalog ingestion | Final extracted values flow into the Instacart catalog data pipeline |
## Operational numbers disclosed
- Organic claim: 95% accuracy on first prompt; 1 day PARSE vs. 1 week traditional.
- Low-sugar claim (complex): 3 days iteration in PARSE (no pre-PARSE baseline disclosed; implied weeks).
- sheet_count: multi-modal LLM gains +10% recall over text-only LLM; text-only LLM was already a "significant jump" in recall+precision over SQL (no absolute numbers).
- Cost/quality per attribute × model size:
  - Simple attributes: cheap LLM = same quality, -70% cost vs. expensive LLM.
  - Hard attributes: cheap LLM = -60% accuracy vs. expensive LLM.
- Catalog scale: "millions of SKUs across thousands of categories" — no breakdown by category.
## Caveats
- No confidence-score calibration numbers. The yes-token logit is used as-is; whether it's well-calibrated (low Expected Calibration Error, reliable ordering of "probably wrong" vs. "probably right" extractions) is not disclosed. The post cites confidence-elicitation literature ([4], [5]) but doesn't report PARSE's own calibration metric.
- "Confidence score" threshold for HITL routing not shared. The rule "low confidence → human review" is stated but the specific cutoff and its trade-off (review budget vs. missed errors) is not.
- Duplicate-product similarity function is still future work. The LLM-approximation cache is a plan, not a shipped component. If the similarity function false-positives, the cache serves wrong attribute values with no LLM call — which is worse than cache-miss latency.
- Prompt-batching cost/quality trade-off also future work. Batching multiple products in a single prompt risks per-product quality drop from context dilution; no numbers yet.
- No fine-tune vs. zero-shot comparison. All PARSE extraction appears to be zero-shot or few-shot prompted — there's no disclosure of whether DreamBooth-style fine-tuning (used by sibling PIXEL) is on the roadmap for attribute extraction.
- No latency / throughput figures. The post focuses on per-attribute engineering time (days) and per-product quality (accuracy). Nothing on prod throughput, p95 latency, daily LLM token spend, or rate limits from provider APIs.
## Source
- Original: https://tech.instacart.com/multi-modal-catalog-attribute-extraction-platform-at-instacart-b9228754a527?source=rss----587883b5d2ee---4
- Raw markdown: raw/instacart/2025-08-01-scaling-catalog-attribute-extraction-with-multi-modal-llms-539b02e6.md
## Related
- systems/instacart-parse — the named platform
- systems/instacart-pixel — sibling image-generation platform; same self-serve + model-agnostic architectural stance applied to a different modality (image generation vs. structured attribute extraction)
- concepts/multi-modal-attribute-extraction
- concepts/llm-self-verification
- concepts/llm-cascade
- concepts/llm-as-judge
- concepts/iterative-prompt-refinement
- concepts/few-shot-prompt-template
- concepts/self-serve-generative-ai
- concepts/model-agnostic-ml-platform
- patterns/llm-attribute-extraction-platform
- patterns/low-confidence-to-human-review
- patterns/human-in-the-loop-quality-sampling
- patterns/multi-attribute-multi-product-prompt-batching
- patterns/llm-extraction-cache-by-similarity
- companies/instacart