
PARSE (Instacart's Product Attribute Recognition System for E-commerce)

Definition

PARSE is Instacart's internal self-serve, multi-modal LLM-based platform for structured product attribute extraction across its grocery catalog. PARSE replaces a prior patchwork of per-attribute SQL rules and bespoke text-only ML models with a single configurable service in which teams describe an attribute (name, type, description, prompt template, few-shot examples, input-data SQL, LLM choice) and the platform runs extraction across millions of SKUs, emitting both the extracted value AND a confidence score for downstream quality routing. (Source: sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms)

Announced 2025-08-01 on tech.instacart.com. Key contributors: Shishir Kumar Prasad, Matt Darcy, Paul Baranowski, Sonali Parthasarathy, DK Kwun, Peggy Men, Talha Maswala.

Acronym: Product Attribute Recognition System for E-commerce.

Why it exists

Prior to PARSE, Instacart used two attribute-creation approaches, each with a hard ceiling:

  • SQL rules — scalable but shallow. Fine for keyword extraction ("organic" in the description → organic: true); fails on anything needing context. Canonical failure: a product titled "Orange Drink" whose description lists variants "also available in Grape, Strawberry" — SQL cannot decide which is the primary flavor.
  • Text-only ML models — generalise via learned representations but every attribute needs its own labeled dataset, trained model, and maintained serving pipeline. This doesn't scale to thousands of attributes; worse, the whole stack is blind to image-only information (sheet count printed on packaging; ingredients listed only in the nutrition-facts panel).

"Achieving high-quality results for each attribute requires significant effort — from collecting and labeling specialized datasets to developing, training, and maintaining separate models and pipelines for every attribute of interest. This leads to a slower, more resource-intensive process as the catalog and attribute set grow. Both approaches also share a key limitation: they operate only on product text, leaving important gaps when attribute information is available solely in product images."

PARSE addresses both: (a) multi-modal LLM with zero/few-shot prompting means no per-attribute training pipeline; (b) LLM reasoning across text + image closes the text-only blind spot.

Architecture — four components

1. Platform UI (declarative, versioned config)

Users configure each step of an attribute-creation task via the UI:

  • Attribute definition — name, type (string, dict, number, boolean), natural-language description.
  • Extraction config — choice of LLM; choice of extraction algorithm; prompt template with placeholder slots for product features + attribute metadata.
  • Input-data SQL — what product features (title, description, category, nutrition panel text, image URLs, ...) feed the LLM, and how to pull them from the database.
  • Few-shot examples — optional exemplars injected into the prompt for style/format guidance (see concepts/few-shot-prompt-template).

All configurations are versioned — change history, author attribution, and rollback to a prior working config are first-class. This turns prompt iteration into a mergeable-config workflow rather than a code deploy.

"All of the configurations are versioned, allowing users to track changes, identify contributors, and revert to previous configurations if necessary."

A backend orchestration layer fetches products via the input SQL and dispatches them to the extraction endpoint.
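The config shape described above can be sketched as a plain Python structure. This is an illustrative reconstruction from the post's bullet list — the field names, schema, and template format are assumptions, not Instacart's actual API:

```python
# Hypothetical PARSE-style attribute config; all names are illustrative.
attribute_config = {
    "name": "sheet_count",
    "type": "number",          # one of: string, dict, number, boolean
    "description": "Total number of sheets in the package, across all packs.",
    "llm": "cheap-llm-v1",     # per-attribute model choice
    "prompt_template": (
        "Product title: {title}\n"
        "Description: {description}\n"
        "Attribute: {attr_name} ({attr_description})\n"
        "Return the attribute value only."
    ),
    "input_sql": "SELECT title, description, image_url FROM catalog WHERE ...",
    "few_shot_examples": [
        {"title": "Facial Tissue, 3 boxes of 124", "value": 372},
    ],
    "version": 3,              # configs are versioned for audit and rollback
}

def materialise_prompt(config: dict, product: dict) -> str:
    """Insert product features + attribute metadata into the prompt template."""
    return config["prompt_template"].format(
        attr_name=config["name"],
        attr_description=config["description"],
        **product,
    )

prompt = materialise_prompt(
    attribute_config,
    {"title": "Orange Drink", "description": "Also available in Grape"},
)
```

The `materialise_prompt` step corresponds to what the extraction endpoint does per product, per attribute (step 1 in the next section).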

2. ML Extraction endpoint

Per product, per attribute, the endpoint:

  1. Materialises the prompt by inserting the product's features + attribute definition into the template.
  2. Runs the selected extraction algorithm — the post mentions multiple supported algorithms so clients can balance cost vs. accuracy (see concepts/llm-cascade — the post calls one of these an "LLM cascade algorithm" in the cost-reduction discussion).
  3. Self-verifies the extracted value with a second entailment prompt — asking the LLM "is this extracted value correct based on the product features + attribute definition, yes or no?". The first generated token is constrained to yes / no; the logit of yes is read out and normalised into a probability — the confidence score. See concepts/llm-self-verification.

"We query the LLM with a second scoring prompt. The prompt will ask LLM to do an entailment task: asking LLM if the extracted attribute value by the extraction prompt is correct based on the product features and attribute definition. In the scoring prompt, we specifically ask LLM to output 'yes' or 'no' first. Then we can get the logit of the first generated token, and compute the token probability of 'yes' as the confidence score."

The post cites AutoMix (2023) ([2] in references) as the literature basis for the self-verification technique.
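The quoted confidence computation reduces to a softmax over the constrained first-token pair. A minimal sketch, assuming the serving stack exposes the raw logits of "yes" and "no":

```python
import math

def yes_confidence(yes_logit: float, no_logit: float) -> float:
    """Probability of 'yes' over the constrained {yes, no} first token,
    per the post's entailment-based self-verification step."""
    m = max(yes_logit, no_logit)        # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Equal logits give 0.5; the score rises monotonically with the margin between "yes" and "no".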

3. Quality Screening

Same framework, two modes:

Development mode — used while iterating on a prompt before production deploy:

  • Client uploads a small product sample.
  • Human auditors label gold attribute values via the built-in labeling UI.
  • LLM-as-judge (concepts/llm-as-judge) runs auto-eval to speed iteration between human-eval cycles.
  • Quality metrics computed; client decides whether to iterate the prompt further or promote to production.

Production mode — two orthogonal review populations:

  • Periodic random sample of new extractions → human + LLM-as-judge eval → catches systematic drift (e.g. a new brand family the LLM has never seen, or a prompt that was fine on launch products but degrades on newer SKUs). See patterns/human-in-the-loop-quality-sampling.
  • Proactive low-confidence triage — extractions with confidence below threshold are routed to human review for correction before reaching the catalog. See patterns/low-confidence-to-human-review.

Two populations, two failure modes caught — the random-sample loop catches failures the confidence score itself misses (calibration failure); the confidence-triage loop catches known-uncertain outputs the random sample would rarely pick up.
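The two production review populations can be sketched as a routing function. The threshold and sampling rate are not disclosed in the post — the values below are placeholders:

```python
import random

CONFIDENCE_THRESHOLD = 0.85   # not disclosed; illustrative
RANDOM_SAMPLE_RATE = 0.01     # not disclosed; illustrative

def route_extraction(confidence: float, rng=random.random) -> str:
    """Route a production extraction into one of the two review
    populations, else write straight to the catalog."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"      # proactive low-confidence triage
    if rng() < RANDOM_SAMPLE_RATE:
        return "quality_sample"    # periodic random audit for drift
    return "catalog"
```

Note the ordering: triage is checked first, so the random sample audits only extractions the confidence score already passed — which is exactly where calibration failures hide.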

4. Catalog ingestion

Final extraction results are written into the Instacart catalog data pipeline. No internals disclosed beyond the hand-off point.

Named architectural insights

Multi-modal reasoning closes the text-only blind spot

Two canonical PARSE examples cited in the post:

  • Dry sheet / household product: text lists no sheet count; the packaging image shows "80 sheets". Only a multi-modal LLM recovers the value.
  • Multi-pack product: description reads "3 boxes of 124 tissues" — no total sheet count is stated. The LLM must multiply 3 × 124 = 372 to recover it.

Quantified: on sheet_count, text-only LLMs gave a significant jump over legacy SQL; multi-modal LLMs added another +10% recall on top, capturing cases where the value exists only in the image or requires cross-reference between text and image.

Per-attribute prompt-tuning effort + LLM size trade-off

PARSE makes this a first-class decision per attribute:

| Attribute | Complexity | PARSE iteration time | Cheap-LLM cost vs. quality |
|---|---|---|---|
| organic claim | Simple (is "organic" claimed?) | 1 day (95% accuracy on first prompt; vs. 1 week with traditional approach) | −70% cost at equivalent quality |
| low_sugar claim | Complex (numeric + category rules, often implicit in nutrition image) | 3 days (multiple iterations) | −60% accuracy with cheap LLM |

Implication: a platform that locks you into one LLM leaves money on the table for simple attributes AND gives up quality on hard ones. PARSE exposes LLM choice as per-attribute config, not a platform default.
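One way to read the post's "LLM cascade algorithm" in this cost/quality light: run the cheap model first and escalate to the stronger model only when the self-verified confidence is low. A sketch under that assumption (model interfaces and threshold are illustrative, not from the post):

```python
def cascade_extract(product, cheap_llm, strong_llm, escalate_below=0.9):
    """Try the cheap LLM first; escalate to the strong LLM only
    when the self-verified confidence falls below the threshold."""
    value, confidence = cheap_llm(product)
    if confidence >= escalate_below:
        return value, confidence, "cheap"
    value, confidence = strong_llm(product)
    return value, confidence, "strong"

# Stub models standing in for real LLM calls:
confident_cheap = lambda p: ("organic: true", 0.96)
unsure_cheap = lambda p: ("organic: true", 0.40)
strong = lambda p: ("organic: false", 0.97)
```

Simple attributes like the organic claim mostly terminate at the cheap tier; hard ones like low_sugar escalate, which is the −70% cost / −60% accuracy trade-off the table makes concrete.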

Relation to sibling PIXEL

PARSE and PIXEL share architectural DNA:

  • Both are internal self-serve platforms consolidating what was previously per-team fragmented LLM/gen-AI work.
  • Both ship defaults-with-overrides prompt/config UX — see concepts/self-serve-generative-ai.
  • Both decouple model choice from caller code — see concepts/model-agnostic-ml-platform.
  • Both use an LLM-in-the-loop evaluator (PIXEL uses a VLM for image quality; PARSE uses an LLM-as-judge for extraction correctness).
  • Key difference: PIXEL's output is image pixels scored by a VLM rubric; PARSE's output is structured attribute values scored by a logit-based confidence score + human-in-the-loop sampling.

The same architectural stance, applied in two different modalities, works both times.

Ongoing / future work (not shipped yet)

  • Multi-attribute batching: extract attributes A, B, C in a single prompt per product to amortise the shared product-feature context.
  • Multi-product batching: extract one attribute across products P1, P2, P3 in a single prompt to amortise the shared attribute-definition context. See patterns/multi-attribute-multi-product-prompt-batching.
  • LLM approximation via similarity cache: skip LLM calls for products "similar enough" to one whose attribute is already cached. Blocked on a product-similarity / duplicate-detection function — acknowledged as "a challenging problem". See patterns/llm-extraction-cache-by-similarity.
  • Automatic prompt tuning: cites recent literature on LLM-as-prompt-optimiser ([6] "Large Language Models as Optimizers", [7] "EvoPrompt", [8] multi-stage prompt pipelines) as the direction to remove the human-in-the-loop prompt-iteration bottleneck.
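The multi-attribute batching idea amounts to sharing one product-feature block across several attribute definitions in a single prompt. A hypothetical sketch of what such a prompt builder might look like (format and field names are assumptions, not from the post):

```python
def batched_attribute_prompt(product_features: str, attributes: list) -> str:
    """Build one prompt that asks for several attributes of one product,
    amortising the shared product-feature context across attributes."""
    lines = [
        "Product:",
        product_features,       # included once, shared by all attributes
        "",
        "Extract each attribute and return one JSON object:",
    ]
    for attr in attributes:
        lines.append(f'- {attr["name"]} ({attr["type"]}): {attr["description"]}')
    return "\n".join(lines)

prompt = batched_attribute_prompt(
    "Title: Facial Tissue, 3 boxes of 124 tissues",
    [
        {"name": "sheet_count", "type": "number",
         "description": "Total sheets across all packs."},
        {"name": "organic", "type": "boolean",
         "description": "Whether the product claims to be organic."},
    ],
)
```

Multi-product batching is the transpose: one attribute definition shared across several products' features in a single prompt.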

Caveats

  • No confidence-score calibration disclosed. The raw yes-logit probability is used as the confidence score; whether it's well-calibrated at the Instacart-scale / Instacart-distribution level is not reported.
  • Low-confidence threshold and review budget not shared.
  • LLM approximation cache + prompt batching are plans, not shipped.
  • No latency / throughput / cost numbers. Per-attribute engineering time is the headline metric; no prod throughput, p95 latency, or daily token spend.
  • No fine-tune discussion. All extraction appears to be zero-shot or few-shot prompted.
