
PATTERN Cited by 1 source

LLM attribute extraction before embedding

Pattern

Use a lightweight LLM to extract structured attributes from noisy raw product text before passing it to an embedding model. The LLM strips marketing copy, normalizes attribute representations, and produces a clean structured input that an off-the-shelf embedding model can encode without being misled by noise.

Quote (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"Standardize before you embed. Lightweight attribute extraction is a high-ROI preprocessing step that reduces noise throughout the pipeline."

The two-stage pipeline

Stage 1: LLM attribute extraction

Run each product through a fast / cheap LLM with a prompt that extracts structured attributes. Instacart uses Gemini Flash ("~10x faster, ~5x cheaper than full-size models") and extracts:

  • Product type
  • Key ingredients
  • Dietary tags
  • Format

The LLM is also instructed to strip marketing copy and metadata that would distract the downstream embedder.
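
As a minimal sketch of Stage 1 (not Instacart's implementation): the prompt wording, the `ProductAttributes` record, and the `call_llm` callable below are all assumptions made for illustration, since the source does not disclose the actual prompt or API calls. `fake_llm` is a stand-in that returns a canned reply so the sketch runs offline; a real pipeline would pass a function that calls a lightweight model such as Gemini Flash.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt wording; the source does not publish the real template.
EXTRACTION_PROMPT = (
    "Extract these fields from the product text and reply with JSON only:\n"
    "product_type (string), key_ingredients (list of strings), "
    "dietary_tags (list of strings), format (string).\n"
    "Ignore marketing copy, slogans, and catalog metadata.\n\n"
    "Product text:\n"
)


@dataclass(frozen=True)
class ProductAttributes:
    product_type: str
    key_ingredients: tuple
    dietary_tags: tuple
    format: str


def _norm(value: object) -> str:
    """Lowercase and trim so equivalent values compare equal."""
    return str(value).strip().lower()


def extract_attributes(raw_text: str, call_llm: Callable[[str], str]) -> ProductAttributes:
    """Send the prompt plus raw text to the LLM and parse its JSON reply."""
    reply = call_llm(EXTRACTION_PROMPT + raw_text)
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    return ProductAttributes(
        product_type=_norm(data["product_type"]),
        key_ingredients=tuple(_norm(v) for v in data["key_ingredients"]),
        dietary_tags=tuple(_norm(v) for v in data["dietary_tags"]),
        format=_norm(data["format"]),
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned JSON."""
    return json.dumps({
        "product_type": " Oat Milk ",
        "key_ingredients": ["Oats", "Water"],
        "dietary_tags": ["Vegan"],
        "format": "Carton",
    })


raw = "NEW! Barista's Dream OAT MILK, creamy and 100% plant-based. Oats, water. Vegan. 32 fl oz carton."
attrs = extract_attributes(raw, fake_llm)
print(attrs)
```

Passing the model call in as a plain function keeps the parsing and normalization logic testable without network access.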

Stage 2: Embedding model

Pass the cleaned, structured representation to an off-the-shelf embedding model. Instacart uses Gemma as a general-purpose embedder. Fed raw catalog text, such a model would underperform domain-specific embeddings; the hypothesis is that, fed LLM-cleaned attribute representations, it can capture nuances that a domain-specific model misses:

"The goal is to test whether a general-purpose model, given cleaner inputs, can capture nuances that a domain-specific model misses."

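One way to hand the extracted attributes to the embedder is to render them as a canonical string first. The sketch below assumes a fixed field order, sorted list values, and a `key: value` layout; none of these choices come from the source. `toy_embed` is a placeholder for a real embedding model such as Gemma and produces meaningless vectors.

```python
from typing import Callable, List, Mapping, Sequence

# Assumed field order; fixing it keeps the rendered text deterministic.
FIELD_ORDER = ("product_type", "key_ingredients", "dietary_tags", "format")


def to_embedding_text(attrs: Mapping[str, object]) -> str:
    """Render attributes so that equivalent products yield identical strings."""
    parts = []
    for field in FIELD_ORDER:
        value = attrs.get(field)
        if not value:
            continue  # skip missing or empty fields instead of emitting blanks
        if isinstance(value, (list, tuple)):
            rendered = ", ".join(sorted(str(v).strip().lower() for v in value))
        else:
            rendered = str(value).strip().lower()
        parts.append(f"{field.replace('_', ' ')}: {rendered}")
    return "; ".join(parts)


def embed_catalog(
    records: Sequence[Mapping[str, object]],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Serialize every record, then embed the whole batch in one call."""
    return embed_fn([to_embedding_text(r) for r in records])


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Placeholder embedder: text length and comma count, not a real model."""
    return [[float(len(t)), float(t.count(","))] for t in texts]


a = {"product_type": "Oat Milk", "key_ingredients": ["water", "oats"],
     "dietary_tags": ["Vegan"], "format": "carton"}
b = {"format": "Carton", "dietary_tags": ["vegan"],
     "key_ingredients": ["Oats", "Water"], "product_type": "oat milk "}

text_a = to_embedding_text(a)
text_b = to_embedding_text(b)
vectors = embed_catalog([a, b], toy_embed)
print(text_a)
```

Because casing, whitespace, and list order are normalized before embedding, the two records above serialize to the same text and therefore receive the same vector.
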
Why this is high-ROI

The pattern sidesteps the usual choice between a domain-specific embedding (trained on noisy domain data, robust to noise but narrow) and a general-purpose embedding (trained on clean data, with a broader representational range but vulnerable to domain noise). Instead of choosing between them, you make the general-purpose embedding domain-robust by cleaning its inputs upstream.

The economic argument:

| Component | Cost | What it provides |
| --- | --- | --- |
| Domain-specific embedding model (training + serving) | High | Domain robustness; narrow representational range |
| General-purpose embedding model (off the shelf) | Low | Broader representational range; vulnerable to domain noise |
| General-purpose embedding + LLM upstream cleaning | Lower than dedicated embedding training (because Gemini Flash is cheap) | Broader representational range plus domain robustness: the best of both, when the LLM cleaning step is fast and cheap enough |

The Gemini Flash specifics matter: Instacart cites "~10x faster, ~5x cheaper" than full-size models, a regime in which the LLM preprocessing cost stays small relative to the embedding-model serving cost.
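
To make the cost argument concrete, here is a back-of-the-envelope sketch. The only figure taken from the source is the "~5x cheaper" ratio; the catalog size, tokens per product, and per-token price are placeholder assumptions, not measured or published values.

```python
def batch_llm_cost(n_items: int, tokens_per_item: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of one LLM pass over the whole catalog."""
    return n_items * tokens_per_item / 1_000_000 * usd_per_million_tokens


# Placeholder assumptions, chosen only to make the arithmetic easy to follow.
N_ITEMS = 1_000_000        # products in the catalog
TOKENS_PER_ITEM = 400      # prompt + reply tokens per product
FULL_SIZE_PRICE = 5.0      # USD per million tokens for a full-size model
CHEAPER_FACTOR = 5         # the "~5x cheaper" ratio quoted in the source

full_cost = batch_llm_cost(N_ITEMS, TOKENS_PER_ITEM, FULL_SIZE_PRICE)
light_cost = batch_llm_cost(N_ITEMS, TOKENS_PER_ITEM, FULL_SIZE_PRICE / CHEAPER_FACTOR)

print(f"full-size model:   ${full_cost:,.2f}")
print(f"lightweight model: ${light_cost:,.2f}")
```

Under these placeholder inputs the cleaning pass costs one fifth as much with the lightweight model; whether that is small relative to embedding serving depends on numbers the source does not give.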

When this pattern composes

The Instacart application sits in the discovery flavor of the two-flavor codebook pattern: the precision flavor uses a domain-specific embedding directly (no LLM step), while the discovery flavor feeds LLM-cleaned text to an off-the-shelf embedding.

The pattern composes well with:

  • Noisy-input domains: raw text is high-noise (catalog descriptions, user-generated content, product listings). The LLM extracts signal, so the embedder doesn't have to fight noise.
  • General-purpose embedding models: a single embedding model serves many domains, and the LLM cleaning step adapts the generic embedder to each domain.
  • Two-flavor / multi-flavor setups: one flavor wants domain-specific tightness (no LLM step) and another wants general-purpose breadth (LLM-cleaned).
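
A two-flavor setup can be sketched as a small configuration object in which the LLM cleaning step is optional per flavor. The `Flavor` type, `encode` function, and the stub embedders and cleaner below are hypothetical names for illustration; the source does not describe how Instacart wires its flavors together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Embedder = Callable[[str], List[float]]
Cleaner = Callable[[str], str]


@dataclass(frozen=True)
class Flavor:
    name: str
    embedder: Embedder
    cleaner: Optional[Cleaner] = None  # None means: embed the raw text directly


def encode(raw_text: str, flavor: Flavor) -> List[float]:
    """Apply the flavor's cleaning step if it has one, then embed."""
    text = flavor.cleaner(raw_text) if flavor.cleaner is not None else raw_text
    return flavor.embedder(text)


# Stand-ins so the sketch runs offline; they record what each embedder received.
seen: Dict[str, str] = {}


def make_recording_embedder(label: str) -> Embedder:
    def embed(text: str) -> List[float]:
        seen[label] = text
        return [float(len(text))]  # placeholder vector, not a real embedding
    return embed


def stub_cleaner(raw_text: str) -> str:
    """Placeholder for the LLM extraction step."""
    return "product type: oat milk"


precision = Flavor("precision", make_recording_embedder("precision"))
discovery = Flavor("discovery", make_recording_embedder("discovery"), cleaner=stub_cleaner)

raw = "NEW! Creamy OAT MILK, best in town!!"
encode(raw, precision)
encode(raw, discovery)
print(seen)
```

Keeping the cleaner optional makes the difference between flavors a matter of configuration rather than separate pipelines.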

Generalization beyond grocery

| Domain | Raw input noise | LLM-extracted structured attributes |
| --- | --- | --- |
| E-commerce / grocery (Instacart) | Marketing copy, brand fluff, inconsistent metadata | Product type, ingredients, dietary tags, format |
| Job listings | HR boilerplate, redundant qualifications | Required skills, level, role, location |
| Real estate | Marketing language, redundant descriptions | Sq ft, bedrooms, location, amenities |
| Medical literature | Verbose abstracts, inconsistent terminology | Disease, drug, dosage, study type |
| Restaurant reviews | Anecdotal noise, formatting variance | Cuisine, price tier, dish list, sentiment |
| Bug reports | Conversational noise, incomplete repro steps | Component, severity, repro steps, error type |
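
The table above can be read as a per-domain schema registry that drives the extraction prompt. The sketch below encodes three of its rows; the snake_case field names, the `DOMAIN_SCHEMAS` registry, and the prompt wording are assumptions of this sketch rather than anything the source specifies.

```python
from typing import Dict, Tuple

# Field lists mirror rows of the table; the snake_case names are this sketch's choice.
DOMAIN_SCHEMAS: Dict[str, Tuple[str, ...]] = {
    "grocery": ("product_type", "ingredients", "dietary_tags", "format"),
    "job_listings": ("required_skills", "level", "role", "location"),
    "bug_reports": ("component", "severity", "repro_steps", "error_type"),
}


def build_prompt(domain: str, raw_text: str) -> str:
    """Build an extraction prompt from the domain's registered schema."""
    fields = DOMAIN_SCHEMAS[domain]  # raises KeyError for an unregistered domain
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields and reply with JSON only: {field_list}.\n"
        "Ignore boilerplate and promotional language.\n\n"
        f"Text:\n{raw_text}"
    )


prompt = build_prompt("job_listings", "We are a fast-paced family! Senior Go engineer, Berlin.")
print(prompt)
```

Adding a domain then means registering a field list, while the extraction and embedding code stays unchanged.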

Caveats

  • LLM-cleaning cost is workload-dependent. Gemini Flash is cheap; full-size Gemini at the same volume might exceed the embedding-model cost. The pattern only pays off when an efficient LLM keeps the cleaning cost bounded.
  • LLM extraction can hallucinate attributes. Strict prompt engineering, structured output schemas, and validation are necessary; otherwise the embedding consumes hallucinated noise.
  • Drift in extraction quality — LLM provider model updates can change extraction behavior in subtle ways; codebook reproducibility may suffer.
  • Loss of information — by stripping marketing copy, the LLM removes signal that might have been useful (e.g. brand-prestige cues encoded in marketing voice). Trade-off depends on the flavor's intended cluster character.
  • Latency budget — LLM call adds latency to the pipeline; only acceptable when the embedding step is offline / batched (which it is for codebook training) or when the LLM is fast enough.
  • Attribute schema choice is a design decision — the four Instacart attributes (product type, ingredients, dietary tags, format) reflect a grocery-specific design; other domains need domain-specific schema design.
  • Prompt template not disclosed: the source does not say what specifically the LLM is asked to do, what stripping rules apply, or what fallback handling exists for products that don't match the schema.
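
Two of the caveats above, hallucinated attributes and extraction drift, lend themselves to simple guards. The sketch below is one possible approach under stated assumptions: the allowed-tag vocabulary is invented, the substring check is a crude grounding heuristic that will miss paraphrases, and `extraction_fingerprint` is a hypothetical helper for recording which model and prompt produced a batch.

```python
import hashlib
from typing import Dict, List

# Hypothetical controlled vocabulary; a real one would be maintained per catalog.
ALLOWED_DIETARY_TAGS = frozenset({"vegan", "vegetarian", "gluten-free", "organic", "kosher"})
FIELD_TYPES = {"product_type": str, "key_ingredients": list, "dietary_tags": list, "format": str}


def validate_extraction(raw_text: str, attrs: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field not in attrs:
            problems.append(f"missing field: {field}")
        elif not isinstance(attrs[field], expected):
            problems.append(f"wrong type for field: {field}")
    if problems:
        return problems  # skip content checks when the shape is already wrong
    for tag in attrs["dietary_tags"]:
        if str(tag).lower() not in ALLOWED_DIETARY_TAGS:
            problems.append(f"unknown dietary tag: {tag}")
    source = raw_text.lower()
    for ingredient in attrs["key_ingredients"]:
        # Crude grounding check: the ingredient must appear in the raw text.
        if str(ingredient).lower() not in source:
            problems.append(f"ingredient not in source text: {ingredient}")
    return problems


def extraction_fingerprint(model_id: str, prompt_template: str) -> str:
    """Identify the model + prompt pair so drift between runs is traceable."""
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]
    return f"{model_id}:{digest}"


raw = "Barista Oat Milk. Ingredients: oats, water. Vegan."
good = {"product_type": "oat milk", "key_ingredients": ["oats", "water"],
        "dietary_tags": ["vegan"], "format": "carton"}
bad = dict(good, key_ingredients=["oats", "almonds"])

good_problems = validate_extraction(raw, good)
bad_problems = validate_extraction(raw, bad)
fingerprint = extraction_fingerprint("example-model-v1", "Extract fields as JSON.")
print(good_problems, bad_problems, fingerprint)
```

Records that fail validation could be retried or routed to a fallback; storing the fingerprint next to each embedding makes it possible to tell which codebook entries came from which extraction setup.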

Seen in

  • sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale: canonical wiki instance. Gemini Flash extracts structured attributes (product type, key ingredients, dietary tags, format) and strips marketing copy, then Gemma embeds the cleaned representation. The discovery flavor of Instacart's two-flavor codebook design. Quote: "Standardize before you embed. Lightweight attribute extraction is a high-ROI preprocessing step that reduces noise throughout the pipeline."
Last updated · 542 distilled / 1,571 read