
Instacart Intent Engine

The Intent Engine is Instacart's LLM-backed Query Understanding (QU) system, replacing a bespoke multi-model ML stack with a three-lever hierarchy — prompting → context-engineering (RAG) → fine-tuning — applied per QU sub-task. (Source: sources/2025-11-13-instacart-building-the-intent-engine)

Positioning

QU sits upstream of search retrieval + ranking. Its job: turn messy user queries ("bread no gluten", "x large zip lock", "2% reduced-fat ultra-pasteurized chocolate milk") into structured intent signals (categories, rewrites, tags) that downstream retrieval + ranking can use. Legacy QU at Instacart was "notoriously difficult" for long-tail queries and suffered from:

  • Broad queries ("healthy food", "frozen snacks") spanning dozens of categories.
  • No direct feedback. QU is upstream of clicks/conversions; pseudo-labels derived from user behaviour are noisy.
  • Tail queries with data sparsity and no click history.
  • System complexity. Separate FastText classifier + session-mined rewrites + SRL tagger, each with its own data pipeline, training, and serving infra.

The Intent Engine replaces this patchwork with a single LLM substrate applied differently per sub-task, and consolidates the "feature engineering" posture of the legacy stack into a "productionize the backbone" posture. (Source: sources/2025-11-13-instacart-building-the-intent-engine)

The three-lever hierarchy

Instacart's stated ordering, least to most invasive:

  1. Prompting — fast to iterate, the model only sees what's in the prompt.
  2. Context-engineering (RAG) — retrieve Instacart-specific signals (conversion history, catalog, brand-similarity embeddings) and inject them into the prompt.
  3. Fine-tuning — embed domain expertise into the weights (LoRA adapters + adapter merge).

The three-lever ordering is also a cost-control ordering: "To manage costs and prove value, we began with an offline LLM pipeline on high-frequency 'head' queries. This cost-effective approach handled the bulk of traffic and generated the data needed to later train a 'student' model for the long tail." Start offline, graduate to real-time strategically. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
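Lever 2 (context-engineering) amounts to assembling a prompt from retrieved Instacart-specific signals. A minimal sketch of that assembly step, with hypothetical signal names and shapes (the post does not specify the prompt format):

```python
from dataclasses import dataclass

@dataclass
class QueryContext:
    """Retrieved Instacart-specific signals for one query (hypothetical shapes)."""
    top_converting_categories: list[str]   # from conversion history
    catalog_matches: list[str]             # from catalog lookup
    similar_brands: list[str]              # from brand-similarity embeddings

def build_prompt(query: str, ctx: QueryContext) -> str:
    """Lever 2: inject retrieved context into the prompt before calling the LLM."""
    return (
        "You are a grocery query-understanding assistant.\n"
        f"Query: {query}\n"
        f"Historically converting categories: {', '.join(ctx.top_converting_categories)}\n"
        f"Catalog matches: {', '.join(ctx.catalog_matches)}\n"
        f"Similar brands: {', '.join(ctx.similar_brands)}\n"
        "Return the most relevant categories for this query."
    )

prompt = build_prompt(
    "bread no gluten",
    QueryContext(["Gluten-Free Bread", "Bakery"], ["Canyon Bakehouse 7-Grain"], ["Schar"]),
)
```

The point of the sketch is only the escalation logic: if a plain prompt (lever 1) underperforms, richer retrieved context is injected before resorting to fine-tuning (lever 3).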

Three QU sub-tasks rebuilt

1. Query category classification

Legacy: flat multi-class FastText model on noisy conversion labels. Pitfalls: emits taxonomically inconsistent pairs ("Dairy", "Milk" as peers not parent/child), can't reason about novel compositions ("vegan roast").

New, three-step: (a) retrieve top-K historically converted categories as candidates → (b) LLM re-ranks with injected Instacart context → (c) semantic-similarity guardrail discards (query, category) pairs below a relevance threshold. The LLM is a re-ranker over a pre-filtered candidate set, not an open-universe classifier — keeps recall bounded and precision high.
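The three-step flow above can be sketched as a pipeline; the retrieval, re-ranking, and similarity callables are hypothetical stand-ins, not Instacart's actual interfaces:

```python
def classify_categories(query, retrieve_top_k, llm_rerank, similarity,
                        k=20, threshold=0.6):
    """Three-step category classification sketch:
    (a) retrieve top-K historically converted candidate categories,
    (b) LLM re-ranks within that closed candidate set (not open-universe),
    (c) semantic-similarity guardrail drops (query, category) pairs below threshold.
    k and threshold are illustrative values, not disclosed figures."""
    candidates = retrieve_top_k(query, k)                                # (a)
    ranked = llm_rerank(query, candidates)                               # (b)
    return [c for c in ranked if similarity(query, c) >= threshold]      # (c)

# Deterministic stubs standing in for retrieval, the LLM, and an embedding model:
cats = classify_categories(
    "vegan roast",
    retrieve_top_k=lambda q, k: ["Plant-Based Proteins", "Deli Meats", "Dairy"],
    llm_rerank=lambda q, cands: cands,          # identity re-rank for the sketch
    similarity=lambda q, c: 0.9 if "Plant" in c else 0.3,
)
# cats == ["Plant-Based Proteins"]
```

Bounding the LLM to a pre-filtered candidate set is what keeps recall bounded while the guardrail keeps precision high.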

2. Query rewrites

Legacy: session-behavior mining. Coverage only ~50% of search traffic; often emitted synonyms that weren't useful for recall expansion (e.g. "1% milk" → "one percent milk").

New: three specialised prompts per rewrite type — Substitutes, Broader queries, Synonyms — each with chain-of-thought + few-shot exemplars + a post-processing semantic-relevance guardrail. Outcome: >95% coverage at 90%+ precision across all three types. Building on this, Instacart adds session-level context engineering (top-converting product categories from the user's subsequent in-session searches) to make rewrites personalized. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
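The per-type prompt dispatch plus shared guardrail can be sketched as follows; the prompt templates and helper callables are hypothetical, and the real prompts also carry chain-of-thought instructions and few-shot exemplars:

```python
# One specialised prompt per rewrite type (illustrative templates only).
REWRITE_PROMPTS = {
    "substitute": "List substitute products for the grocery query: {q}",
    "broader": "List broader queries for the grocery query: {q}",
    "synonym": "List synonyms for the grocery query: {q}",
}

def generate_rewrites(query, llm, similarity, threshold=0.6):
    """Run each rewrite type through its own prompt, then apply a shared
    post-processing semantic-relevance guardrail. threshold is illustrative."""
    rewrites = {}
    for rtype, template in REWRITE_PROMPTS.items():
        candidates = llm(template.format(q=query))
        rewrites[rtype] = [r for r in candidates
                           if similarity(query, r) >= threshold]
    return rewrites

# Deterministic stubs standing in for the LLM and an embedding model:
out = generate_rewrites(
    "1% milk",
    llm=lambda prompt: ["one percent milk"] if "synonym" in prompt else ["milk"],
    similarity=lambda q, r: 1.0,
)
```

Splitting by rewrite type is what lets each prompt be tuned (and evaluated) independently, rather than one general prompt covering all three.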

3. Semantic Role Labeling (SRL) — the hybrid system

SRL extracts structured concepts from a query (product, brand, attributes) used downstream for retrieval, ranking, ad targeting, and filters. Traffic is power-law — head queries can be precomputed; the tail cannot because it's "effectively infinite." Architectural decomposition:

  ┌─────────────────────────┐       ┌─────────────────────┐
  │ Offline RAG "teacher"   │──────▶│  Head-query cache   │──▶ live traffic (98%)
  │ pipeline                │       └─────────────────────┘
  │ • conversion data       │
  │ • catalog               │       ┌─────────────────────┐
  │ • brand embedding       │──────▶│ Training dataset    │──▶ trains 8B student
  │ • frontier LLM          │       └─────────────────────┘           │
  │ • post-proc guardrail   │                                          ▼
  └─────────────────────────┘                        ┌─────────────────────┐
                                                     │ Real-time Llama-3-8B│──▶ cache-miss traffic (2%)
                                                     │ + LoRA + adapter    │     "~300 ms on H100"
                                                     │ merge, on H100      │
                                                     └─────────────────────┘

Key properties:

  • The offline teacher pipeline is dual-purposed: its output populates both the live cache AND the student's supervised training set. Without dual purpose, you pay twice or ship a student trained on lower-quality labels. See patterns/offline-teacher-online-student-distillation.
  • The student is a LoRA fine-tune of Llama-3-8B. Reported "precision 96.4% vs 95.4% baseline, recall 95.0% vs 96.2%, F1 95.7% vs 95.8%" — parity F1, precision-biased. Deployment posture: precision over recall (a precise tag is more useful than a noisy one for downstream retrieval).
  • Latency path: ~700 ms out of the box on A100 → 300 ms target after LoRA adapter merge + an H100 upgrade. FP8 quantization yielded a further ~10% gain but was not shipped due to a "slight drop in recall." Off-peak GPU autoscaling manages cost.
  • Cache-miss fraction: ~2% of queries hit the real-time student; ~98% served from cache. The 2% is the load-bearing number — the real-time 8B only pays its serving cost for 2% of traffic.

(Source: sources/2025-11-13-instacart-building-the-intent-engine)
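The serving split in the diagram reduces to a cache-first lookup with a student fallback. A minimal sketch, with hypothetical interfaces (the post does not describe the cache or model APIs):

```python
def srl_lookup(query, head_cache, student):
    """Cache-first SRL serving: precomputed head-query tags from the offline
    teacher pipeline cover ~98% of traffic; cache misses (~2%) fall through
    to the real-time LoRA-fine-tuned 8B student."""
    tags = head_cache.get(query)
    if tags is not None:
        return tags, "cache"
    return student(query), "student"

# Deterministic stubs: a dict as the head-query cache, a lambda as the student.
cache = {"2% milk": {"product": "milk", "attribute": "2%"}}
student = lambda q: {"product": q.split()[-1]}   # hypothetical student call
hit = srl_lookup("2% milk", cache, student)
miss = srl_lookup("artisanal oat nog", cache, student)
# hit  -> ({"product": "milk", "attribute": "2%"}, "cache")
# miss -> ({"product": "nog"}, "student")
```

The economics follow directly: the 8B student's serving cost is paid only on the miss path, which is why the ~2% cache-miss fraction is the load-bearing number.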

Outcomes

  • 6% reduction in average scroll depth on tail queries (users find items faster).
  • 50% reduction in user complaints about poor tail-query search results.
  • >95% query-rewrite coverage at 90%+ precision (vs. 50% legacy coverage).
  • 300 ms target hit for real-time SRL on H100.
  • Millions of cold-start queries served weekly through the real-time SRL model.

Relationship to sibling Instacart platforms

  • PIXEL (image generation, 2025-07-17) — Intent Engine shares the prompt-template library + model-agnostic architectural posture (one prompt per use case, beats one general prompt).
  • PARSE (attribute extraction, 2025-08-01) — Intent Engine shares the offline+online hybrid + HITL-style post-processing guardrail stance applied to different data surfaces.
  • Maple (batch LLM processing, 2025-08-27) — Intent Engine's offline teacher pipeline is the kind of batch workload Maple is optimised for (CSV/Parquet in, CSV/Parquet out); PARSE + PIXEL + Intent Engine are three ML-platform-consolidation plays on different data/modality axes.

Caveats

  • No QPS / scale figures for the cache or the real-time 8B. "Millions of cold-start queries weekly" is the only scale disclosure.
  • Frontier teacher LLM is unnamed. The post compares the 8B student against "a much larger frontier model it learned from" without specifying which frontier model — matters for reproducibility.
  • LoRA hyperparameters unspecified (rank, target modules, dataset size, epochs).
  • Distillation here is response distillation — supervised fine-tuning on teacher-generated labels — not soft-label / logit matching in the strict academic (Hinton) sense; the "distillation" terminology is used colloquially.
  • Context-aware QU is future work — not yet shipped. Post pitches distinguishing item-search / content-discovery / restaurant-search intents based on session context.