Instacart Intent Engine¶
The Intent Engine is Instacart's LLM-backed Query Understanding (QU) system, replacing a bespoke multi-model ML stack with a three-lever hierarchy — prompting → context-engineering (RAG) → fine-tuning — applied per QU sub-task. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
Positioning¶
QU sits upstream of search retrieval + ranking. Its job: turn messy user queries ("bread no gluten", "x large zip lock", "2% reduced-fat ultra-pasteurized chocolate milk") into structured intent signals (categories, rewrites, tags) that downstream retrieval + ranking can use. Legacy QU at Instacart was "notoriously difficult" for long-tail queries and suffered from:
- Broad queries ("healthy food", "frozen snacks") spanning dozens of categories.
- No direct feedback. QU is upstream of clicks/conversions; pseudo-labels derived from user behaviour are noisy.
- Tail queries with data sparsity and no click history.
- System complexity. Separate FastText classifier + session-mined rewrites + SRL tagger, each with its own data pipeline, training, and serving infra.
The Intent Engine replaces this patchwork with a single LLM substrate applied differently per sub-task, and consolidates the "feature engineering" posture of the legacy stack into a "productionize the backbone" posture. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
The three-lever hierarchy¶
Instacart's stated ordering, least to most invasive:
- Prompting — fast to iterate, the model only sees what's in the prompt.
- Context-engineering (RAG) — retrieve Instacart-specific signals (conversion history, catalog, brand-similarity embeddings) and inject them into the prompt.
- Fine-tuning — embed domain expertise into the weights (LoRA adapters + adapter merge).
The three-lever ordering is also a cost-control ordering: "To manage costs and prove value, we began with an offline LLM pipeline on high-frequency 'head' queries. This cost-effective approach handled the bulk of traffic and generated the data needed to later train a 'student' model for the long tail." Start offline, graduate to real-time strategically. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
Three QU sub-tasks rebuilt¶
1. Query category classification¶
Legacy: flat multi-class FastText model on noisy conversion labels. Pitfalls: it emits taxonomically inconsistent pairs (e.g. "Dairy" and "Milk" as peers rather than parent/child) and can't reason about novel compositions ("vegan roast").
New, three-step: (a) retrieve top-K historically converted categories as candidates → (b) LLM re-ranks with injected Instacart context → (c) semantic-similarity guardrail discards (query, category) pairs below a relevance threshold. The LLM is a re-ranker over a pre-filtered candidate set, not an open-universe classifier — keeps recall bounded and precision high.
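The three steps can be sketched as a small pipeline. The retrieval, re-rank, and embedding functions are stand-ins (the post names no APIs), and the 0.55 guardrail threshold is an illustrative assumption:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_query(query, retrieve_top_k, llm_rerank, embed, threshold=0.55):
    # (a) Candidate generation: top-K historically converted categories.
    candidates = retrieve_top_k(query, k=20)
    # (b) LLM re-ranks the pre-filtered set -- never the open taxonomy.
    ranked = llm_rerank(query, candidates)
    # (c) Guardrail: discard (query, category) pairs below the threshold.
    q_vec = embed(query)
    return [c for c in ranked if cosine(q_vec, embed(c)) >= threshold]
```

Because the LLM only ever sees the retrieved candidate set, recall stays bounded by step (a), and the guardrail in step (c) protects precision.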
2. Query rewrites¶
Legacy: session-behavior mining. Coverage was only ~50% of search traffic, and it often emitted synonyms useless for recall expansion (e.g. "1% milk" → "one percent milk", which retrieves the same products).
New: three specialised prompts per rewrite type — Substitutes, Broader queries, Synonyms — each with chain-of-thought + few-shot exemplars + a post-processing semantic-relevance guardrail. Outcome: >95% coverage at 90%+ precision across all three types. Building on this, Instacart adds session-level context engineering (top-converting product categories from the user's subsequent in-session searches) to make rewrites personalized. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
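A minimal sketch of the per-type dispatch. The template strings are placeholders (the real prompts carry chain-of-thought instructions and few-shot exemplars), and `call_llm` / `is_relevant` are assumed interfaces, not Instacart's:

```python
# Illustrative stand-ins for the three specialised rewrite prompts.
REWRITE_PROMPTS = {
    "substitute": "Given the query '{query}', suggest products a shopper "
                  "would accept as substitutes. Think step by step.",
    "broader":    "Given the query '{query}', suggest broader queries that "
                  "widen recall without losing intent. Think step by step.",
    "synonym":    "Given the query '{query}', suggest synonyms that retrieve "
                  "different products, not mere spelling variants.",
}

def generate_rewrites(query, call_llm, is_relevant):
    """One prompt per rewrite type, each followed by a relevance guardrail."""
    out = {}
    for kind, template in REWRITE_PROMPTS.items():
        candidates = call_llm(template.format(query=query))
        # Post-processing guardrail: keep only semantically relevant rewrites.
        out[kind] = [c for c in candidates if is_relevant(query, c)]
    return out
```

One specialised prompt per rewrite type mirrors the prompt-template-library posture shared with PIXEL: a narrow prompt per use case beats one general prompt.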
3. Semantic Role Labeling (SRL) — the hybrid system¶
SRL extracts structured concepts from a query (product, brand, attributes) used downstream for retrieval, ranking, ad targeting, and filters. Traffic is power-law — head queries can be precomputed; the tail cannot because it's "effectively infinite." Architectural decomposition:
```
┌──────────────────────────┐      ┌──────────────────────┐
│ Offline RAG "teacher"    │─────▶│ Head-query cache     │──▶ live traffic (~98%)
│ pipeline                 │      └──────────────────────┘
│ • conversion data        │
│ • catalog                │      ┌──────────────────────┐
│ • brand embeddings       │─────▶│ Training dataset     │──▶ trains 8B student
│ • frontier LLM           │      └──────────────────────┘         │
│ • post-proc guardrail    │                                       ▼
└──────────────────────────┘      ┌──────────────────────┐
                                  │ Real-time Llama-3-8B │──▶ cache-miss traffic (~2%),
                                  │ + LoRA adapter merge │    "~300 ms on H100"
                                  │ on H100              │
                                  └──────────────────────┘
```
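The routing in the diagram reduces to a cache lookup with a model fallback. A minimal sketch, with a toy cache and a stub standing in for the merged 8B student:

```python
def srl_tags(query, cache, student_model):
    """Serve SRL tags: head queries from cache, tail queries from the student."""
    hit = cache.get(query)
    if hit is not None:              # ~98% of traffic ends here
        return hit
    return student_model(query)      # ~2% cache-miss path, ~300 ms budget

# Toy stand-ins: the real cache is populated by the offline teacher pipeline,
# and the real student is the LoRA-merged Llama-3-8B serving on H100.
head_cache = {"2% milk": {"product": "milk", "attributes": ["2%"]}}
stub_student = lambda q: {"product": q.split()[-1], "attributes": []}
```

The dictionary and the tag schema are illustrative; the point is that the expensive model sits only on the miss path.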
Key properties:
- The offline teacher pipeline is dual-purposed: its output populates both the live cache AND the student's supervised training set. Without dual purpose, you pay twice or ship a student trained on lower-quality labels. See patterns/offline-teacher-online-student-distillation.
- The student is a LoRA fine-tune of Llama-3-8B. Reported "precision 96.4% vs 95.4% baseline, recall 95.0% vs 96.2%, F1 95.7% vs 95.8%" — parity F1, precision-biased. Deployment posture: precision over recall (a precise tag is more useful than a noisy one for downstream retrieval).
- Latency path: ~700 ms out of the box on A100 → 300 ms target after LoRA adapter merge + H100 upgrade. FP8 quantization gave a further ~10% speedup but was not shipped due to a "slight drop in recall." Off-peak GPU autoscaling keeps serving cost down.
- Cache-miss fraction: ~2% of queries hit the real-time student; ~98% served from cache. The 2% is the load-bearing number — the real-time 8B only pays its serving cost for 2% of traffic.
(Source: sources/2025-11-13-instacart-building-the-intent-engine)
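The adapter-merge latency win in the list above follows from LoRA arithmetic: an unmerged adapter adds an extra low-rank matmul per adapted layer at serving time, while merging folds `B @ A` into the base weight once, offline. A sketch of the identity, using the standard LoRA formulation W' = W + (α/r)·BA; all shapes and values are illustrative, not Instacart's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16            # hidden size, LoRA rank, LoRA alpha

W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trained LoRA down-projection
B = rng.standard_normal((d, r))   # trained LoRA up-projection
x = rng.standard_normal(d)

# Unmerged serving path: base matmul plus an extra adapter matmul per layer.
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Merged serving path: one matmul at base-model speed.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # identical outputs, fewer matmuls
```

The outputs are identical, so merging trades nothing in quality for the removal of the adapter's per-token overhead.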
Outcomes¶
- 6% reduction in average scroll depth on tail queries (users find items faster).
- 50% reduction in user complaints about poor tail-query search results.
- >95% query-rewrite coverage at 90%+ precision (vs. 50% legacy coverage).
- 300 ms target hit for real-time SRL on H100.
- Millions of cold-start queries served weekly through the real-time SRL model.
Relationship to sibling Instacart platforms¶
- PIXEL (image generation, 2025-07-17) — Intent Engine shares the prompt-template library + model-agnostic architectural posture (one prompt per use case, beats one general prompt).
- PARSE (attribute extraction, 2025-08-01) — Intent Engine shares the offline+online hybrid + HITL-style post-processing guardrail stance applied to different data surfaces.
- Maple (batch LLM processing, 2025-08-27) — Intent Engine's offline teacher pipeline is the kind of batch workload Maple is optimised for (CSV/Parquet in, CSV/Parquet out); PARSE + PIXEL + Intent Engine are three ML-platform-consolidation plays on different data/modality axes.
Caveats¶
- No QPS / scale figures for the cache or the real-time 8B. "Millions of cold-start queries weekly" is the only scale disclosure.
- Frontier teacher LLM is unnamed. The post compares the 8B student against "a much larger frontier model it learned from" without specifying which frontier model; this matters for reproducibility.
- LoRA hyperparameters unspecified (rank, target modules, dataset size, epochs).
- Distillation is response-distillation — supervised fine-tuning on teacher labels — not soft-label / logit-matching in the strict academic Hinton sense. Terminology match is colloquial.
- Context-aware QU is future work — not yet shipped. Post pitches distinguishing item-search / content-discovery / restaurant-search intents based on session context.
Related¶
- systems/instacart-pixel / systems/instacart-parse / systems/maple-instacart — sibling Instacart ML-platform consolidations
- systems/llama-3-1 — base model family for the fine-tuned 8B student
- systems/nvidia-h100 / systems/nvidia-a100 — serving hardware
- concepts/query-understanding / concepts/semantic-role-labeling / concepts/long-tail-query
- concepts/context-engineering / concepts/lora-low-rank-adaptation / concepts/adapter-merging
- concepts/knowledge-distillation / concepts/quantization
- patterns/head-cache-plus-tail-finetuned-model / patterns/offline-teacher-online-student-distillation / patterns/teacher-student-model-compression / patterns/prompt-template-library
- companies/instacart