Instacart

Instacart Engineering is a Tier-2 source on the sysdesign-wiki. Instacart is a US-based grocery delivery + pick-up platform; their engineering blog covers ML-for-catalog, search + recommendations, generative-AI applications to grocery imagery, ads platform, and customer-support automation.

Key systems

systems/instacart-generative-ads-retrieval + systems/instacart-semantic-ids + systems/instacart-contextual-recommendations + systems/instacart-griffin-2 — Generative ads retrieval (2026-06-02): Instacart rebuilt the candidate-generation stage of Carrot Ads on browse surfaces (retailer home page + pre-checkout) by replacing a BERT-based scoring model (Contextual Recommendations / CR — "two years ago, we introduced … a BERT-based sequence model powering retrieval for both ads and organic recommendations across all major browse surfaces") with a generative model that spells out the next product token-by-token via beam search. The new vocabulary is Semantic IDs (SIDs): short codeword sequences (e.g. 35_7_120_184) generated by an RQ-VAE where semantically similar products share prefixes — see concepts/atomic-product-id-vs-semantic-id for the substrate trade-off. The architecture follows TIGER (Google DeepMind, NeurIPS 2023) — recently shipped in production by Spotify (GLIDE/NEO) and YouTube (PLUM) — adapted to grocery's distinctive "shopping list spans fresh food to cleaning supplies and pet care all within a single session" shape via a three-segment prompt template (retailer-type token + user-history SIDs + cart SIDs). At serve time, beam search produces several distinct SID sequences; the retailer-partitioned index maps each generated SID to available, attributed ad products. The substrate change is the architecturally load-bearing piece: Semantic IDs cover every catalog item from day 1 (dissolving cold-start), share prefixes for hierarchical retrieval discipline (fixing flat-distribution outlier leakage), and reduce the embedding parameter space by 125× (escaping the vocabulary bottleneck).
Operational outcomes: ~2× more candidates per request at 10–17% lower mean latency; +5% CTR, +34% add-to-carts (the post calls this a "step-function increase"); 2.7× more brands, 1.8× more sub-categories retrieved; category-conditional diversity lifts +421% Alcohol / +396% Beverages / +229% Healthcare. Serving substrate is a brand-new GPU stack: TensorRT-LLM + NVIDIA Triton Inference Server + Go-native service shell hosted on Griffin 2.0, replacing the legacy "Python and CPU inference" stack — see patterns/gpu-serving-stack-tensorrt-llm-triton + patterns/go-native-ml-serving. Two runtime-tunable diversity dials (beam width + temperature) let the same model serve multiple surfaces with different precision-vs-exploration trade-offs without retraining. Three structural ceilings of the prior CR scoring model are dissolved by the generative substrate: vocabulary bottleneck (codebook-bounded vs catalog-bounded), cold-start hurdle (codebook coverage from day 1), structural drift (autoregressive prefix conditioning enforces semantic neighbourhood). Seventh Instacart ML-platform story on the wiki (alongside PIXEL / PARSE / Maple / LACE / Generative Recommendations / Carrot Ads); it extends the pattern-graph into the generative-retrieval axis, distinct from the scoring-with-DAL axis (systems/instacart-carrot-ads-pctr-model) on which Carrot Ads' ranker still runs over this candidate generator's output.
Canonical wiki instance of concepts/generative-retrieval + concepts/semantic-id + concepts/atomic-product-id-vs-semantic-id + concepts/vocabulary-bottleneck + concepts/beam-search-retrieval + concepts/retailer-partitioned-index + concepts/diversity-via-beam-and-temperature + patterns/generative-over-scoring-retrieval + patterns/rq-vae-codebook-as-product-vocabulary + patterns/context-template-prompt-with-special-tokens + patterns/beam-search-with-retailer-partitioned-mapping + patterns/gpu-serving-stack-tensorrt-llm-triton + patterns/go-native-ml-serving; sibling architectural alternative to Meta SilverTorch's Index-as-Model paradigm — both posts deeply rethink the scoring retrieval shape; SilverTorch keeps two-tower asymmetric pre-compute but absorbs the ANN index into the model graph as a tensor; Instacart abandons two-tower / ANN entirely and replaces it with autoregressive generation.
systems/instacart-carrot-ads + systems/instacart-carrot-ads-pctr-model — Carrot Ads (2026-05-04): Instacart's omnichannel retail-media platform letting retailer partners run their own ad businesses on either their owned-and-operated (O&O) sites/apps or on Instacart-hosted whitelabel Storefront properties. Demand pool spans retailer-sourced advertisers + Instacart-sourced demand from 7,500+ CPG brands. The auction is real-time; ranking is driven by a Wide-and-Deep pCTR model. Each new partner triggers a new-partner cold-start problem solved by Domain Adaptive Learning — a subset of transfer learning — with two simultaneous adaptation layers: (1) neural-network level (shared shopping-context-pre-trained embedding layers, feature transfer wide ⇄ explicit + deep ⇄ pre-trained dense, selective fine-tuning, generalization via reuse — see patterns/cross-domain-warm-start-via-shared-embeddings) and (2) training-data level (Marketplace-as-source corpus, catalog taxonomy alignment, per-partner feature trimming for real-time auction latency). Counter-intuitive disclosed property: DAL outperforms from-scratch training even when the target partner has enough data — "because of the benefits from Instacart's first party data". Reported lift across search ads + product category ads (CTR / clicks-per-user / ads revenue, no specific numbers disclosed). Gating risk: negative transfer, guarded today by HITL schema mapping + alignment verification; an automated Domain Adaptation Platform for domain-shift detection is planned. Sixth Instacart ML-platform story on the wiki (alongside PIXEL / PARSE / Maple / LACE / Generative Recommendations) — same "platformise a specific ML capability + ride proven backbones" arc, applied to the retail-media vertical. Canonical wiki instance of concepts/domain-adaptive-learning + concepts/transfer-learning + concepts/source-and-target-domain + concepts/wide-and-deep-architecture + concepts/negative-transfer + concepts/feature-taxonomy-alignment + patterns/cross-domain-warm-start-via-shared-embeddings + patterns/per-partner-feature-trimming-for-auction-latency; extends concepts/cold-start with the new-domain / new-partner sub-case.
systems/instacart-generative-recommendations-platform + systems/instacart-shopping-hub — Generative recommendations platform (2026-02-26): early-stage rebuild of Instacart's Shopping Hub discovery surface on a new AI-native content-generation platform. Four-phase top-down cascaded pipeline — (1) page design + theme generation via LLM with constrained decoding against a structured schema (emits ordered themed placements + derived signals: user personas + freeform product concepts); (2) retrieval keyword generation via teacher-student fine-tune across Llama + Qwen families with LoRA plus RAG candidate pruning (~100 nearest neighbours from a 300,000-term keyword corpus → 15–20% all-in generation cost reduction); (3) quality + diversity filtering — embedding-similarity dedup + multi-level LLM-as-judge (page / placement / product) + fine-tuned DeBERTa cross-encoder classifying theme-product relevance at >99% cost reduction vs LLM inference (the economic unlock that lets evaluation become action at full-catalog scale) + business/policy guardrails; (4) existing product + placement ranking infra, unchanged. Three named tenets — delightful personalization, cross-placement cohesion, adaptability. Instacart explicitly benchmarked top-down vs bottom-up generation and picked top-down on all three tenets. Load-bearing architectural insight: cascaded decomposition is a cost + quality move, not a modelling move — the cascade opens the door to RAG + teacher-student + cross-encoder filtering that a single-step monolithic generator can't use. Three-prong eval framework (multi-level LLM-as-judge + fine-tuned DeBERTa at scale + classical ML/metric evaluators). Same "platformise generative AI + keep existing mature ranking infra" stance as PIXEL / PARSE / Maple. Fifth Instacart ML-platform story on the wiki — extends the pattern-graph into discovery content generation; early-journey framing, no production A/B outcomes yet disclosed. Canonical wiki instance of concepts/generative-recommendations + patterns/top-down-cascaded-page-generation + patterns/rag-candidate-pruning-cascade + patterns/fine-tuned-cross-encoder-as-filter + patterns/llm-as-judge-multi-level-rubric.
systems/lace-instacart — LACE (LLM-Assisted Chatbot Evaluation) (2025-06-11): Instacart's internal offline-evaluation framework for the customer-support chatbot. Scores every evaluated chat session against a binary True/False rubric across five dimensions (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) with per-criterion rationale retention. Benchmarked three engines — direct prompting / agentic reflection / agentic debate — and picked debate (Customer + Support + Judge sub-agents; Customer and Support parallel with no cross-talk; Judge synthesises) as the production engine. "Near-perfect accuracy" on simple Compliance criteria; >90% accuracy on context-dependent criteria with Instacart-specific knowledge embedded in a static prompt template (RAG-retrieval at evaluator named as future work). Subjective criteria kept only as directional-check regressions — "a low-ROI path" to refine. Two load-bearing implementation decisions: (a) decouple reasoning from structured output — strong reasoner (o1-preview at writing time) emits free-form rationale → separate step emits JSON — escapes restricted-decoding quality loss; (b) human-LACE alignment loop bootstraps the rubric + regression-tests every update; two-lever hierarchy (criteria-prompt refinement frequent; criteria-structure redesign rare). Production deployment uses stratified sampling by topic distribution → dashboards → direct feedback into Instacart's experimentation platform. Seventh Instacart source on the wiki — first wiki instance of chatbot evaluation infrastructure at Instacart, extending the "stop every team from DIY'ing this" platform-consolidation thesis from image generation (PIXEL) / structured extraction (PARSE) / batch LLM (Maple) / query-understanding (Intent Engine) / mobile-UI migration (Caper) / ML-training data (Capsight) into the LLM-quality-measurement axis.
Shares architectural DNA with the earlier VLM-as-judge loop in PIXEL (Shishir Kumar Prasad is a contributor on both) and the LLM-as-judge quality screening in PARSE. Parallel-play sibling to Lyft LLM-as-judge localisation and Zalando search AI-as-judge — three Tier-2/3 companies shipping the same LLM-as-judge → dashboard → experimentation architecture to different customer-facing surfaces at the same time.
systems/capsight + Caper — Capsight edge-to-cloud data flywheel (2026-02-17): Instacart's ML data platform for the Caper smart-cart fleet. Three components — Collector (on-device agent, trigger-based capture on activity signal + recognised barcode, dedicated hardware video encoder for zero AI-task regression, resilient uploader with storage-threshold pause + auto-cleanup to protect the retailer store network), Depot (cloud ingestion + indexing + searchable web UI + VLM + teacher-model pre-labelling with human correction rather than human-from-scratch annotation), and Learner (Ray-based distributed training + automated evaluation gate). Outcomes vs pre-Capsight baseline: >70% annotation cost reduction; multi-day labelling tasks → hours; model training stage 1 week → 2 days; end-to-end iteration cycle 1 month → 1 week; >5% model accuracy improvement within weeks of deployment. Canonical wiki instance of concepts/edge-cloud-data-flywheel + concepts/production-data-diversity + patterns/distributed-fleet-as-data-pipeline + patterns/trigger-based-edge-capture + patterns/vlm-assisted-pre-labeling + patterns/resilient-edge-uploader. First wiki instance of edge-fleet-as-ML-data-pipeline at Instacart (sixth platform-consolidation play after PIXEL, PARSE, Maple, Intent Engine, AI Gateway / Cost Tracker — but this one sits upstream of training rather than at serving time, and crosses the edge / cloud boundary).
systems/instacart-flyer-digitization-pipeline + systems/segment-anything-model-sam — Flyer digitization pipeline (2026-02-09): the internal computer-vision + LLM system converting retailer-supplied weekly grocery flyer images into tap-to-shop interactive tiles on the Instacart app. Replaces a manual bounding-box-and-match workflow (3–4 hours per flyer, "hundreds of hours each week" across retailers) with a two-phase pipeline (<30 minutes end-to-end). Phase 1 — Image Segmentation: hybrid detector choice tiered by flyer complexity — simple flyers use iterative-grid multimodal-LLM probing (~90% accuracy); complex flyers use SAM as base detector with four post-processing stages: (1) text-box removal, (2) Weighted Boxes Fusion to merge overlapping boxes (explicitly rejecting NMS as "may discard valuable information"), (3) model ensembling with classical contour detection gated per-retailer on flyer density, (4) heuristic + ML filters on aspect ratio + size. Phase 2 — Product Identification: OCR + LLM + internal catalog search to match each box to a SKU; the captured source text is truncated, so Phase-2 details are available at component level only. Named failure mode: FoodSAM (food-specific SAM variant) "fell short of addressing the breadth and variety of products featured in retail flyers." Canonical wiki instances of patterns/hybrid-cv-plus-llm-pipeline + patterns/complexity-tiered-model-selection. Fourth Instacart visual-ML system on the wiki alongside PIXEL (generation), PARSE (attribute extraction), and the Caper mobile-UI migration — same "decompose the problem, match model to sub-task, route by complexity" engineering stance.
systems/jetpack-compose + systems/android-fragment + systems/paparazzi — Caper smart-cart Android migration (2026-02-03): Instacart's in-store scan-and-pay smart cart (stability-critical hardware — "a crash can lead to cart abandonment") Android app migrated from Fragments + XML layouts to Jetpack Compose in a four-phase plan: Phase 1 (implicit Fragment hosts via Google's navigation-fragment-compose, manual, seeds pattern knowledge); Phase 2 (type-safe Kotlin-DSL navigation, 30+ sub-graphs / 130+ destinations, iterative AI workflow, 5–7× speed increase, 300–350 engineering hours saved); Phase 3 (Fragments → Compose screens, 100+ features, 17-step AI skill with Paparazzi visual-parity engineer-verification checkpoints, progressive-disclosure context-window discipline); Phase 4 (Compose Navigation, in progress, feature-flagged dual-system rollout). Load-bearing architectural pattern: outer-parameterless / inner-testable Composable split (MyFeatureScreen() binds DI + nav, MyFeatureScreenInternal(...) is pure Compose with callbacks) established in Phase 1 makes the Phase-4 Compose-Navigation migration cheap. First Instacart mobile-platform source on the wiki + first wiki instance of AI-skill-driven Android UI framework migration. Canonical instances of patterns/phased-framework-migration + patterns/ai-migration-skill-workflow + patterns/visual-parity-screenshot-gate; concepts/ai-assisted-refactoring-economics + concepts/ai-instructions-as-code.
systems/instacart-intent-engine — Intent Engine (2025-11-13): Instacart's LLM-backed query-understanding system replacing a bespoke multi-model legacy stack (FastText classifier + session-mined rewrites + separate SRL). Three-lever adaptation hierarchy stated explicitly: prompting → context-engineering (RAG) → fine-tuning. Three QU sub-tasks rebuilt: (i) query category classification (retrieve top-K converted categories → LLM re-ranks with context → semantic-similarity guardrail filters); (ii) query rewrites with three specialised prompts — Substitutes / Broader / Synonyms — each with chain-of-thought + few-shot (>95% coverage at 90%+ precision, up from 50% legacy coverage); (iii) SRL via the load-bearing hybrid cache + real-time fine-tuned model pattern. SRL stack: offline RAG "teacher" pipeline (conversion history + catalog + brand-embedding similarity + frontier LLM) dual-purposed to populate a head cache AND train a Llama-3-8B + LoRA student; student is adapter-merged and served on H100 at ~300 ms (from ~700 ms out-of-box on A100). FP8 quantization gave another 10% but was not shipped due to a slight recall regression. Cache-miss fraction: ~2% of queries. Production outcomes: 6% reduction in average scroll depth on tail queries, 50% reduction in user complaints on tail-query search quality, millions of cold-start queries served weekly. Named strategic argument: "A generic LLM is a commodity; your business context is what makes your application defensible." Canonical wiki instances of patterns/head-cache-plus-tail-finetuned-model + patterns/offline-teacher-online-student-distillation. Third Instacart platform-consolidation play after PIXEL (image generation) and PARSE (attribute extraction).
systems/maple-instacart — Maple (2025-08-27): Instacart's internal batch-LLM processing service. CSV/Parquet in, CSV/Parquet out RPC; hides the LLM provider's 50K-prompt / 200 MB / 24 h batch API behind a single interface. Stack: Python + PyArrow + orjson + Temporal for durable execution + S3 + Parquet (claimed 25× vs CSV). Proxies through the internal AI Gateway which integrates with Cost Tracker for per-team attribution. Scales to 10M+ prompt jobs; reports ~50% cost reduction vs real-time calls and "hundreds of thousands of dollars per year to just thousands of dollars per year" on specific processes. Four-class failure taxonomy (expired / rate-limited / refused / invalid-image) with per-class retry policies (patterns/infinite-retry-by-failure-class). Extends the same CSV interface to real-time-only providers via patterns/batch-then-real-time-fallback. Canonical patterns/llm-batch-processing-service.
systems/instacart-ai-gateway — AI Gateway (2025-08-27): internal provider-abstraction + cost-tracking layer that every LLM call from Maple / PIXEL / PARSE flows through. Canonical internal-gateway instance of patterns/ai-gateway-provider-abstraction.
systems/instacart-cost-tracker — Cost Tracker (2025-08-27): per-team LLM usage/spend accounting, integrated into AI Gateway.
systems/instacart-pixel — PIXEL (2025-07-17): Instacart's unified internal image-generation platform. Single RPC service fronting a catalog of image-generation models; five architectural components (unified parameter protocol + few-shot prompt template library + DreamBooth fine-tunes on Stable Diffusion per product-category + VLM-based iterative quality evaluation + S3-plus-Snowflake infra). Reported outcomes: 10× team time-to-image reduction; 20% → 85% human-judge approval rate via the VLM evaluation loop; >25% reduction in Butcher Cuts add-to-cart time; 15% uplift in Lifestyle Imagery personalised-carousel cart conversion.
systems/instacart-parse — PARSE (2025-08-01): Product Attribute Recognition System for E-commerce. Self-serve, multi-modal LLM platform for structured catalog-attribute extraction. Four components: declarative + versioned Platform UI (attribute name / type / description / prompt template / few-shot examples / input-data SQL / LLM choice) → ML extraction endpoint emitting extracted-value + confidence score via entailment-prompt self-verification → Quality Screening with dev/prod modes (LLM-as-judge + human auditors + low-confidence HITL routing) → catalog ingestion. Reported outcomes: organic attribute 1 day (PARSE) vs. 1 week (traditional) at 95% accuracy; complex low_sugar iteration down to 3 days; multi-modal LLM +10% recall over text-only on sheet_count; -70% cost for cheap LLM on simple attributes / -60% accuracy for cheap LLM on hard attributes — motivating per-attribute model choice. Shares architectural DNA with PIXEL (self-serve, model-agnostic, LLM-evaluator-in-the-loop).
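For the generative ads retrieval entry (2026-06-02) earlier in this section, the serve-time step in which a retailer-partitioned index maps each generated SID to available ad products can be sketched roughly as below. This is a minimal illustration, not Instacart's implementation: the index shape, the `map_beams_to_ads` name, the product ids, and the beam scores are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical index shape: retailer_id -> Semantic ID string -> ad product ids.
RetailerIndex = Dict[str, Dict[str, List[str]]]


def map_beams_to_ads(
    beams: Sequence[Tuple[str, float]],
    retailer_id: str,
    index: RetailerIndex,
    limit: int = 10,
) -> List[str]:
    """Resolve beam-search SIDs to ad products inside one retailer's partition."""
    partition = index.get(retailer_id, {})
    seen = set()
    candidates: List[str] = []
    # Visit beams from highest to lowest score so better SIDs contribute first.
    for sid, _score in sorted(beams, key=lambda b: b[1], reverse=True):
        for product_id in partition.get(sid, []):
            if product_id not in seen:  # the same ad may sit under several SIDs
                seen.add(product_id)
                candidates.append(product_id)
                if len(candidates) == limit:
                    return candidates
    return candidates


# Illustrative data only (assumed, not from the post).
index = {
    "retailer_a": {"35_7_120_184": ["ad_1", "ad_2"], "35_7_120_9": ["ad_2", "ad_3"]},
    "retailer_b": {"35_7_120_184": ["ad_9"]},
}
beams = [("35_7_120_9", -1.2), ("35_7_120_184", -0.4), ("1_2_3_4", -2.0)]
result = map_beams_to_ads(beams, "retailer_a", index)
```

Because the lookup happens per retailer, a generated SID with no available ad at that retailer simply contributes nothing, which is one plausible reading of why beam search is asked for several distinct sequences.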

Key patterns / concepts

Chatbot / LLM evaluation (2025-06-11 LACE post)

patterns/multi-agent-debate-evaluation — canonical wiki instance at Instacart LACE: three-agent Customer + Support + Judge structure, parallel critics with no cross-talk, Judge synthesises. Wins for context-dependent and simple Compliance criteria. Cites Du et al. 2023 (arXiv:2305.14325).
patterns/self-reflection-llm-evaluation — the weaker alternative LACE benchmarked; single agent scores then reflects on its own verdict. Cites Madaan et al. 2022 + Jang 2023 + Madaan et al. 2024.
patterns/human-aligned-criteria-refinement-loop — canonical wiki instance at Instacart LACE: bootstrap + regression-test pattern for LLM-as-judge rubrics with two-lever hierarchy (criteria-prompt refinement frequent, criteria-structure redesign rare).
concepts/binary-vs-graded-llm-scoring — canonical wiki instance: Instacart explicitly benchmarked binary vs. 1–10 scoring and chose binary for consistency + prompt-engineering cost + human-judgment alignment.
concepts/decouple-reasoning-from-structured-output — canonical wiki instance: two-pass design with o1-preview for free-form reasoning + cheaper step / parser for JSON; motivated by restricted-decoding quality loss.
concepts/llm-evaluation-dimensions — canonical wiki instance of chatbot-specific five-dimension rubric (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance) with three-tier complexity grouping (simple / context-dependent / subjective).
concepts/human-llm-evaluation-alignment — canonical wiki instance of calibrating a judge to human ratings via iterative rubric refinement.
concepts/stratified-topic-sampling — canonical wiki instance: LACE's production sampling strategy for long-tailed support traffic.
concepts/co-star-prompt-framework — canonical wiki instance: all LACE evaluator prompts authored in Markdown with CO-STAR sectioning; prompt formatting treated as a measured first-order quality lever.
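The stratified-topic-sampling entry above can be illustrated with a small sketch. It assumes proportional allocation with a floor of one session per topic, which the LACE post does not specify; `stratified_sample`, the session ids, and the topic names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_sample(
    sessions: List[Tuple[str, str]],  # (session_id, topic) pairs
    sample_size: int,
    seed: int = 0,
) -> List[str]:
    """Sample session ids so that each topic keeps roughly its traffic share."""
    if not sessions:
        return []
    rng = random.Random(seed)
    by_topic: Dict[str, List[str]] = defaultdict(list)
    for session_id, topic in sessions:
        by_topic[topic].append(session_id)

    total = len(sessions)
    picked: List[str] = []
    for topic in sorted(by_topic):  # sorted for reproducible output
        ids = by_topic[topic]
        # Assumed rule: proportional quota, but never drop a rare topic entirely.
        quota = max(1, round(sample_size * len(ids) / total))
        picked.extend(rng.sample(ids, min(quota, len(ids))))
    return picked
```

Rounding and the one-per-topic floor mean the returned sample can differ slightly from `sample_size`; the floor is what keeps rare topics visible in long-tailed support traffic.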

Flyer digitization (2026-02-09 flyer-digitization post)

patterns/hybrid-cv-plus-llm-pipeline — canonical wiki instance at Instacart: Phase 1 (purpose-trained segmentation + CV post-processing) decomposed from Phase 2 (OCR + LLM + catalog search). Localization separated from identification; each phase uses the best-in-class tool for its sub-problem.
patterns/complexity-tiered-model-selection — canonical wiki instance at Instacart: route simple flyers to iterative-grid multimodal-LLM probing (~90% accuracy); route complex flyers to the SAM + post-processing stack. Per-retailer density gating of the contour-detection ensemble branch is a second instance of the same pattern.
concepts/weighted-boxes-fusion — the Phase-1 box-merge technique; confidence-weighted coordinate averaging chosen over NMS because NMS discards lower-confidence information. Cited prior art: +3–10% mAP in medical-imaging ensembles.
concepts/non-maximum-suppression — the classical alternative Instacart explicitly rejected in favour of WBF for the reasons above.
concepts/model-ensembling-for-detection — Phase-1 ensemble of SAM-style segmentation with classical contour detection; contour branch gated per retailer based on flyer density — a dynamic, input-conditioned ensemble rather than always running every branch.
concepts/iterative-coordinate-grid-probing — the simple-flyer detector: overlay uniform grid → ask VLM for the first box's starting cell → subdivide → recurse. Works purely via prompting + image manipulation on an off-the-shelf VLM; ~90% accuracy on simple flyers, fails on complex ones.
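The confidence-weighted coordinate averaging named in the weighted-boxes-fusion entry above can be sketched as follows. This is a simplified single-class version written for illustration, not the flyer pipeline's code: the 0.55 IoU threshold is an assumed default, confidences are assumed to be positive, and the fused confidence is taken as a plain mean of the cluster.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, confidence


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster: List[Box]) -> Box:
    """Average coordinates weighted by confidence; every box contributes."""
    total = sum(b[4] for b in cluster)
    coords = tuple(sum(b[i] * b[4] for b in cluster) / total for i in range(4))
    return (*coords, total / len(cluster))


def weighted_boxes_fusion(boxes: List[Box], iou_threshold: float = 0.55) -> List[Box]:
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        for i, current in enumerate(fused):
            if iou(box, current) >= iou_threshold:
                clusters[i].append(box)
                fused[i] = fuse(clusters[i])  # re-average with the new member
                break
        else:
            clusters.append([box])  # no overlap found: start a new cluster
            fused.append(box)
    return fused
```

Where NMS would keep only the highest-confidence box of an overlapping group, here the lower-confidence boxes still pull the fused coordinates toward themselves in proportion to their confidence.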

AI-assisted mobile UI migration (2026-02-03 Jetpack Compose post)

patterns/phased-framework-migration — canonical wiki instance at Instacart Caper: four orthogonal phases (implicit Fragment hosts → type-safe nav → Fragment→Compose → Compose Navigation) each validated in production before the next; per-phase AI-involvement level calibrated to novelty + risk.
patterns/ai-migration-skill-workflow — canonical wiki instance at Instacart Caper Phase 3: 17-step AI skill in four stages (Analysis+Baselining / Compose Implementation / Verification+Integration / Cleanup) with engineer verification checkpoints; formalised from an earlier 325+ line markdown migration guide after 5–6 prior migrations.
patterns/visual-parity-screenshot-gate — canonical wiki instance: Paparazzi JVM-side screenshot baseline of the pre-migration Fragment informs the AI's Compose implementation; post-migration Paparazzi screenshot is diffed and engineer-reviewed; cleanup is gated on pixel-parity sign-off.
concepts/ai-assisted-refactoring-economics — canonical wiki instance: 5–7× speed increase, 300–350 engineering hours saved on Phase 2 alone; the thesis is that "the economics of technical debt have changed" and previously-deprioritized mechanical migrations are now feasible.
concepts/ai-instructions-as-code — canonical wiki instance: 325+ line migration guide "effectively a program that the AI executes," triple-duty (AI executes, humans checklist, reviewers verify), iterated like code over 5–6 migrations, formalised into a structured Agent Skill for progressive disclosure.
patterns/migration-as-agent-skill — cross-vendor extension: Cloudflare/vinext (2026-02-24) is the web-framework sibling, Instacart Caper (2026-02-03) is the Android-UI-framework sibling — same architectural shape applied to a different platform's framework migration.

Query understanding / Intent Engine (2025-11-13 Intent-Engine post)

patterns/head-cache-plus-tail-finetuned-model — canonical instance at Instacart: ~98% of queries served from a pre-computed head cache; ~2% (the tail) routed to a fine-tuned Llama-3-8B real-time student. The 98/2 split is the load-bearing economic number.
patterns/offline-teacher-online-student-distillation — the training-architecture counterpart. Instacart's offline RAG pipeline is dual-purposed — its output populates the live head cache and becomes the student's supervised training set. No duplicate pipeline cost.
patterns/teacher-student-model-compression — the more general shape; Intent Engine SRL is the LLM-serving instance complementing the prior on-device-CV instance (YouTube effects).
concepts/query-understanding — the parent concept; QU's three sub-tasks (classification, rewrites, SRL) are all rebuilt in the post.
concepts/semantic-role-labeling — the load-bearing QU sub-task where the hybrid-cache architecture lives.
concepts/long-tail-query — the traffic shape forcing the hybrid architecture; Instacart reports a 50% reduction in user complaints about search quality on tail queries (the ~2% of queries that miss the head cache).
concepts/context-engineering — extends the existing wiki framing (Fly.io / Dropbox / Datadog) into the retrieval-relevance axis: the post gives three concrete Instacart data streams injected into the teacher prompt (top converted brand + top converted categories + product-catalog brand embeddings) + a post-generation guardrail. "Context is the defensible moat."
concepts/lora-low-rank-adaptation — the fine-tuning mechanism for the Llama-3-8B student.
concepts/adapter-merging — the load-bearing latency move that got the student to 300 ms alongside an H100 upgrade.
concepts/knowledge-distillation — the academic framing; Instacart uses response distillation (supervised fine-tuning on teacher outputs) rather than soft-label Hinton-style distillation.
concepts/quantization — FP8 evaluated, rejected due to recall regression: canonical instance of latency-vs-quality trade-off resolved in favour of quality.
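The head-cache-plus-tail-finetuned-model routing described in this section can be sketched as a cache lookup with a model fallback. This is a minimal sketch under stated assumptions: `make_tagger`, the whitespace and lowercase normalisation, and the tag dictionaries are hypothetical, and `tail_model` stands in for the real-time fine-tuned student.

```python
from typing import Callable, Dict, Tuple


def make_tagger(
    head_cache: Dict[str, dict],
    tail_model: Callable[[str], dict],
) -> Tuple[Callable[[str], dict], Dict[str, int]]:
    """Build a tagger that serves head queries from cache and tail queries from a model."""
    stats = {"cache_hits": 0, "model_calls": 0}

    def tag(query: str) -> dict:
        # Assumed normalisation so trivially different spellings share a cache key.
        key = " ".join(query.lower().split())
        cached = head_cache.get(key)
        if cached is not None:
            stats["cache_hits"] += 1
            return cached
        # Cache miss: fall through to the real-time student model.
        stats["model_calls"] += 1
        return tail_model(key)

    return tag, stats
```

The `stats` counters make the cache-miss fraction observable, which matters because the economics of the pattern depend on the tail staying small.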

Batch LLM processing (2025-08-27 Maple post)

patterns/llm-batch-processing-service — canonical instance at Instacart: one platform fronting the LLM provider's batch API with CSV/Parquet-in, CSV/Parquet-out interface; Temporal-backed durable workflow; S3-Parquet intermediate storage; per-team cost accounting via AI Gateway.
patterns/batch-then-real-time-fallback — unified-interface extension to providers without batch APIs; auto-parallelisation + exponential backoff behind the same CSV interface.
patterns/infinite-retry-by-failure-class — class-specific retry policy keyed on the provider's four-class failure taxonomy (expired + rate-limited = infinite, refused = max 2×, invalid-image = optional with pre-check on retry #2).
patterns/csv-in-parquet-intermediate-output-merge — accept CSV at boundary, Parquet internally, output format mirrors input; 25× compression wins at scale.
concepts/llm-batch-api — the provider API surface Maple abstracts (50K / 200 MB / 24 h SLA / ~50% cost discount).
concepts/provider-failure-taxonomy — the four-class Maple framework for typed-failure dispatching.
concepts/stream-based-file-processing — memory-safety discipline load-bearing at 10M+ prompt scale.
concepts/cost-tracking-per-team — the AI-Gateway-level governance primitive.
concepts/durable-execution — Maple's Temporal-backed property; sharpens the motivation beyond crash recovery to cost protection (LLM batch APIs bill on submit, not on completion).
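The class-specific retry policy listed above (expired and rate-limited retried indefinitely, refused capped at two, invalid-image optional) can be sketched as a small lookup. This is an illustration under assumptions: "max 2×" is read as at most two retries, the invalid-image pre-check is left out, and `Failure`, `RETRY_LIMITS`, and `should_retry` are hypothetical names rather than Maple's API.

```python
import math
from enum import Enum


class Failure(Enum):
    EXPIRED = "expired"
    RATE_LIMITED = "rate_limited"
    REFUSED = "refused"
    INVALID_IMAGE = "invalid_image"


# Retry ceilings per failure class; math.inf models "retry until it succeeds".
RETRY_LIMITS = {
    Failure.EXPIRED: math.inf,
    Failure.RATE_LIMITED: math.inf,
    Failure.REFUSED: 2,  # assumed reading of "max 2x"
}


def should_retry(
    failure: Failure,
    retries_so_far: int,
    invalid_image_retries: int = 0,
) -> bool:
    """Decide whether a failed prompt goes back into the next batch."""
    if failure is Failure.INVALID_IMAGE:
        # Optional class: the caller opts in by granting a retry budget.
        limit = invalid_image_retries
    else:
        limit = RETRY_LIMITS[failure]
    return retries_so_far < limit
```

Keying the decision on the failure class rather than on a single global retry count is what lets transient provider-side failures retry freely while content-level refusals stop quickly.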

Structured attribute extraction (2025-08-01 PARSE post)

patterns/llm-attribute-extraction-platform — canonical instance at Instacart: one platform consolidating per-attribute SQL rules + per-attribute ML models into declarative LLM-driven config.
patterns/low-confidence-to-human-review — proactive error detection: low-confidence extractions route to human auditors before catalog ingestion.
patterns/human-in-the-loop-quality-sampling — orthogonal drift-detection loop: periodic random sample reviewed by humans + LLM-as-judge.
patterns/multi-attribute-multi-product-prompt-batching — future-work cost-reduction: batch attributes-per-product or products-per-attribute to amortise shared-context tokens.
patterns/llm-extraction-cache-by-similarity — future-work cost-reduction: cache extraction results keyed by product-similarity function (blocked on duplicate-product detection).
concepts/llm-self-verification — entailment prompt + yes-token logit → per-extraction confidence score. Cites AutoMix [2] as literature basis.
concepts/llm-cascade — per-attribute cheap-vs-expensive LLM choice; the motivating numbers are the cheap LLM's 70% cost reduction on simple attributes and its 60% accuracy drop on hard attributes.
concepts/multi-modal-attribute-extraction — cross-modal (text + image) reasoning for attributes like sheet_count whose value may be image-only or require text+image cross-reference. +10% recall over text-only.
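The self-verification and low-confidence routing entries above can be combined into one small sketch. It assumes the confidence is a two-way softmax over the "yes" and "no" token logits of the entailment prompt, which the entry only summarises as "yes-token logit"; the 0.8 threshold and the `yes_confidence` and `route` names are hypothetical.

```python
import math


def yes_confidence(yes_logit: float, no_logit: float) -> float:
    """Probability of "yes" under a softmax restricted to the yes/no tokens."""
    # A two-way softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def route(value: str, yes_logit: float, no_logit: float, threshold: float = 0.8) -> dict:
    """Attach a confidence score and pick where the extraction goes next."""
    confidence = yes_confidence(yes_logit, no_logit)
    # Assumed policy: anything under the threshold goes to human auditors first.
    destination = "catalog" if confidence >= threshold else "human_review"
    return {"value": value, "confidence": confidence, "destination": destination}
```

In practice the threshold would likely be tuned per attribute, since the entries above note that attribute difficulty varies widely.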

Image-generation platform (2025-07-17 PIXEL post)

patterns/unified-image-generation-platform — canonical instance at Instacart: one platform fronting multiple models with unified parameter translation, prompt-template defaults, VLM quality gate, fine-tunes, and infra integration.
patterns/vlm-evaluator-quality-gate — the four-step loop (prompt-LLM → generate → VLM-judge → failed-questions-fed-back) that raised approval rate 20% → 85%.
patterns/prompt-template-library — per-application prompt templates with few-shot exemplars encoding lighting / background / composition defaults.
patterns/fine-tuned-model-per-product-category — DreamBooth fine-tunes for unbranded produce + meat categories.
concepts/unified-parameter-protocol — style / size / cfg_scale normalised across providers; model swap is a model-name string edit.
concepts/cross-model-portability — the consequence: "the best performing model varied project by project" so portability is load-bearing.
concepts/model-agnostic-ml-platform — platform stance.
concepts/self-serve-generative-ai — UI usable by anyone at Instacart regardless of technical background.
concepts/vlm-as-image-judge — the core quality-evaluation primitive.
concepts/iterative-prompt-refinement — the loop structure (4 steps: prompt → generate → score → feed-failed-questions-back).
concepts/few-shot-prompt-template — the prompt-template primitive.
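The four-step loop named in this section (prompt, generate, judge, feed failed questions back) can be sketched with injected callables. This is a hedged illustration rather than PIXEL's code: `refine_until_approved`, the three callables, and the `max_rounds` cap are assumptions, and the judge is modelled as returning the list of questions the image failed.

```python
from typing import Callable, List, Optional, Tuple


def refine_until_approved(
    brief: str,
    write_prompt: Callable[[str, List[str]], str],  # prompt-writing LLM
    generate: Callable[[str], bytes],               # image-generation model
    judge: Callable[[bytes], List[str]],            # VLM judge: failed questions
    max_rounds: int = 3,
) -> Tuple[Optional[bytes], int]:
    """Regenerate until the judge reports no failed questions or rounds run out."""
    failed: List[str] = []
    for round_number in range(1, max_rounds + 1):
        # Failed questions from the previous round steer the next prompt.
        prompt = write_prompt(brief, failed)
        image = generate(prompt)
        failed = judge(image)
        if not failed:
            return image, round_number
    return None, max_rounds  # nothing passed; leave the decision to a human
```

Passing the three stages in as callables mirrors the model-agnostic stance above: swapping the generator or the judge does not change the loop.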

Recent articles

2026-06-02 — Semantic IDs: Product Understanding at Scale → sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — Instacart catalog-ML team (Shrikar Archak, Karuna Ahuja, Soroush Sobhkhiz, Marko Avdalovic, Xiyu Wang, JiChao Zhang, Hao Yan, Chris Hartley) deep-companion to the same-day From Scoring to Spelling ads-retrieval post — that post canonicalised the SID consumer (generative ads retriever); this post canonicalises the SID generator (RQ-VAE training methodology + intrinsic evaluation + downstream-uses roadmap). Five load-bearing disclosures: (1) catalog taxonomy as graded supervision for an RQ-VAE contrastive loss term — pair labels from tree-distance (same-leaf strong+ / sibling-leaf moderate+ / no-shared-ancestor −); explicitly framed as the cold-start-compatible alternative to PLUM's engagement-data approach ("using our catalog taxonomy as the supervision signal rather than engagement data (which isn't available for cold-start products)") — see concepts/contrastive-regularization-with-catalog-structure
+ patterns/contrastive-loss-via-taxonomy-tree. (2) hierarchical batch sampling to give the contrastive loss positive signal in each batch — pick parent → ~half batch from its children → rest from unrelated → multi-sample within each category slot, "no explicit pair labeling is needed — the catalog structure does the work" — see concepts/hierarchical-batch-sampling-for-contrastive-loss. (3) two flavors of codebooks — same RQ-VAE + contrastive loss + catalog supervision against different upstream embeddings: ESCI (precision) via the in-house ESCI search-relevance model (trained on query-product matching with Exact / Substitute / Complementary / Irrelevant labels) → tight substitute clusters (e.g. Whole Bean Coffee 0_8_55_72) → substitution / search / reordering; ESCI+Gemma (discovery) via Gemini Flash (~10× faster, ~5× cheaper) attribute-extraction (product type / ingredients / dietary tags / format) + marketing-copy stripping → off-the-shelf Gemma embedding → broader thematic clusters → homepage feeds / cross-selling / exploration; "Neither is universally better. The key is matching the right flavor to the right surface" — see concepts/precision-vs-discovery-codebook-flavor + patterns/two-flavor-codebook-precision-vs-discovery + patterns/llm-attribute-extraction-before-embedding. 
(4) intrinsic evaluation suite — three complementary metrics evaluate codes directly rather than only via downstream metrics: similarity-depth correlation (Spearman 0.69–0.84 between embedding cosine similarity and shared-prefix depth; ≥0.9-cosine pairs share L1 at 98–99% declining to 18–37% at L4) — see concepts/similarity-depth-correlation; LLM-based cluster evaluation scoring leaf clusters on functional coherence + purchase likelihood + customer journey relevance, validating that ESCI scores higher on substitutability while ESCI+Gemma scores higher on thematic coherence — see concepts/llm-based-cluster-evaluation; taxonomy alignment with disagreements becoming audit signal — see patterns/intrinsic-evaluation-of-discrete-codes. (5) catalog-audit dual-use — when SIDs disagree with taxonomy labels the label is often wrong (Protein Bar filed under Candy clusters with Sports Nutrition; Sparkling Water filed under Soda clusters with sparkling waters); the in-progress audit pipeline (mismatch flagging + cluster-fit confidence scoring + prioritized human-review queues) turns the recsys primitive into catalog-quality infrastructure; "What started as a recommendation primitive is becoming infrastructure for ongoing catalog health" — see concepts/code-vs-label-mismatch-as-catalog-audit + patterns/semantic-code-as-catalog-audit. 
Operational numbers disclosed: ~2,000 codeword tokens for the entire catalog (the concrete vocabulary-bottleneck-escape datum); 4 codewords per product at hierarchical granularity levels; λ = 0.01 contrastive loss weight; loss formula L_total = L_reconstruction + L_rq + λ · L_contrastive; Spearman 0.69–0.84 similarity-depth correlation; carousel A/B uplift confirmed: +34% add-to-carts, 2.7× more emerging brands surfaced, with "tail categories saw the largest gains, precisely because semantic IDs gave those products a representation the old model couldn't" — see concepts/tail-category-coverage (canonical wiki concept established this ingest, distinct from cold-start). Production cluster examples disclosed (under prefix 6_19_): 6_19_32 Italian cheeses; 6_19_24 specialty cheeses; 6_19_12 olives; 6_19_7 tapenades; 6_19_9 deli trays + dips; 6_19_14 croutons — "No one wrote a rule connecting Pecorino Romano to Kalamata olives to olive tapenade. The model learned that these products inhabit the same culinary universe." And finer-grain within 6_19_32: 6_19_32_4 fresh mozzarella; 6_19_32_16 blue cheeses; 6_19_32_63 hard Italian cheeses (Parmigiano, Pecorino, Asiago); 6_19_32_70 ricotta salata. Failure modes disclosed: two divergent-code cases (Riesling wines 0_19_52_63/0_31_52_88 at 0.86 cosine but L1-only match; team apparel 1_19_21_20/1_7_41_59 at 0.95 cosine but L1-only match) attributable to sparse text — "Products with rich descriptions and complete catalog metadata produce more stable codes." Spread of SIDs: now power product retrieval / replacement recommendations / next-item prediction; planned for product detail page recommendations / cart assistant suggestions / ranking features; "Looking ahead, we're bringing them to product detail page recommendations, cart assistant suggestions, and ranking features, particularly to address cold start where they have the most leverage." 
What's next: incorporating engagement-based contrastive signals (substitution patterns, co-purchase data) following PLUM's approach. Tenth Instacart source on the wiki and second to canonicalise the SID architecture (the consumer side was the first). Together the two posts establish Instacart's SID system as the most thoroughly canonicalised generative-retrieval substrate on the wiki alongside Meta SilverTorch.
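The shared-prefix depth, the graded taxonomy labels, and the disclosed loss formula from the 2026-06-02 entry can be illustrated with a minimal Python sketch. The function names and the taxonomy-path representation (a tuple of category names from root to leaf) are assumptions made for illustration, not Instacart's code; only the SID strings, the three label grades, and λ = 0.01 come from the entry above.

```python
from typing import Optional, Sequence


def shared_prefix_depth(sid_a: str, sid_b: str) -> int:
    """Count the leading codewords two Semantic IDs have in common."""
    depth = 0
    for a, b in zip(sid_a.split("_"), sid_b.split("_")):
        if a != b:
            break
        depth += 1
    return depth


def taxonomy_pair_label(path_a: Sequence[str], path_b: Sequence[str]) -> Optional[str]:
    """Graded pair label from tree distance (hypothetical encoding).

    Paths are root-to-leaf category names. Only the three grades named in the
    entry are produced; pairs that fall between them return None.
    """
    if tuple(path_a) == tuple(path_b):
        return "strong_positive"      # same leaf
    if path_a[0] != path_b[0]:
        return "negative"             # no shared ancestor
    if tuple(path_a[:-1]) == tuple(path_b[:-1]):
        return "moderate_positive"    # sibling leaves under one parent
    return None                       # grade not specified in the source


def total_loss(l_reconstruction: float, l_rq: float, l_contrastive: float,
               lam: float = 0.01) -> float:
    """L_total = L_reconstruction + L_rq + lambda * L_contrastive."""
    return l_reconstruction + l_rq + lam * l_contrastive
```

Applied to the two disclosed failure cases, `shared_prefix_depth("0_19_52_63", "0_31_52_88")` and `shared_prefix_depth("1_19_21_20", "1_7_41_59")` both return 1, which is the "L1-only match" the entry describes.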
2026-05-14 — Scaling Personalized Marketing for Multi-Tenant Commerce Platforms → sources/2026-05-14-instacart-scaling-personalized-marketing-for-multi-tenant-commerce-platforms — Instacart Engineering post (authors: Brent Scheibelhut, Ryan Martin, Shradha Menon; team contributions across Growth Engineering / Infrastructure / Marketing / Product) on how Instacart extended a single-tenant Marketplace marketing stack into a multi-tenant marketing-automation platform for Storefront Pro's 350+ retailers without sacrificing tenant isolation, performance, or brand integrity. Five-stage pipeline: (1) Instacart-built React console where retail marketers configure campaigns + template variables + audiences; (2) shared Campaigns Engine evaluates audiences, assigns experiment variants, generates offers, and emits one event per matched customer to a streaming platform; (3) a stream consumer rebatches per-user events into groups of up to 50 to match the third-party provider's batch-send API; (4) the CRM Service (Rails engine
+ Sidekiq async workers) validates idempotency, routes each batch to the correct retailer workspace, assembles personalized content, and sits behind a deliberate vendor-abstraction service layer (named verbatim: "flexibility to change providers in the future, support multiple providers at once, and avoid tightly coupling our core platform to any one vendor"); (5) the third-party provider sends through isolated per-retailer workspaces that hold each retailer's customer data, templates, IP allocation, and rate-limit budget. Canonical architectural move: the third-party-vendor workspace becomes the tenant boundary — a new shape on the wiki's tenant-isolation spectrum (shape 7) where the vendor, not the platform, operates the isolation primitives. Around the workspace primitive Instacart builds the operational machinery the vendor doesn't provide: automated IP warming with a feedback control loop monitoring bounce / spam-complaint / engagement metrics (50–1,000/day → full volume over 4–6 weeks; capacity expansion auto-triggered on threshold breach — see patterns/automated-ip-warming-with-deliverability-feedback), CI/CD-driven Liquid template deployment from a metadata file (hours of manual work → minutes; see patterns/template-deployment-via-cicd-metadata-file), and self-service template editor with live email + push preview. Disclosed outcomes: hundreds of thousands → millions of personalized messages per campaign; 99.9% delivery success across all retailers; sub-minute template updates; zero cross-retailer data-leakage incidents; retail partners self-service campaigns without engineering involvement — the platformization payoff. 
Operational tradeoffs: shared IPs across retailers "where appropriate" (cost vs cross-tenant reputation contamination); centrally-owned retailerName@example.com from-address (operational automation vs brand-domain purity, no per-retailer SPF/DKIM coordination tax); rebatching per-user events into batches of 50 (vendor's batch-API max dictates the size, not consumer-side latency optimum). Created (14 new pages): 1 source + 4 systems (systems/instacart-storefront-pro, systems/instacart-marketplace, systems/instacart-campaigns-engine, systems/instacart-crm-service) + 4 concepts (concepts/per-tenant-workspace-isolation, concepts/per-tenant-rate-limit, concepts/ip-warming, concepts/sender-reputation) + 5 patterns (patterns/per-tenant-workspace-in-third-party-saas, patterns/stream-rebatch-for-downstream-batch-api, patterns/vendor-abstraction-service-layer, patterns/template-deployment-via-cicd-metadata-file, patterns/automated-ip-warming-with-deliverability-feedback). Extended: concepts/tenant-isolation gains a seventh shape (per-tenant workspace inside a third-party SaaS) on the wiki's isolation spectrum, the only shape where the vendor operates the boundary; concepts/noisy-neighbor gains a seventh response axis (per-tenant rate-limit budgets in a third-party SaaS API as a structural noisy-neighbor mitigation distinct from the within-host EBS / S3 / Netflix shapes). Ninth Instacart source on the wiki — first ingest covering the multi-tenant infrastructure for Storefront Pro itself (prior Instacart ingests have all been ML-platform stories on Marketplace or cross-cutting). Caveats: third-party vendor never named; streaming substrate not named; per-retailer rate-limit values, idempotency-key TTL, and worker-pool sizing not disclosed; A/B experimentation plane referenced but not designed; the AI-Driven Optimization and What's Next section is forward-looking aspiration (adaptive campaigns, AI-assisted content generation, multi-channel intelligence) rather than architecture.
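The stream-rebatch step in the 2026-05-14 entry (one event per matched customer regrouped into batches of up to 50) can be sketched as below. The event shape, the `retailer_id` field, and the choice to keep each batch single-retailer are assumptions made so a batch can be routed to one workspace; the post does not describe the consumer at this level. A production consumer would presumably also flush partial batches on a timer, which this sketch omits.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

MAX_BATCH = 50  # the vendor batch-send API maximum named in the entry

Event = Dict[str, object]


def rebatch_by_retailer(events: Iterable[Event],
                        max_batch: int = MAX_BATCH) -> Iterator[Tuple[str, List[Event]]]:
    """Group per-user events into per-retailer batches of at most max_batch."""
    pending: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        key = str(event["retailer_id"])  # hypothetical field name
        pending[key].append(event)
        if len(pending[key]) == max_batch:
            # Emit a full batch and start a fresh one for this retailer.
            yield key, pending.pop(key)
    # End of stream: flush whatever is left, one partial batch per retailer.
    for key, batch in pending.items():
        if batch:
            yield key, batch
```

With 120 events for one retailer and 10 for another, this yields batches of 50, 50, and 20 for the first and a single batch of 10 for the second.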
2026-05-04 — Empowering Carrot Ads with Domain Adaptive Learning → sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning — Instacart Engineering post (authors: Trey Zhong, Xiyu Wang; contributors: Joseph Haraldson, Sharad Gupta, Sarah Lamacchia) on how the Carrot Ads omnichannel retail-media platform onboards new retailer partners with a performant pCTR model from day one despite the new-partner cold-start problem. The recipe is Domain Adaptive Learning, a subset of transfer learning applied at two layers: (1) neural-network level — shared shopping-context-pre-trained embedding layers reused across all partners, feature transfer (wide ⇄ explicit, deep ⇄ pre-trained dense), selective fine-tuning of partner-specific layers, generalization via reuse; (2) training-data level — Marketplace-as-source corpus, catalog taxonomy alignment between source and target, per-partner feature trimming driven by feature-importance analysis to honor real-time auction latency budgets and accommodate variable per-partner feature availability. Backbone is a textbook Wide-and-Deep pCTR model — "This architecture combines a linear 'wide' model (for memorization of specific feature interactions) with a 'deep' neural network (for generalization)" — which has the unusually clean property that its two arms map onto DAL's source/target separation at the layer level. Counter-intuitive disclosed property: DAL outperforms from-scratch training even when the target partner has enough data to train independently — "because of the benefits from Instacart's first party data"; the source-domain Marketplace data is the structural moat, not just a cold-start mitigation. Reported lift across search ads + product-category ads (higher CTR / clicks-per-user / revenue, no percent disclosed). 
Gating risk: negative transfer, guarded today by human-in-the-loop schema mapping + alignment verification; future automated Domain Adaptation Platform is planned to detect domain shifts and streamline partner onboarding. Sixth Instacart ML-platform story on the wiki (Carrot Ads joins PIXEL / PARSE / Maple / LACE / Generative Recommendations), continuing the "platformise a specific ML capability + reuse proven backbones" arc — here applied to the retail-media vertical. Created (8 new pages): 1 source + 2 systems (systems/instacart-carrot-ads, systems/instacart-carrot-ads-pctr-model) + 5 concepts (concepts/transfer-learning, concepts/domain-adaptive-learning, concepts/source-and-target-domain, concepts/wide-and-deep-architecture, concepts/negative-transfer, concepts/feature-taxonomy-alignment) + 2 patterns (patterns/cross-domain-warm-start-via-shared-embeddings, patterns/per-partner-feature-trimming-for-auction-latency). Extended concepts/cold-start with a third recsys cold-start sub-case (new-domain / new-partner, distinct from new-item and new-user) and concepts/ctr-prediction with a multi-tenant retail-media canonical instance. Caveats: no quantitative lift / latency / partner-count numbers; per-partner serving topology undocumented; the "shared embeddings pre-trained on shopping contexts" corpus is unspecified; negative-transfer detection is asserted-not-measured.
2026-02-26 — Our Early Journey to Transform Instacart's Discovery Recommendations with LLMs → sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — Instacart Engineering post (authors: Moein Hasani, Hamidreza Shahidi, Trace Levinson, Guanghua Shu) announcing an early-stage rebuild of the Shopping Hub discovery surface on a new generative AI content platform. Legacy Shopping Hub was human-authored placements (each placement explicitly defined with title, visual assets, retrieval sources; joined a static content library served uniformly across all users) — named limitations: expensive + slow to scale personalized content and cross-placement chaos from siloed teams creating placements independently. Three north-star tenets for the rebuild: delightful personalization, cross-placement cohesion, adaptability to shifting business objectives. Canonical architectural move: top-down generation beats bottoms-up — generate ordered themed placements first, then generate products per theme (bottoms-up alternative of generate-all-products-then-cluster rejected because "we felt our adaptability goal would be put at risk"). Four-phase cascaded pipeline (patterns/top-down-cascaded-page-generation): Phase 1 — page design + theme generation via LLM with constrained decoding against a structured schema; emits ordered themed placements plus derived signals (user personas + freeform product concepts like "eggs") to avoid redundant Phase-2 context passthrough. 
Phase 2 — retrieval keyword generation via a teacher-student fine-tune (frontier teacher → LLM-judge-filtered training data → fine-tuned internal student with ablations across Llama + Qwen families + LoRA at varying ranks) plus RAG candidate pruning: Phase-1's freeform concepts are embedded, k-NN retrieves ~100 nearest neighbours from a 300,000-term keyword corpus, only the pruned subset passed to Phase-2 LLM — 15–20% all-in cost reduction per generation, explicitly named as "a core motivator for adopting a cascaded generation architecture." Phase 3 — quality + diversity filtering stack: (i) embedding-similarity deduplication across placements; (ii) multi-level LLM-as-judge on a small user fraction (page / placement / product levels); (iii) fine-tuned DeBERTa cross-encoder (patterns/fine-tuned-cross-encoder-as-filter) that scores theme-product relevance for every placement's products — >99% cost reduction vs LLM inference, the economic unlock that lets evaluation become action at full catalog scale; (iv) business + policy guardrails (canonical forbidden pairing: "alcoholic products for a child's birthday party"). Instacart's explicit framing: "LLM-as-a-judge… guided us well at the averages, it failed at the edges. Since evaluating millions of candidates is cost-prohibitive, LLMs are unable to take action and improve quality at scale." Phase 4 — existing product + placement ranking services, unchanged — the cascade's finalised outputs are cached for runtime retrieval; existing rankers consume them. Same "wrap new generative-AI primitives around existing mature infra, don't replace it" stance as PIXEL / PARSE / Maple. 
Three-prong evaluation framework: (a) LLM-as-judge at three hierarchy levels — page (cohesion + coverage), placement (title + brand + user-preference alignment), product (recall + thematic alignment); tuned to pass "high human-alignment thresholds" via HITL workflows; (b) fine-tuned DeBERTa at scale on the specific dimensions LLM-as-judge hits diminishing returns on; (c) classical ML + metric-based evaluators (avg fraction of products in user's purchase history, predicted engagement score from existing rankers, avg products per placement density). Evals are "a massive accelerant" — upfront investment compounds into iteration velocity. Numbers NOT disclosed (in keeping with Instacart's early-journey framing): latency per phase, end-to-end latency, A/B production outcomes, specific teacher LLM + student model names, LoRA rank, fine-tune dataset size, cost per page, cold-start strategy, Phase-3 cache TTL. Eighth Instacart source on the wiki and fifth Instacart ML-platform story — extending the "platformise generative AI + keep existing mature ranking infra" architectural stance from PIXEL (image generation) / PARSE (structured extraction) / Maple (batch LLM) / Intent Engine (query understanding) into discovery content generation. Canonical wiki instances of patterns/top-down-cascaded-page-generation + patterns/rag-candidate-pruning-cascade + patterns/fine-tuned-cross-encoder-as-filter + patterns/llm-as-judge-multi-level-rubric + concepts/generative-recommendations + concepts/top-down-vs-bottoms-up-generation + concepts/cascaded-llm-generation + concepts/placement-theme-cohesion + concepts/constrained-decoding-structured-output.
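The RAG candidate-pruning step in Phase 2 of the 2026-02-26 entry (embed a freeform concept, keep roughly the 100 nearest keywords, pass only those to the LLM) can be sketched with a brute-force cosine k-NN. The function names and the toy vectors are hypothetical, and the embedding model is not named in the post; at a 300,000-term corpus an approximate-nearest-neighbour index would be the more likely choice, which this sketch does not attempt.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_candidates(concept_vec: Sequence[float],
                     keyword_vecs: Dict[str, Sequence[float]],
                     k: int = 100) -> List[str]:
    """Return the k keywords whose embeddings are closest to the concept."""
    ranked = sorted(keyword_vecs.items(),
                    key=lambda item: cosine(concept_vec, item[1]),
                    reverse=True)
    return [keyword for keyword, _ in ranked[:k]]


# Toy usage: only the pruned keywords would be placed in the Phase-2 prompt.
corpus = {"eggs": [1.0, 0.0], "egg whites": [0.9, 0.1], "detergent": [0.0, 1.0]}
shortlist = prune_candidates([1.0, 0.0], corpus, k=2)
```

The cost saving in the entry comes from the prompt carrying the shortlist instead of the full corpus; the pruning itself is ordinary nearest-neighbour retrieval.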
2026-02-17 — Turning Data into Velocity: Caper's Edge and Cloud Data Flywheel with Capsight → sources/2026-02-17-instacart-turning-data-into-velocity-capers-edge-and-cloud-data-flywheel-with-capsight — Instacart Engineering post (authors: Youming Luo, Andrew Tanner, Matas Sriubiskis, Sylvia Lin, Sikun Zhu, Lei Li, Xiao Zhou) introducing Capsight, the edge→cloud data flywheel for Instacart's Caper smart-cart fleet. Three-component architecture (Collector on-device → Depot in the cloud → Ray-based Learner); closed loop Collect → Manage → Label → Train → Deploy. Core problem named: Caper models trained on manually-collected data underfit production diversity (concepts/production-data-diversity: lighting, occlusion, damaged packaging, motion blur, store-specific SKUs); each cart emits "gigabytes" of multi-modal data; end-to-end iteration cycle was a month; annotation cost grew linearly with fleet size by default. Load-bearing design goal: iteration cost must not grow linearly with deployment size. Collector: trigger-based capture (activity signal + recognised barcode), dedicated hardware video encoder + dedicated weight/location protocol for zero regression on the cart's primary AI tasks (concepts/hardware-offload), resilient uploader (bandwidth-aware to not hurt retailer store networks + storage-threshold-pauses-collection + auto-cleanup-oldest-on-upload-failure). Depot: distributed ingestion/processing, metadata indexing + searchable web UI (observability of the fleet), and the cost-moving innovation — VLM + teacher-model pre-labelling (patterns/vlm-assisted-pre-labeling) where empty backgrounds are auto-filtered, a VLM plus internal teacher models generate pre-labels for items + barcodes, and humans correct rather than create. Projected >70% annotation cost reduction; multi-day tasks → hours; same pipeline cleans historical ground-truth errors. 
Learner: "distributed, Ray-based training platform" (Ray is now canonical for Capsight training) with automated evaluation against standard test sets; drops model training stage from 1 week to 2 days. End-to-end outcomes: iteration cycle 1 month → 1 week; early models trained on Capsight-curated data show >5% accuracy improvement within weeks, with continued gains as fleet scales. Future work: full multi-modal sensor-fusion foundation model (concepts/multi-modal-attribute-extraction applied to real-world physical-environment understanding), intent detection for complex multi-item interactions, automatically-surfaced highest-value training data. Sixth Instacart source on the wiki; first one crossing the edge/cloud boundary and the first ML-data-platform source on the wiki (PIXEL / PARSE / Maple are ML-serving platforms; Capsight is an ML-training-data platform). Canonical wiki instances of concepts/edge-cloud-data-flywheel + concepts/production-data-diversity + patterns/distributed-fleet-as-data-pipeline + patterns/trigger-based-edge-capture + patterns/vlm-assisted-pre-labeling + patterns/resilient-edge-uploader.
2026-02-09 — From Print to Digital: Making Weekly Flyers Shoppable at Instacart Through Computer Vision and LLMs → sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable — Instacart Engineering post (author: Prithvi Srinivasan per inline Medium byline) on the internal flyer-digitization pipeline. Two-phase architecture: Phase 1 segments each flyer image into per-product bounding boxes; Phase 2 matches each box to a concrete Instacart catalog SKU via OCR + LLM + internal search. Before/after: 3–4 hours manual work per flyer → <30 minutes end-to-end after automation. Rejected approaches: FoodSAM (food-specific SAM variant) for insufficient product breadth; pure multimodal-LLM bounding-box prediction on complex flyers for imprecision; classical contour detection standalone for noise. Shipped Phase-1 architecture: complexity-tiered routing — simple flyers use iterative-grid multimodal-LLM probing (~90% accuracy); complex flyers use SAM +
four post-processing stages (text-box removal, WBF box-merging, SAM + contour ensemble gated per-retailer on flyer density, heuristic + ML filters on aspect ratio + size). WBF vs NMS is explicitly motivated: NMS "may discard valuable information by eliminating lower-confidence boxes" — cited prior art: +3–10% mAP in medical-imaging ensembles. Phase-2 is captured-body truncated — named challenges (multi-item deals → N-SKU, generic produce with no branded text to OCR) are documented but the LLM stack + catalog-search integration are not elaborated in the captured body. Canonical wiki instances of patterns/hybrid-cv-plus-llm-pipeline + patterns/complexity-tiered-model-selection. Fourth Instacart source on the wiki; fourth Instacart visual-ML system alongside PIXEL (generation), PARSE (attribute extraction), and the Caper Jetpack-Compose migration.
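The WBF-versus-NMS contrast in the 2026-02-09 entry can be made concrete with a deliberately simplified box-fusion sketch: overlapping boxes are averaged with confidence weights instead of all but the top-scoring one being dropped. This is not the full Weighted Boxes Fusion algorithm (which also rescales confidence by the number of contributing models) and not Instacart's implementation; the function names, the 0.55 IoU threshold, and the mean-score rule are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def _fuse(members: List[Tuple[Box, float]]) -> Tuple[Box, float]:
    """Confidence-weighted average box and mean score of a cluster."""
    total = sum(score for _, score in members)
    box = tuple(sum(b[i] * s for b, s in members) / total for i in range(4))
    return box, total / len(members)


def fuse_boxes(boxes: Sequence[Box], scores: Sequence[float],
               iou_thr: float = 0.55) -> List[Tuple[Box, float]]:
    """Cluster boxes by IoU against the running fused box, then average."""
    clusters: List[List[Tuple[Box, float]]] = []
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    for i in order:
        for members in clusters:
            if iou(boxes[i], _fuse(members)[0]) >= iou_thr:
                members.append((boxes[i], scores[i]))  # keep the information
                break
        else:
            clusters.append([(boxes[i], scores[i])])
    return [_fuse(members) for members in clusters]
```

Where NMS would keep only the 0.9-confidence box from two overlapping detections, this sketch returns a box pulled toward the lower-confidence one in proportion to its score, which is the "valuable information" the quoted motivation refers to.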
2026-02-03 — Migrating to Jetpack Compose: How AI Accelerated Our Journey at Caper → sources/2026-02-03-instacart-migrating-to-jetpack-compose — Instacart Engineering post (author Matt Kranzler) on migrating the Android app powering Caper smart carts (AI + computer-vision in-store scan-and-pay carts, stability-critical hardware) from Fragments + XML layouts to Jetpack Compose via a deliberate four-phase plan accelerated by AI coding assistants. Phase 1 (manual, no AI) removed explicit Fragment wrappers using Google's navigation-fragment-compose for implicit Fragment hosts; established the load-bearing outer-parameterless / inner-testable Composable split that makes the eventual Phase-4 Compose-Navigation migration cheap. Phase 2 migrated 30+ sub-navigation graphs and 130+ destinations from XML resource-ID navigation to type-safe Kotlin DSL via an iterative AI workflow (Learn-by-Doing → Git-Diff-as-Context → Correct-and-Refine → Update-the-Guide → Repeat); reported 5–7× faster, 300–350 engineering hours saved, "migrations previously too tedious to justify" became feasible. Phase 3 converts 100+ Fragment features to pure Compose via a 17-step AI skill with engineer verification checkpoints across four stages (Analysis+Baselining using Paparazzi screenshots / Compose Implementation / Verification+Integration with visual-parity diff / Cleanup); formalised from an earlier 325+ line markdown migration guide into a structured Agent Skill after 5–6 prior migrations made the workflow predictable — "Skills enable progressive disclosure of information, allowing the AI to access exactly what it needs at each step without overwhelming the context window." Phase 4 (Compose Navigation) runs in parallel with the tail of Phase 3 behind feature flags. 
Four named principles for AI-assisted refactoring: (1) the economics of technical debt have changed, (2) treat AI instructions as code — the guide "is effectively a program that the AI executes," triple-duty (AI executes, humans checklist, reviewers verify), (3) incrementalism mitigates AI risk, (4) invest in the workflow not just the tool — when Agent Skills emerged mid-project the workflow evolved. Engineer-role shift named: "from execution to definition and validation" — architecture + pattern definition + oversight, not typing the thousands of mechanical edits. Scaling claim (directional, not quantified): "other engineering teams across Instacart are now using AI skills to tackle their own large-scale refactoring challenges" — explicit playbook posture vs. single-migration retrospective. Canonical wiki instances of patterns/phased-framework-migration + patterns/ai-migration-skill-workflow + patterns/visual-parity-screenshot-gate + concepts/ai-assisted-refactoring-economics + concepts/ai-instructions-as-code. First Instacart mobile-platform source on the wiki (after PIXEL / PARSE / Maple / Intent Engine on the ML-platform axis) and first wiki instance canonicalising AI-skill-driven mobile UI framework migration. Fifth Instacart source on the wiki.
2025-11-13 — Building The Intent Engine: How Instacart is Revamping Query Understanding with LLMs → sources/2025-11-13-instacart-building-the-intent-engine — Instacart Engineering post replacing the legacy query-understanding stack (multiple bespoke ML models) with an LLM-backed Intent Engine, layered across three progressively-more-invasive adaptation techniques: prompting → context-engineering (RAG) → fine-tuning. Three QU sub-tasks rebuilt: (1) query category classification via retrieve-top-K-converted-categories → LLM re-rank with injected Instacart context → semantic-similarity guardrail filter — replaces legacy flat-multi-class FastText that emitted taxonomically-inconsistent pairs and lacked world knowledge; (2) query rewrites via three specialised prompts — Substitutes / Broader / Synonyms — each with chain-of-thought + few-shot exemplars + post-processing relevance guardrail — lifted coverage from legacy ~50% to >95% at 90%+ precision; (3) SRL (query tagging — product/brand/attribute) via a load-bearing hybrid cache + real-time fine-tuned model architecture. SRL deep-dive is the post's substance: offline RAG "teacher" pipeline (conversion data + catalog + brand-embedding similarity + frontier LLM + post-processing guardrail) is dual-purposed — its output populates the head-query cache AND becomes the supervised training set for a Llama-3-8B + LoRA student. Latency path for the student, out-of-box to production: ~700 ms on A100 → 300 ms target after adapter merging + H100 upgrade; FP8 quantization gave another 10% but was not shipped because of a slight recall regression; GPU autoscaling at off-peak manages cost. Only 2% of queries hit the real-time model; ~98% served from cache. Production quality: precision 96.4% vs 95.4% frontier baseline, recall 95.0% vs 96.2%, F1 95.7% vs 95.8% — F1-parity with precision-bias. 
A/B outcomes: 6% reduction in average scroll depth on tail queries, 50% reduction in user complaints on tail-query search quality, millions of cold-start queries served weekly. Strategic framing: "A generic LLM is a commodity; your business context is what makes your application defensible, because domain knowledge is the most valuable asset." Authors: Yuanzheng Zhu, Guanghua Shu, Raochuan Fan, Vinesh Gudla, Tejaswi Tenneti. Third Instacart source on the wiki — extends the PIXEL (content generation) + PARSE (structured extraction) platform-consolidation pattern-graph into the retrieval-relevance axis; same "stop every team from DIY'ing this" architectural stance, different data surface (search queries). Canonical wiki instances of patterns/head-cache-plus-tail-finetuned-model + patterns/offline-teacher-online-student-distillation.
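The head-cache-plus-tail-model routing from the 2025-11-13 entry, and the arithmetic behind its "F1-parity" claim, can be sketched as follows. The class name, the query normalisation, and the callable standing in for the fine-tuned student are hypothetical; the precision and recall figures are the ones quoted in the entry.

```python
from typing import Callable, Dict, Tuple

Tags = Dict[str, str]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


class HeadCacheTailModel:
    """Serve head queries from a precomputed cache, the tail from a model."""

    def __init__(self, head_cache: Dict[str, Tags],
                 tail_model: Callable[[str], Tags]) -> None:
        self.head_cache = head_cache    # filled offline by the teacher pipeline
        self.tail_model = tail_model    # stand-in for the real-time student

    def tag(self, query: str) -> Tuple[Tags, str]:
        key = query.strip().lower()     # assumed normalisation
        if key in self.head_cache:
            return self.head_cache[key], "cache"
        return self.tail_model(key), "model"


tagger = HeadCacheTailModel(
    head_cache={"milk": {"product": "milk"}},
    tail_model=lambda q: {"product": q},  # placeholder for the fine-tuned model
)

student_f1 = round(f1(0.964, 0.950), 3)   # fine-tuned student
frontier_f1 = round(f1(0.954, 0.962), 3)  # frontier baseline
```

The recomputed values round to 0.957 and 0.958, matching the 95.7% versus 95.8% F1 in the entry: the student trades a little recall for precision at essentially the same F1.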
2025-08-27 — Simplifying Large-Scale LLM Processing across Instacart with Maple → sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — Instacart Engineering post on Maple, the internal batch-LLM-processing service consolidating every team's batch workflows into one. CSV/Parquet in, CSV/Parquet out RPC hiding the provider's 50K-prompt / 200 MB / 24 h batch API. Technology stack: Python + PyArrow + orjson + Temporal for durable execution + S3-Parquet intermediate storage. Architectural layers: Maple on top of the internal AI Gateway on top of the external LLM provider; the AI Gateway integrates with Cost Tracker for per-team attribution. Production numbers from a ~580-batch / 40–50K-tasks-per-batch sample: mean 2.6 prompts/sec/batch; most batches complete in < 12 h (SLA 24 h); scale to 10M+ prompt jobs; ~50% cost reduction vs real-time; "hundreds of thousands of dollars per year to just thousands" on specific processes. Four-class failure taxonomy (expired +
rate-limited = infinite retry; refused = max 2×; invalid-image = optional with image-check on retry #2 only — "checking each image in a large batch can add significant overhead"). Real-time fallback for providers without batch APIs, behind the same CSV interface — platform hides provider-capability heterogeneity. Three scale optimisations forced by growth to 10M+ prompts: (1) DB → S3 Parquet intermediate storage (25× compression), (2) stream-based processing, (3) orjson for JSON parsing. Adoption: Catalog, Fulfillment, Search, ML-training teams each with distinct workloads. Canonical patterns/llm-batch-processing-service + patterns/batch-then-real-time-fallback + patterns/infinite-retry-by-failure-class + patterns/csv-in-parquet-intermediate-output-merge. Same "stop every team from DIY'ing this" architectural stance as PIXEL + PARSE, at the batch-inference layer. Third Instacart source on the wiki.
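Two mechanics from the 2025-08-27 entry, splitting a large job under the provider's per-batch limits and retrying by failure class, can be sketched in a few lines. The function names, the failure-class keys, and the reading of "200 MB" as 200 × 1024 × 1024 bytes of UTF-8 prompt text are assumptions; the invalid-image class is left out because the entry describes it as optional without giving a retry count.

```python
from typing import Dict, List, Optional

MAX_PROMPTS = 50_000              # per-batch prompt limit named in the entry
MAX_BYTES = 200 * 1024 * 1024     # assumption: "200 MB" counted as MiB of UTF-8

# None means retry without limit; integers cap the number of retries.
RETRY_LIMITS: Dict[str, Optional[int]] = {
    "expired": None,
    "rate_limited": None,
    "refused": 2,
}


def split_into_batches(prompts: List[str],
                       max_prompts: int = MAX_PROMPTS,
                       max_bytes: int = MAX_BYTES) -> List[List[str]]:
    """Greedily pack prompts into batches that respect both limits."""
    batches: List[List[str]] = []
    current: List[str] = []
    size = 0
    for prompt in prompts:
        n = len(prompt.encode("utf-8"))
        if current and (len(current) >= max_prompts or size + n > max_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(prompt)
        size += n
    if current:
        batches.append(current)
    return batches


def should_retry(failure_class: str, retries_so_far: int) -> bool:
    """Apply the per-class retry rule; unknown classes raise KeyError."""
    limit = RETRY_LIMITS[failure_class]
    return limit is None or retries_so_far < limit
```

A 120,000-prompt job of short prompts splits into batches of 50,000, 50,000, and 20,000; a refused task stops being retried after its second retry, while expired and rate-limited tasks are always retried.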
2025-08-01 — Scaling Catalog Attribute Extraction with Multi-modal LLMs (PARSE) → sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — Instacart Engineering post announcing PARSE (Product Attribute Recognition System for E-commerce), the internal self-serve multi-modal LLM platform for structured attribute extraction across the catalog. Four components (declarative + versioned Platform UI → ML extraction endpoint with self-verification confidence score → quality screening with dev/prod HITL loops → catalog ingestion). Three reusable architectural ideas surfaced: (1) multi-modal reasoning closes the text-only blind spot — sheet_count recall +10% over text-only LLM, with two archetypal examples: 80-sheets-on-packaging (image-only signal) and "3 boxes of 124 tissues" (text-only but needs multiplication); (2) per-attribute prompt-tuning effort + LLM size are load-bearing — organic 1 day / 95% accuracy first prompt, low-sugar 3 days; cheap LLM gives -70% cost at equivalent quality on simple attributes but -60% accuracy on hard ones, motivating per-attribute model choice; (3) future cost reduction comes from prompt batching (multi-attribute or multi-product) + extraction cache keyed by a product-similarity function. Same architectural DNA as sibling PIXEL: self-serve UX, model-agnostic platform stance, LLM-as-judge in the evaluation loop — applied to structured extraction instead of image generation. Second Instacart source on the wiki.
2025-06-11 — Turbocharging Customer Support Chatbot Development with LLM-Based Automated Evaluation → sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — Instacart Engineering post (authors: Lily Sierra, Nour Alkhatib, Steven Gross, Jacquelene Obeid, Kyle Swint, Monta Shen, Gary Song, Riddhima Sejpal, Jatin Jain, Shishir Kumar Prasad, Ayesha Saleem) introducing LACE (LLM-Assisted Chatbot Evaluation), the internal offline-evaluation framework scoring every evaluated customer-support chat session against a binary True/False rubric across five dimensions (Query Understanding / Answer Correctness / Chat Efficiency / Client Satisfaction / Compliance). Each criterion emits True/False plus a free-form rationale — rationale retention enables targeted criterion-prompt refinement. Three evaluation engines benchmarked: direct prompting (one-pass LLM scoring, baseline), agentic reflection (initial score → self-reflection pass, citing Madaan et al. 2022 + Jang 2023 + Madaan et al. 2024), agentic debate (Customer-Agent critical + Support-Agent defensive, parallel with no cross-talk + Judge-Agent synthesises, citing Du et al. 2023 arXiv:2305.14325). Debate wins — "near-perfect accuracy" on simple Compliance criteria (tone / politeness / professionalism), >90% accuracy on context-dependent criteria (canonical named example: "my card was declined. My payment method is [a digital wallet]" — the chatbot must know Instacart shoppers use company-authorized cards, not the customer's payment method). Business-model knowledge currently embedded in a static template in the judge prompt; dynamic RAG-retrieval named explicitly as future work. Subjective criteria (e.g. answer conciseness under Chat Efficiency) intentionally de-prioritised — "a low-ROI path" to refine ambiguous criteria; retained only as directional regression check; effort spent fixing the chatbot rather than the judge on these axes. 
Load-bearing implementation decision #1: decouple free-form reasoning from structured-output formatting — strong reasoner (o1-preview at writing time, "our best-performing option... but lacked consistent JSON formatting capabilities") emits free-form rationale, cheaper LLM or rule-based parser converts to JSON; motivated by Tam et al. 2024 arXiv:2408.02442 on restricted-decoding quality loss. Load-bearing implementation decision #2: human-LACE alignment loop bootstraps the rubric + regression-tests every update — humans rate a curated chat set on the same rubric, misalignments drive criteria-prompt refinement (primary lever, frequent) or criteria-structure redesign (rare); same structural shape as Dropbox Dash's humans-calibrate-the-judge pattern applied to the rubric-refinement object rather than the training-data-labelling object. Production deployment: stratified sampling based on topic distribution → trend dashboards + drill-down interaction analysis + direct integration with Instacart's experimentation platform so LACE verdicts feed chatbot A/B tests in real time. Prompt authoring: evaluator prompts in Markdown using the CO-STAR framework; prompt formatting cited as a measured first-order quality lever (Chen et al. 2024 arXiv:2411.10541). Seventh Instacart source on the wiki — first chatbot-evaluation infrastructure source at Instacart, extending the "stop every team from DIY'ing this" platform-consolidation thesis from PIXEL (image gen) / PARSE (structured extraction) / Maple (batch LLM) / Intent Engine (query understanding) / Caper (mobile UI migration) / Capsight (ML-training data) into the LLM-quality-measurement axis. Parallel-play sibling to Lyft's LLM-as-judge localization (2026-02-19) and Zalando's search AI-as-judge (2026-03-16) — three Tier-2/3 companies shipping essentially the same LLM-as-judge → dashboard → experimentation-loop architecture to different customer-facing surfaces at the same time.
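Implementation decision #1 can be illustrated with the rule-based variant of the second stage: the reasoner writes unconstrained text, and a separate formatter turns it into JSON. The sketch below assumes the reasoner can be prompted to write one "Dimension: True/False" line per criterion; that line convention, the `to_json` helper, and the sample text are hypothetical and not taken from the post.

```python
import json
import re

DIMENSIONS = ["Query Understanding", "Answer Correctness", "Chat Efficiency",
              "Client Satisfaction", "Compliance"]

def to_json(free_form: str) -> str:
    """Rule-based second stage: pull per-dimension verdicts out of free text."""
    results = {}
    for dim in DIMENSIONS:
        # Match "<dimension>: True|False", skip separator punctuation,
        # and keep the rest of that line as the rationale.
        pattern = rf"{re.escape(dim)}\s*:\s*(True|False)\b[ \t\-.,:]*(.*)"
        m = re.search(pattern, free_form, re.IGNORECASE)
        if m:
            results[dim] = {"passed": m.group(1).lower() == "true",
                            "rationale": m.group(2).strip()}
    return json.dumps(results, sort_keys=True)

# Placeholder reasoner output, written for this example.
raw = """I read the whole session carefully.
Answer Correctness: False - the reply ignores how shopper payment works.
Compliance: True. Tone stayed professional."""

parsed = json.loads(to_json(raw))
```

Dimensions the reasoner did not mention are simply absent from the output, so a caller can detect and retry incomplete evaluations instead of receiving a guessed value.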
Canonical wiki instances of patterns/multi-agent-debate-evaluation + patterns/self-reflection-llm-evaluation + patterns/human-aligned-criteria-refinement-loop + concepts/binary-vs-graded-llm-scoring + concepts/decouple-reasoning-from-structured-output + concepts/llm-evaluation-dimensions + concepts/human-llm-evaluation-alignment + concepts/stratified-topic-sampling + concepts/co-star-prompt-framework. Caveat: captured body is truncated before the in-post chat-session illustrative example; quantitative per-engine win-rate matrix is not published.
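The stratified-topic-sampling concept listed above can be sketched as proportional per-topic quotas. This is one plausible reading of "stratified sampling based on topic distribution"; the session schema (a `topic` field), the minimum quota of one per topic, and the fixed seed are assumptions for the example.

```python
import random
from collections import defaultdict

def stratified_sample(sessions, n, seed=0):
    """Sample roughly n sessions, with each topic's quota proportional to its share."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_topic = defaultdict(list)
    for session in sessions:
        by_topic[session["topic"]].append(session)
    total = len(sessions)
    sample = []
    for topic, group in sorted(by_topic.items()):
        # At least one session per topic so rare topics stay visible.
        quota = max(1, round(n * len(group) / total))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample

# Hypothetical traffic mix: 80 refund chats, 20 delivery chats.
sessions = ([{"id": i, "topic": "refund"} for i in range(80)] +
            [{"id": 80 + i, "topic": "delivery"} for i in range(20)])
picked = stratified_sample(sessions, n=10)
counts = {t: sum(1 for s in picked if s["topic"] == t) for t in ("refund", "delivery")}
```

Because quotas are rounded per topic and floored at one, the returned sample can be slightly larger or smaller than `n` when there are many small topics.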
2025-07-17 — Introducing PIXEL: Instacart's Unified Image Generation Platform → sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform — Instacart Engineering announcement post on PIXEL, their internal unified image-generation platform. Five architectural components (unified parameter protocol across models + prompt-template + few-shot library + DreamBooth fine-tunes on Stable Diffusion for product-specific categories + automated VLM-based quality evaluation in a 4-step iterative refinement loop + RPC service on existing Instacart infra with S3 storage + Snowflake-addressable image URLs). Reported headline numbers: 10× team time-to-image reduction; 20% → 85% human-judge approval rate after the VLM loop shipped; >25% reduction in Butcher Cuts navigation + add-to-cart time; 15% uplift in Lifestyle Imagery personalised-carousel cart conversion. Canonical design argument: "the best performing model varied project by project" — so PIXEL optimises for cheap cross-model A/B testing via the unified parameter protocol rather than standardising on one model. Key contributor: Shishir Kumar Prasad. First Instacart source on the wiki.
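The VLM-in-the-loop idea from the PIXEL entry can be sketched as a generate, judge, revise cycle. This entry does not spell out the four steps of PIXEL's loop, so the structure below is an assumption: the `Generate` and `Judge` callables, the prompt-revision rule, the round cap of 3, and both stubs are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Generate = Callable[[str], str]                  # prompt -> image reference
Judge = Callable[[str, str], Tuple[bool, str]]   # (prompt, image) -> (approved, feedback)

def refine_until_approved(prompt: str, generate: Generate, judge: Judge,
                          max_rounds: int = 3):
    """Regenerate with judge feedback folded into the prompt until approved."""
    history: List[Tuple[int, str, bool]] = []
    for round_no in range(1, max_rounds + 1):
        image = generate(prompt)
        approved, feedback = judge(prompt, image)
        history.append((round_no, prompt, approved))
        if approved:
            return image, history
        # Assumed revision rule: append the judge's feedback to the prompt.
        prompt = f"{prompt}. Fix: {feedback}"
    return None, history  # cap reached without approval

def stub_generate(prompt: str) -> str:
    """Placeholder generator: returns a fake image reference."""
    return f"img:{len(prompt)}"

def stub_judge(prompt: str, image: str) -> Tuple[bool, str]:
    """Placeholder VLM judge: approves once the prompt asks for a plain background."""
    if "plain background" in prompt:
        return True, ""
    return False, "use a plain background"

image, history = refine_until_approved("ribeye steak", stub_generate, stub_judge)
```

Because the generator is passed in as a plain callable, the same loop can be run against different models, which is in the spirit of the entry's cross-model A/B testing argument.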