From Print to Digital: Making Weekly Flyers Shoppable at Instacart Through Computer Vision and LLMs¶
Summary¶
Instacart Engineering describes the flyer digitization pipeline that converts retailer-supplied grocery flyer images into interactive, shoppable experiences on the Instacart platform. The 2024-launched feature initially depended on a manual workflow — a human drew bounding boxes around each deal on the flyer and matched each to the Instacart catalog — which took 3–4 hours per flyer and did not scale as dozens of retailers adopted weekly uploads. The engineering team replaced that workflow with an automated two-phase pipeline: Phase 1: Image Segmentation (extract a bounding box around every product/deal on the flyer) and Phase 2: Product Identification (match each segmented box to a concrete catalog product). End-to-end runtime after the rewrite is <30 minutes per flyer. The post focuses on Phase 1 — the post-body captured on the wiki ends after the segmentation write-up; the Phase 2 OCR + LLM + search details are stated as pipeline components but not elaborated in the captured body. Phase 1's design rejects three naive alternatives (off-the-shelf food-specific segmentation à la FoodSAM, pure multimodal-LLM bounding-box prediction on complex flyers, classical contour detection) and instead layers four post-processing techniques on top of Meta's Segment Anything Model (SAM): text-box removal, Weighted Boxes Fusion (WBF) to merge overlapping detections, model ensembling (segmentation + contour detection) gated on per-retailer flyer density, and heuristic + ML filtering on aspect ratio + size to reject noise. For simple flyers specifically the team does use a multimodal-LLM iterative-coordinate-probing technique and reports ~90% accuracy at that tier. The architectural thesis is that retail-flyer digitization is not a single-model problem — the range of layouts across retailers forces a hybrid pipeline where model choice is a function of flyer complexity.
Key takeaways¶
- The before-state number makes the case. Manual digitization cost 3–4 hours per flyer per retailer; at dozens of retailers uploading weekly, the team was facing "a mounting workload of hundreds of hours each week" and the manual SLA forced retailers to submit flyers "well ahead of time so we could process them before the deals went live" — i.e. the manual process also constrained retailer flexibility, not just Instacart cost. The rewrite cut end-to-end time to <30 minutes per flyer. (Source: this article)
- Two-phase pipeline, not end-to-end. Phase 1 is pure computer-vision (segment the flyer into product-bounding-boxes), Phase 2 is product-identification (OCR + LLM + internal search to match each box to a catalog product). The team explicitly separates localization from identification rather than trying to solve both jointly with a single multimodal model. (Source: this article, patterns/hybrid-cv-plus-llm-pipeline)
- Multimodal LLMs alone fail on complex flyers — but work for simple ones. For simple flyers (well-separated products, few boxes), the team iteratively asks a multimodal LLM for box coordinates by drawing a uniform grid on the image and asking for the X/Y of the first box, then subdividing — ~90% accuracy. For complex flyers (overlapping products, decorative text, varying layouts, mixed produce + packaged goods) "multimodal LLMs produce imprecise bounding boxes" and classical contour detection "generated excessive noise, rendering their outputs unusable without extensive post-processing." The explicit lesson: match model to flyer complexity, don't pick one model for both cases. (Source: this article, patterns/complexity-tiered-model-selection, concepts/iterative-coordinate-grid-probing)
- SAM as the foundation, not the whole solution. The team uses Meta's Segment Anything Model as the Phase-1 base detector for complex flyers, but SAM outputs require four post-processing stages before they're usable: (1) text-box removal (strip decorative + promotional text boxes that don't correspond to products), (2) WBF-based box merging, (3) model ensembling with contour detection on dense flyers, (4) heuristic + ML-filter pass on aspect ratio and size. Named failure mode the team rejected: FoodSAM, a food-specific SAM variant — "fell short of addressing the breadth and variety of products featured in retail flyers." (Source: this article, systems/segment-anything-model-sam)
- Weighted Boxes Fusion beats Non-Maximum Suppression when detectors disagree. The team explicitly chose WBF over NMS for merging overlapping detections, because NMS "may discard valuable information by eliminating lower-confidence boxes" whereas WBF "combines all overlapping boxes by computing a confidence-weighted average of their coordinates." Cited prior-art number from medical imaging: WBF combined with multiple detectors gave +3–10% mAP over the best single model. In this pipeline, WBF merges nearby boxes that likely represent the same product. (Source: this article, concepts/weighted-boxes-fusion, concepts/non-maximum-suppression)
- Model ensembling is gated on flyer density, per retailer. The team ensembles SAM-style segmentation outputs with contour-detection outputs, but "the decision whether or not to use contour detection models was based on how densely the flyer images were packed. This varied from retailer to retailer." So the ensemble is dynamically configured per-retailer — another instance of matching model choice to input complexity rather than running the max-compute pipeline on everything. (Source: this article, concepts/model-ensembling-for-detection)
- Post-processing uses two kinds of filters: heuristic + ML. False-positive boxes are rejected by (i) heuristic rules on relative size and aspect ratio of bounding boxes — decorative or text-dominated boxes have wrong ratios — and (ii) ML-based filters explicitly trained to distinguish valid product boxes from noise. The hand-written heuristics are not discarded in favour of "just train a model" — both stages are retained. (Source: this article)
- Phase 2 is described only at the shape level in the captured body. The post names Phase 2 as "optical character recognition (OCR), large language models, and our existing search infrastructure" to match each segmented box to catalog products, "even when deals feature multiple items or generic produce." The captured raw body ends before the Phase-2 elaboration; wiki treatment of OCR and product-matching details is therefore light and marked ⚠️ captured body truncated. Mentioned explicitly as Phase-2 challenges: deals with multiple items (one bounding box → N catalog SKUs) and generic produce (no branded text to OCR against). (Source: this article, caveat below)
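The WBF merge described in the takeaways above can be sketched in a few lines. This is a simplified, illustrative implementation — the IoU threshold, the greedy cluster-matching strategy, and the score averaging are assumptions (the reference WBF algorithm also normalizes scores across the number of contributing models), not Instacart's production code:

```python
def _iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def _fuse(cluster):
    """Confidence-weighted average of a cluster of (box, score) pairs."""
    total = sum(score for _, score in cluster)
    box = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
    return box, total / len(cluster)

def weighted_boxes_fusion(detections, iou_thr=0.55):
    """Merge overlapping (box, score) detections.

    Unlike NMS, which discards all but the highest-confidence box in an
    overlapping cluster, every box contributes to the fused coordinates,
    weighted by its confidence.
    """
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:
            # compare against the cluster's current fused box
            if _iou(_fuse(cluster)[0], box) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    return [_fuse(c) for c in clusters]
```

Two detectors firing on the same product with slightly offset boxes thus yield one fused box whose corners sit between the two proposals, pulled toward the higher-confidence one — the information NMS would have thrown away.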
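The heuristic half of the post-processing filter can likewise be illustrated in a few lines. All thresholds here are invented for illustration — the post discloses neither the cutoffs nor the features used by the trained ML filter that follows this pass:

```python
def plausible_product_box(box, image_w, image_h,
                          min_area_frac=0.002, max_area_frac=0.5,
                          max_aspect=4.0):
    """Heuristic pass: reject boxes whose size or shape is implausible.

    Thresholds are illustrative guesses. Tiny boxes are usually
    segmentation noise, near-full-page boxes are banners, and very
    elongated boxes tend to be decorative or promotional text strips
    rather than products.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        return False
    area_frac = (w * h) / (image_w * image_h)
    if not (min_area_frac <= area_frac <= max_area_frac):
        return False
    return max(w / h, h / w) <= max_aspect

# e.g. on a 1200x1600 flyer: a product-sized tile passes, a full-width
# text strip and a speck of segmentation noise do not.
candidates = [(10, 10, 210, 260), (0, 0, 1200, 40), (5, 5, 8, 8)]
kept = [b for b in candidates if plausible_product_box(b, 1200, 1600)]
```

Boxes that survive this cheap pass would then go to the ML-based filter the post mentions; keeping both stages means the trained model only has to handle the ambiguous middle ground.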
Systems / concepts / patterns extracted¶
Systems:
- systems/instacart-flyer-digitization-pipeline — the two-phase flyer-to-shoppable pipeline itself. New wiki page. Before/after numbers: 3–4 h → <30 min per flyer.
- Meta's Segment Anything Model (SAM) — the foundation detector for Phase 1 on complex flyers. New wiki page. Notable: FoodSAM (the food-specific SAM variant) was explicitly rejected as insufficient for retail-flyer diversity.
Concepts:
- Weighted Boxes Fusion (WBF) — detector-fusion technique; replaces NMS when merging overlapping bounding boxes from multiple detectors. New wiki page.
- Non-Maximum Suppression (NMS) — the classical detector-de-duplication baseline WBF explicitly improves on. New wiki page.
- concepts/model-ensembling-for-detection — combining outputs from multiple detection models (segmentation + contour) to cover different feature regimes. New wiki page.
- concepts/iterative-coordinate-grid-probing — the uniform-grid-then-subdivide technique for extracting bounding-box coordinates from a multimodal LLM on simple images. New wiki page.
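The grid-probing concept lends itself to a short sketch: overlay a uniform grid on the image, ask the multimodal LLM which cell contains the target coordinate, then subdivide that cell and repeat. Here `ask_cell` stands in for the real LLM round-trip (rendering the grid onto the image and parsing the model's answer); the callable interface, grid size, and stopping threshold are all assumptions, not the post's API:

```python
def probe_point(ask_cell, x0, y0, x1, y1, grid=4, min_size=8.0):
    """Narrow the window (x0, y0, x1, y1) down to a single point.

    ask_cell(x0, y0, x1, y1, grid) -> (row, col): the grid cell the
    multimodal LLM says contains the target corner. Each round shrinks
    the window by a factor of `grid` per dimension, so the query cost
    is logarithmic in image size (4 calls take a 1024-px window down
    to a 4-px cell with a 4x4 grid).
    """
    while (x1 - x0) > min_size or (y1 - y0) > min_size:
        row, col = ask_cell(x0, y0, x1, y1, grid)
        cw, ch = (x1 - x0) / grid, (y1 - y0) / grid
        x0, y0 = x0 + col * cw, y0 + row * ch
        x1, y1 = x0 + cw, y0 + ch
    return (x0 + x1) / 2, (y0 + y1) / 2  # centre of the final cell
```

Run once per box corner, this trades a handful of cheap LLM queries for the precise pixel coordinates the model cannot emit directly — which is plausible on simple flyers but would multiply quickly on a dense layout, consistent with the post's complexity-tiered routing.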
Patterns:
- patterns/hybrid-cv-plus-llm-pipeline — decompose localization + identification into separate phases; classical/CV primitives for the first, LLM/OCR/search for the second. Canonical wiki instance. New wiki page.
- patterns/complexity-tiered-model-selection — route each input to a different model stack based on estimated complexity (simple → cheap LLM probing; complex → SAM + post-processing + optional ensemble). New wiki page.
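A minimal sketch of the routing pattern, using an invented density proxy and invented thresholds — the post does not disclose how flyer complexity is estimated, only that the decision varies per retailer:

```python
def route_flyer(candidate_box_count, simple_max=12, dense_min=60):
    """Pick a pipeline tier from a rough count of candidate regions.

    The three tiers mirror the post's regimes; the density proxy
    (a candidate-region count) and both thresholds are illustrative
    assumptions, not Instacart's actual gating logic.
    """
    if candidate_box_count <= simple_max:
        return "llm_grid_probing"           # simple, well-separated flyers
    if candidate_box_count >= dense_min:
        return "sam_plus_contour_ensemble"  # densely packed layouts
    return "sam_with_postprocessing"        # the default complex-flyer tier
```

In a per-retailer setup the thresholds themselves could be part of the retailer's configuration, which matches the post's note that the contour-detection decision "varied from retailer to retailer."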
Architectural numbers¶
- Manual SLA: 3–4 hours per flyer, growing to "hundreds of hours each week" as retailers onboard.
- Automated SLA: "less than 30 minutes once a flyer is uploaded."
- Simple-flyer LLM accuracy: "~90% accuracy" on box extraction using the iterative-grid multimodal-LLM technique.
- WBF prior-art reference: "in medical imaging, combining outputs from multiple detectors using WBF has led to an increase in mean Average Precision (mAP) by approximately 3–10% over the best single model."
- No per-phase latency, per-retailer accuracy, or false-positive/false-negative rates disclosed for the production complex-flyer pipeline.
Caveats¶
- ⚠️ Captured body is truncated. The raw markdown on file ends after the Phase 1 sub-section "Filtering with Heuristics and Machine Learning". Phase 2 (OCR + LLM + catalog-search matching) is named in the pipeline overview but not elaborated in the captured body. The published Medium article likely contains additional Phase-2 detail not reflected on this page.
- No p50 / p99 latency numbers disclosed. The post gives only an aggregate end-to-end "<30 minutes" from upload to interactive. No per-stage breakdown (segmentation latency vs. matching latency vs. human-review overhead, if any) is disclosed.
- Retailer-specific adaptation is mentioned but not enumerated. "The decision whether or not to use contour detection models was based on how densely the flyer images were packed. This varied from retailer to retailer" — so there's per-retailer configuration, but the config schema, the density threshold, and the set of retailers onboarded are not disclosed.
- No accuracy / precision / recall numbers for the complex-flyer production pipeline. Only the simple-flyer LLM tier has a disclosed accuracy (~90%). For the SAM-based complex pipeline the post speaks qualitatively ("accurately extract most of our targeted bounding boxes").
- No human-in-the-loop signal disclosed. The post does not state whether the automated pipeline routes low-confidence outputs to a human reviewer (as Instacart's sibling PARSE system does via low-confidence-to-HITL routing), or ships every flyer fully-automated. Given the sibling-system pattern at Instacart, a HITL fallback is plausible but unconfirmed.
- Authors not disclosed in captured body. The post is by Prithvi Srinivasan per the inline Medium byline reference; no co-authors or team affiliation inside Instacart are stated in the captured body.
Source¶
- Original: https://tech.instacart.com/from-print-to-digital-making-weekly-flyers-shoppable-at-instacart-through-computer-vision-and-llms-739cae1f5629?source=rss----587883b5d2ee---4
- Raw markdown:
raw/instacart/2026-02-09-from-print-to-digital-making-weekly-flyers-shoppable-at-inst-b4183b35.md
Related¶
- systems/instacart-flyer-digitization-pipeline — canonical system this source describes
- systems/segment-anything-model-sam — the base detector
- concepts/weighted-boxes-fusion — the detector-merge technique
- concepts/non-maximum-suppression — the technique WBF explicitly improves on
- concepts/model-ensembling-for-detection — the per-retailer ensemble gating
- concepts/iterative-coordinate-grid-probing — the simple-flyer LLM coord technique
- patterns/hybrid-cv-plus-llm-pipeline — the two-phase architecture
- patterns/complexity-tiered-model-selection — route by input complexity
- concepts/vlm-as-image-judge — sibling Instacart usage of VLMs in a pipeline (quality gate, not detector)
- concepts/multi-modal-attribute-extraction — sibling Instacart usage of multimodal LLMs (catalog attributes, not layout)
- companies/instacart