Instacart Flyer Digitization Pipeline¶
Definition¶
Instacart's Flyer Digitization Pipeline is the internal computer-vision + LLM system that converts retailer-supplied weekly grocery flyer images (the printed-advertisement-style promotional spread that used to ship in newspapers) into shoppable, interactive, tap-to-add-to-cart product tiles on the Instacart app. The pipeline ingests a flyer image and outputs, for each promotional deal on the flyer, (box, catalog_product) pairs that the Instacart frontend renders as clickable overlays on the original flyer layout.
The pipeline replaces a manual digitization workflow (human draws bounding boxes around every deal + human matches each box to a catalog product) that cost 3–4 hours per flyer and did not scale beyond a handful of retailers. Post-automation, end-to-end runtime is under 30 minutes per flyer. (Source: sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable)
Architecture¶
Two-phase pipeline¶
The pipeline is explicitly two phases, decomposed so that "where is the product on the page?" is solved independently of "which product is it?".
flyer image upload
│
▼
Phase 1: Image Segmentation
├─ simple-flyer detector: multimodal-LLM iterative-grid probing
└─ complex-flyer detector:
Meta SAM
├→ text-box removal
├→ Weighted Boxes Fusion (merge overlapping boxes)
├→ model ensemble (SAM + contour detection, gated per retailer)
└→ heuristic + ML filtering (aspect ratio, size, noise)
│
▼
Phase 2: Product Identification
├─ OCR on box contents
├─ LLM reasoning over extracted text + image
└─ internal catalog search
│
▼
(box, catalog_product) pairs → rendered as interactive overlays
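The two-phase decomposition can be sketched as a thin orchestrator. Everything here (the `Box` type, the injected phase functions, their signatures) is illustrative, not Instacart's actual interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in flyer pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

def digitize_flyer(
    image,
    segment: Callable,   # Phase 1: image -> list[Box]
    identify: Callable,  # Phase 2: (image, Box) -> catalog product id
) -> list:
    """Localization first, identification second. The two phases only
    communicate through the list of boxes, so each stack can be swapped
    or improved independently of the other."""
    return [(box, identify(image, box)) for box in segment(image)]
```

The dependency-injected shape is the point: the same orchestrator runs whether Phase 1 is the cheap VLM-probing detector or the SAM-based stack.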
Phase 1 — Image Segmentation¶
Phase 1 outputs a bounding box around every product / deal on the flyer. The team evaluated three naive approaches and rejected all three:
- Off-the-shelf food-specific segmentation (FoodSAM), a food-aware SAM variant. Rejected: "fell short of addressing the breadth and variety of products featured in retail flyers" — retail flyers include branded packaged goods, household products, and fresh produce, not just food.
- Pure multimodal LLMs (VLMs) for the full flyer. Rejected for complex flyers: "multimodal LLMs produce imprecise bounding boxes." But kept for simple flyers (see complexity-tiered selection below).
- Traditional segmentation / contour detection standalone. Rejected: "generated excessive noise, rendering their outputs unusable without extensive post-processing."
The shipped architecture is a hybrid on two axes:
Axis 1 — complexity-tiered model selection¶
Flyer complexity determines which detector stack runs (see patterns/complexity-tiered-model-selection).
- Simple flyers (well-separated boxes, few in number): use iterative-grid multimodal-LLM probing. Draw uniform grid lines on the image, ask the VLM "where does the first box begin (X/Y)?", then subdivide the identified region and recursively ask for the starting and ending coordinates of each segmentation box. Reported accuracy: ~90%.
- Complex flyers (overlapping products, decorative text, varying layouts): use the SAM-based post-processed stack described below.
The tier-assignment signal itself is not disclosed in the captured body.
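The iterative-grid refinement loop can be sketched as follows. The `ask_vlm` callable, the grid granularity, and the round count are assumptions for illustration; the post does not disclose the actual prompt protocol:

```python
def grid_lines(lo: float, hi: float, n: int) -> list:
    """n+1 evenly spaced grid-line coordinates covering [lo, hi]."""
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

def refine_box(box, ask_vlm, rounds: int = 2, n: int = 4):
    """Iteratively tighten a coarse box: overlay an n-by-n grid on the
    current region, ask the VLM which grid lines the box's corners sit
    on, and shrink the region to those lines before the next round.

    ask_vlm(region, xs, ys) -> (i1, j1, i2, j2): grid-line *indices*
    for the top-left and bottom-right corners (hypothetical protocol).
    """
    x1, y1, x2, y2 = box
    for _ in range(rounds):
        xs = grid_lines(x1, x2, n)
        ys = grid_lines(y1, y2, n)
        i1, j1, i2, j2 = ask_vlm((x1, y1, x2, y2), xs, ys)
        x1, y1, x2, y2 = xs[i1], ys[j1], xs[i2], ys[j2]
    return (x1, y1, x2, y2)
```

Having the VLM answer in grid-line indices rather than raw pixel coordinates is the trick that sidesteps "multimodal LLMs produce imprecise bounding boxes": each round only asks for a choice among a handful of drawn lines.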
Axis 2 — the SAM-based complex-flyer stack¶
Built on Meta's Segment Anything Model (SAM) as the base detector; four post-processing stages convert SAM's raw output into usable product boxes:
- Text-box removal. Strip detected boxes that correspond to decorative elements / promotional text (prices, brand banners, savings callouts) rather than products.
- Weighted Boxes Fusion (WBF). Merge overlapping boxes via confidence-weighted coordinate averaging. Chosen over classical Non-Maximum Suppression because NMS "may discard valuable information by eliminating lower-confidence boxes," whereas WBF preserves all overlapping-box information in the merged output.
- Model ensembling. Combine SAM-style segmentation outputs with classical contour-detection outputs. The contour model is gated per retailer based on flyer density: "the decision whether or not to use contour detection models was based on how densely the flyer images were packed. This varied from retailer to retailer."
- Heuristic + ML filtering. Reject false-positive boxes using (a) hand-written heuristics on relative size and aspect ratio of bounding boxes and (b) ML-trained filters specifically classifying valid product boxes vs. noise. Both kinds of filter are retained — the heuristics are not subsumed by the ML filter.
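A minimal WBF sketch follows, simplified from the published algorithm (it clusters greedily by IoU against each cluster's running fused box and skips WBF's final confidence rescaling). Note how, unlike NMS, every overlapping box contributes to the fused coordinates in proportion to its confidence:

```python
def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def fuse(cluster):
    """Confidence-weighted average of all boxes in a cluster."""
    total = sum(s for _, s in cluster)
    coords = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
    return coords, total / len(cluster)

def weighted_boxes_fusion(boxes, scores, iou_thr: float = 0.55):
    """Group overlapping detections and merge each group by weighted
    averaging; no detection is discarded, unlike NMS."""
    clusters = []
    for box, score in sorted(zip(boxes, scores), key=lambda p: -p[1]):
        for cluster in clusters:
            fused_box, _ = fuse(cluster)
            if iou(box, fused_box) > iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    return [fuse(c) for c in clusters]
```

With two near-duplicate detections of one product and one distant detection, NMS would keep two boxes by dropping the lower-confidence duplicate outright; WBF instead returns two boxes whose coordinates blend both duplicates.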
Phase 2 — Product Identification¶
Phase 2 takes each Phase-1 bounding box and matches it to a concrete Instacart catalog SKU. The post names three components:
- Optical Character Recognition (OCR) on the box's pixel content to extract brand names, product names, sizes, prices.
- Large language models reasoning over the OCR text + the box image itself to identify product + attributes.
- Instacart's existing search infrastructure to resolve the LLM's proposed product into a real catalog SKU.
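At the component level, Phase 2 can be sketched as a chain with injected stages. All function names and signatures here are illustrative assumptions; the captured body names the three components but not their interfaces:

```python
from typing import Callable, Optional

def identify_product(
    crop,                      # pixel content of one Phase-1 bounding box
    ocr: Callable,             # crop -> extracted text (brands, sizes, prices)
    llm_identify: Callable,    # (crop, text) -> proposed product query
    catalog_search: Callable,  # query -> catalog SKU, or None if no match
) -> Optional[str]:
    """OCR the box, let the LLM reason over the extracted text plus the
    image itself, then resolve the proposal into a real catalog SKU via
    the existing search infrastructure."""
    text = ocr(crop)  # may be empty for generic produce (bananas, avocados)
    query = llm_identify(crop, text)
    return catalog_search(query)
```

Passing the crop to `llm_identify` alongside the OCR text matters for the generic-produce case below: with no branded text to extract, identification falls back to image-level reasoning alone.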
Stated Phase-2 challenges:
- Multi-item deals. A single bounding box may correspond to a deal that bundles multiple catalog SKUs.
- Generic produce. Items like "bananas" or "avocados" have no branded text to OCR and are matched by image-level reasoning against the produce catalog.
⚠️ Phase 2 captured-body truncation. The raw markdown on the wiki ends after Phase 1. The published article very likely contains additional Phase-2 architectural detail (LLM model choice, cache design, confidence thresholds, HITL fallback) that is not reflected here. Re-ingest when a longer capture is available.
Production numbers¶
- Before: 3–4 hours per flyer manual work; "hundreds of hours each week" aggregated across retailers; manual workflow required retailers to submit flyers in advance.
- After: <30 minutes end-to-end per flyer; no retailer-advance-submission requirement disclosed as binding.
- Simple-flyer tier accuracy: ~90% on box extraction.
- Complex-flyer tier accuracy: not disclosed in the captured body beyond qualitative "accurately extract most of our targeted bounding boxes."
Design lessons¶
- Decompose localization from identification. The pipeline does not try to use a single multimodal model to output (box, product) jointly; the two phases are independently improvable and independently failure-domained. This is the canonical instance on the wiki of patterns/hybrid-cv-plus-llm-pipeline.
- Route inputs to model stacks by complexity, not one-size-fits-all. Simple flyers get a cheap VLM-probing pipeline; complex flyers get the expensive SAM + ensemble pipeline. Per-retailer tuning is also a form of this: dense flyers get the contour ensemble, sparse flyers don't. See patterns/complexity-tiered-model-selection.
- Foundation-model output is usable after post-processing, not out-of-the-box. SAM is the base, but four post-processing stages are load-bearing. Teams building on foundation models should budget engineering effort for domain-specific post-processing, not assume the foundation model's direct output is shippable.
- Retain classical techniques alongside ML. Heuristic aspect-ratio + size filters sit next to trained ML filters, and contour detection sits next to SAM. ML doesn't subsume the classical primitives — it augments them where they fail.
- Merge overlapping detections by averaging, not by suppression. WBF over NMS when you have multiple detectors that sometimes disagree: averaging retains low-confidence information that NMS would drop. See concepts/weighted-boxes-fusion + concepts/non-maximum-suppression.
Relationship to other Instacart systems¶
- PIXEL (image generation platform) and PARSE (attribute extraction platform) are sibling Instacart visual-ML systems. The flyer pipeline is the third pillar on the wiki: visual layout extraction from retailer-supplied images. All three systems share the Instacart architectural stance of model-agnostic, multi-stage, LLM-in-the-loop visual pipelines, but this one has no explicit unified-platform framing — the post describes the pipeline, not a general platform.
- concepts/vlm-as-image-judge (PIXEL's quality-gate pattern) and this pipeline's VLM usage are distinct: PIXEL uses a VLM as a judge after generation; this pipeline uses a VLM as a detector for simple-flyer bounding-box coordinates.
- concepts/multi-modal-attribute-extraction (PARSE's text+image reasoning) and this pipeline's Phase-2 LLM usage are architecturally similar — OCR + image reasoning for product identification is a specialisation of PARSE-style multi-modal extraction, but scoped to catalog-matching rather than attribute extraction.
Seen in¶
- sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable — canonical wiki instance. Instacart Engineering's 2026-02-09 post describing the two-phase flyer-digitization pipeline. Phase 1 is fully described; Phase 2 is named only at the component level in the captured body (⚠️ truncation caveat).
Related¶
- systems/segment-anything-model-sam — the Phase-1 base detector
- concepts/weighted-boxes-fusion — detector merging technique used in Phase 1
- concepts/non-maximum-suppression — the classical baseline WBF replaces here
- concepts/model-ensembling-for-detection — segmentation + contour ensemble
- concepts/iterative-coordinate-grid-probing — simple-flyer VLM detector
- patterns/hybrid-cv-plus-llm-pipeline — the canonical pattern this system instantiates
- patterns/complexity-tiered-model-selection — simple vs. complex routing
- systems/instacart-pixel — sibling visual-ML platform (generation side)
- systems/instacart-parse — sibling visual-ML platform (attribute-extraction side)
- companies/instacart