
Instacart Flyer Digitization Pipeline

Definition

Instacart's Flyer Digitization Pipeline is the internal computer-vision + LLM system that converts retailer-supplied weekly grocery flyer images (the printed-advertisement-style promotional spread that used to ship in newspapers) into shoppable, interactive, tap-to-add-to-cart product tiles on the Instacart app. The pipeline ingests a flyer image and outputs, for each promotional deal on the flyer, (box, catalog_product) pairs that the Instacart frontend renders as clickable overlays on the original flyer layout.

The pipeline replaces a manual digitization workflow (human draws bounding boxes around every deal + human matches each box to a catalog product) that cost 3–4 hours per flyer and did not scale beyond a handful of retailers. Post-automation, end-to-end runtime is under 30 minutes per flyer. (Source: sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable)

Architecture

Two-phase pipeline

The pipeline has two explicit phases, decomposed so that "where is the product on the page?" is solved independently of "which product is it?"

flyer image upload
Phase 1: Image Segmentation
  ├─ simple-flyer detector: multimodal-LLM iterative-grid probing
  └─ complex-flyer detector:
        Meta SAM
        ├→ text-box removal
        ├→ Weighted Boxes Fusion (merge overlapping boxes)
        ├→ model ensemble (SAM + contour detection, gated per retailer)
        └→ heuristic + ML filtering (aspect ratio, size, noise)
Phase 2: Product Identification
  ├─ OCR on box contents
  ├─ LLM reasoning over extracted text + image
  └─ internal catalog search
(box, catalog_product) pairs → rendered as interactive overlays
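The two-phase decomposition above can be sketched as a thin orchestrator. All names here are hypothetical stand-ins, not Instacart's actual API; the post does not publish code:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in flyer-image coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

def digitize_flyer(image, is_complex, segment_simple, segment_complex,
                   identify_product):
    """Hypothetical two-phase orchestrator: localize boxes, then match SKUs.

    All callables are injected stand-ins for the components in the
    diagram: a complexity router, the two Phase-1 detector stacks,
    and the Phase-2 product identifier.
    """
    # Phase 1: route by flyer complexity, output bounding boxes only.
    boxes = segment_complex(image) if is_complex(image) else segment_simple(image)
    # Phase 2: independently resolve each box to a catalog product.
    return [(box, identify_product(image, box)) for box in boxes]
```

The payoff of this shape is that either phase can be swapped or improved without touching the other, which is the decomposition the design-lessons section calls out.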

Phase 1 — Image Segmentation

Phase 1 outputs a bounding box around every product / deal on the flyer. The team evaluated three naive approaches and rejected all three:

  • Off-the-shelf food-specific segmentation (FoodSAM, a food-aware SAM variant). Rejected: "fell short of addressing the breadth and variety of products featured in retail flyers" — retail flyers include branded packaged goods, household products, and fresh produce, not just food.
  • Pure multimodal LLMs (VLMs) for the full flyer. Rejected for complex flyers: "multimodal LLMs produce imprecise bounding boxes." But kept for simple flyers (see complexity-tiered selection below).
  • Traditional segmentation / contour detection standalone. Rejected: "generated excessive noise, rendering their outputs unusable without extensive post-processing."

The shipped architecture is a hybrid on two axes:

Axis 1 — complexity-tiered model selection

Flyer complexity determines which detector stack runs (see patterns/complexity-tiered-model-selection).

  • Simple flyers (few, well-separated boxes): use iterative-grid multimodal-LLM probing. Draw uniform grid lines on the image, ask the VLM "where does the first box begin (X/Y)?", then subdivide the identified box and recursively ask for each segmentation box's starting and ending coordinates. Reported accuracy: ~90%.
  • Complex flyers (overlapping products, decorative text, varying layouts): use the SAM-based post-processed stack described below.

The tier-assignment signal itself is not disclosed in the captured body.
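The iterative-grid probing loop for simple flyers can be sketched as follows. The `ask` callable is a hypothetical VLM wrapper (the post names no model), and the coarse-to-fine recursion depth is an assumed parameter:

```python
def probe_boxes(image, ask, region=(0.0, 0.0, 1.0, 1.0), depth=2):
    """Iterative-grid VLM probing (simple-flyer tier), sketched.

    `ask(image, region)` is a hypothetical VLM call: overlay a uniform
    grid on `region` of the image and return coarse (x1, y1, x2, y2)
    boxes read off that grid. Each coarse box is then re-probed with a
    finer grid until `depth` refinement rounds are exhausted.
    """
    coarse = ask(image, region)
    if depth <= 1:
        return coarse
    refined = []
    for box in coarse:
        # Subdivide: overlay a finer grid on the coarse box and ask the
        # VLM again for precise start/end coordinates within it.
        refined.extend(probe_boxes(image, ask, region=box, depth=depth - 1))
    return refined
```

Grid-snapped coordinates sidestep the "multimodal LLMs produce imprecise bounding boxes" failure mode: the model only ever picks grid cells, and precision comes from recursive refinement rather than from the model's raw coordinate output.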

Axis 2 — the SAM-based complex-flyer stack

The complex-flyer stack uses Meta's Segment Anything Model (SAM) as the base detector; four post-processing stages convert SAM's raw output into usable product boxes:

  1. Text-box removal. Strip detected boxes that correspond to decorative elements / promotional text (prices, brand banners, savings callouts) rather than products.
  2. Weighted Boxes Fusion (WBF). Merge overlapping boxes via confidence-weighted coordinate averaging. Chosen over classical Non-Maximum Suppression because NMS "may discard valuable information by eliminating lower-confidence boxes," whereas WBF preserves all overlapping-box information in the merged output.
  3. Model ensembling. Combine SAM-style segmentation outputs with classical contour-detection outputs. The contour model is gated per retailer based on flyer density: "the decision whether or not to use contour detection models was based on how densely the flyer images were packed. This varied from retailer to retailer."
  4. Heuristic + ML filtering. Reject false-positive boxes using (a) hand-written heuristics on relative size and aspect ratio of bounding boxes and (b) ML-trained filters specifically classifying valid product boxes vs. noise. Both kinds of filter are retained — the heuristics are not subsumed by the ML filter.
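Step 2's merge-by-averaging can be sketched as a simplified single-pass variant of Weighted Boxes Fusion (the IoU clustering threshold is an assumed parameter; the post does not give one):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Merge overlapping boxes by confidence-weighted coordinate
    averaging. Unlike NMS, no box is discarded: every member of a
    cluster contributes its coordinates to the fused result.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    clusters = []  # each entry: [member_indices, fused_box]
    for i in order:
        for cluster in clusters:
            if iou(cluster[1], boxes[i]) >= iou_thr:
                cluster[0].append(i)
                # Re-fuse: confidence-weighted mean of each coordinate.
                total = sum(scores[j] for j in cluster[0])
                cluster[1] = [
                    sum(boxes[j][k] * scores[j] for j in cluster[0]) / total
                    for k in range(4)
                ]
                break
        else:
            clusters.append([[i], list(boxes[i])])
    # Return (fused_box, mean_confidence) per cluster.
    return [(c[1], sum(scores[j] for j in c[0]) / len(c[0])) for c in clusters]
```

Contrast with NMS: NMS would keep only the highest-scoring box in each cluster and drop the rest, which is exactly the "may discard valuable information" objection quoted above.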

Phase 2 — Product Identification

Phase 2 takes each Phase-1 bounding box and matches it to a concrete Instacart catalog SKU. The post names three components:

  • Optical Character Recognition (OCR) on the box's pixel content to extract brand names, product names, sizes, prices.
  • Large language models reasoning over the OCR text + the box image itself to identify product + attributes.
  • Instacart's existing search infrastructure to resolve the LLM's proposed product into a real catalog SKU.
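The three components compose into an OCR → LLM → catalog-search chain. Since the post names none of the underlying models or services, this sketch injects all three as callables:

```python
def identify_product(flyer_image, box, ocr, llm_extract, catalog_search):
    """Hypothetical Phase-2 flow: crop the Phase-1 box, OCR its text,
    let an LLM propose the product and attributes from text + pixels,
    then resolve the proposal against the real catalog via search.

    `flyer_image` is assumed to expose a PIL-style `crop(box)`;
    `ocr`, `llm_extract`, and `catalog_search` are stand-ins for the
    three components listed above.
    """
    crop = flyer_image.crop(box)                   # pixel content of the box
    text = ocr(crop)                               # brand, name, size, price
    proposal = llm_extract(text=text, image=crop)  # structured product guess
    hits = catalog_search(proposal)                # existing search infra
    return hits[0] if hits else None               # top SKU, or no match
```

For generic produce (no branded text to OCR), `text` comes back empty and the LLM must lean entirely on the image, which is the second Phase-2 challenge noted below.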

Stated Phase-2 challenges:

  • Multi-item deals. A single bounding box may correspond to a deal that bundles multiple catalog SKUs.
  • Generic produce. Items like "bananas" or "avocados" have no branded text to OCR and are matched by image-level reasoning against the produce catalog.

⚠️ Phase 2 captured-body truncation. The raw markdown on the wiki ends after Phase 1. The published article very likely contains additional Phase-2 architectural detail (LLM model choice, cache design, confidence thresholds, HITL fallback) that is not reflected here. Re-ingest when a longer capture is available.

Production numbers

  • Before: 3–4 hours per flyer manual work; "hundreds of hours each week" aggregated across retailers; manual workflow required retailers to submit flyers in advance.
  • After: <30 minutes end-to-end per flyer; no advance-submission requirement from retailers is disclosed as still binding.
  • Simple-flyer tier accuracy: ~90% on box extraction.
  • Complex-flyer tier accuracy: not disclosed in the captured body beyond qualitative "accurately extract most of our targeted bounding boxes."

Design lessons

  • Decompose localization from identification. The pipeline does not try to use a single multimodal model to output (box, product) jointly — the two phases are independently improvable and independently failure-domained. This is the canonical instance on the wiki of patterns/hybrid-cv-plus-llm-pipeline.
  • Route inputs to model stacks by complexity, not one-size-fits-all. Simple flyers get a cheap VLM-probing pipeline; complex flyers get the expensive SAM + ensemble pipeline. Per-retailer tuning is also a form of this: dense flyers get the contour ensemble, sparse flyers don't. See patterns/complexity-tiered-model-selection.
  • Foundation-model output is usable after post-processing, not out-of-the-box. SAM is the base, but four post-processing stages are load-bearing. Teams building on foundation models should budget engineering effort for domain-specific post-processing, not assume the foundation model's direct output is shippable.
  • Retain classical techniques alongside ML. Heuristic aspect-ratio + size filters sit next to trained ML filters, and contour detection sits next to SAM. ML doesn't subsume the classical primitives — it augments them where they fail.
  • Merge overlapping detections by averaging, not by suppression. WBF over NMS when you have multiple detectors that sometimes disagree: averaging retains low-confidence information that NMS would drop. See concepts/weighted-boxes-fusion + concepts/non-maximum-suppression.

Relationship to other Instacart systems

  • PIXEL (image generation platform) and PARSE (attribute extraction platform) are sibling Instacart visual-ML systems. The flyer pipeline is the third pillar on the wiki: visual layout extraction from retailer-supplied images. All three systems share the Instacart architectural stance of model-agnostic, multi-stage, LLM-in-the-loop visual pipelines, but this one has no explicit unified-platform framing — the post describes the pipeline, not a general platform.
  • concepts/vlm-as-image-judge (PIXEL's quality-gate pattern) and this pipeline's VLM usage are distinct: PIXEL uses a VLM as a judge after generation; this pipeline uses a VLM as a detector for simple-flyer bounding-box coordinates.
  • concepts/multi-modal-attribute-extraction (PARSE's text+image reasoning) and this pipeline's Phase-2 LLM usage are architecturally similar — OCR + image reasoning for product identification is a specialisation of PARSE-style multi-modal extraction, but scoped to catalog-matching rather than attribute extraction.
