
Instacart Flyer Digitization Pipeline

Definition

Instacart's Flyer Digitization Pipeline is the internal computer-vision + LLM system that converts retailer-supplied weekly grocery flyer images (the printed-advertisement-style promotional spread that used to ship in newspapers) into shoppable, interactive, tap-to-add-to-cart product tiles on the Instacart app. The pipeline ingests a flyer image and outputs, for each promotional deal on the flyer, (box, catalog_product) pairs that the Instacart frontend renders as clickable overlays on the original flyer layout.

The pipeline replaces a manual digitization workflow (human draws bounding boxes around every deal + human matches each box to a catalog product) that cost 3–4 hours per flyer and did not scale beyond a handful of retailers. Post-automation, end-to-end runtime is under 30 minutes per flyer. (Source: sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable)

Architecture

Two-phase pipeline

The pipeline has two explicit phases, decomposed so that "where is the product on the page?" is solved independently of "which product is it?"

flyer image upload
Phase 1: Image Segmentation
  ├─ simple-flyer detector: multimodal-LLM iterative-grid probing
  └─ complex-flyer detector:
        Meta SAM
        ├→ text-box removal
        ├→ Weighted Boxes Fusion (merge overlapping boxes)
        ├→ model ensemble (SAM + contour detection, gated per retailer)
        └→ heuristic + ML filtering (aspect ratio, size, noise)
Phase 2: Product Identification
  ├─ OCR on box contents
  ├─ LLM reasoning over extracted text + image
  └─ internal catalog search
(box, catalog_product) pairs → rendered as interactive overlays
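The two-phase decomposition above can be sketched as a thin orchestrator. All names here are hypothetical stand-ins, not Instacart's actual API; the post does not publish code:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in flyer-image coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

def digitize_flyer(image, is_complex, segment_simple, segment_complex,
                   identify_product):
    """Hypothetical two-phase orchestrator: localize boxes, then match SKUs.

    All callables are injected stand-ins for the components in the
    diagram: a complexity router, the two Phase-1 detector stacks,
    and the Phase-2 product identifier.
    """
    # Phase 1: route by flyer complexity, output bounding boxes only.
    boxes = segment_complex(image) if is_complex(image) else segment_simple(image)
    # Phase 2: independently resolve each box to a catalog product.
    return [(box, identify_product(image, box)) for box in boxes]
```

The payoff of this shape is that either phase can be swapped or improved without touching the other, which is the decomposition the design-lessons section calls out.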

Phase 1 — Image Segmentation

Phase 1 outputs a bounding box around every product / deal on the flyer. The team evaluated three naive approaches and rejected all three:

  • Off-the-shelf food-specific segmentation (FoodSAM, a food-aware SAM variant). Rejected: "fell short of addressing the breadth and variety of products featured in retail flyers" — retail flyers include branded packaged goods, household products, and fresh produce, not just food.
  • Pure multimodal LLMs (VLMs) for the full flyer. Rejected for complex flyers: "multimodal LLMs produce imprecise bounding boxes." But kept for simple flyers (see complexity-tiered selection below).
  • Traditional segmentation / contour detection standalone. Rejected: "generated excessive noise, rendering their outputs unusable without extensive post-processing."

The shipped architecture is a hybrid on two axes:

Axis 1 — complexity-tiered model selection

Flyer complexity determines which detector stack runs (see patterns/complexity-tiered-model-selection).

  • Simple flyers (few, well-separated boxes): use iterative-grid multimodal-LLM probing. Draw uniform grid lines on the image, ask the VLM "where does the first box begin (X/Y)?", then subdivide the identified box and recursively ask for each segmentation box's starting and ending coordinates. Reported accuracy: ~90%.
  • Complex flyers (overlapping products, decorative text, varying layouts): use the SAM-based post-processed stack described below.

The tier-assignment signal itself is not disclosed in the captured body.
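The iterative-grid probing loop for simple flyers can be sketched as follows. The `ask` callable is a hypothetical VLM wrapper (the post names no model), and the coarse-to-fine recursion depth is an assumed parameter:

```python
def probe_boxes(image, ask, region=(0.0, 0.0, 1.0, 1.0), depth=2):
    """Iterative-grid VLM probing (simple-flyer tier), sketched.

    `ask(image, region)` is a hypothetical VLM call: overlay a uniform
    grid on `region` of the image and return coarse (x1, y1, x2, y2)
    boxes read off that grid. Each coarse box is then re-probed with a
    finer grid until `depth` refinement rounds are exhausted.
    """
    coarse = ask(image, region)
    if depth <= 1:
        return coarse
    refined = []
    for box in coarse:
        # Subdivide: overlay a finer grid on the coarse box and ask the
        # VLM again for precise start/end coordinates within it.
        refined.extend(probe_boxes(image, ask, region=box, depth=depth - 1))
    return refined
```

Grid-snapped coordinates sidestep the "multimodal LLMs produce imprecise bounding boxes" failure mode: the model only ever picks grid cells, and precision comes from recursive refinement rather than from the model's raw coordinate output.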

Axis 2 — the SAM-based complex-flyer stack

The complex-flyer stack uses Meta's Segment Anything Model (SAM) as the base detector; four post-processing stages convert SAM's raw output into usable product boxes:

  1. Text-box removal. Strip detected boxes that correspond to decorative elements / promotional text (prices, brand banners, savings callouts) rather than products.
  2. Weighted Boxes Fusion (WBF). Merge overlapping boxes via confidence-weighted coordinate averaging. Chosen over classical Non-Maximum Suppression because NMS "may discard valuable information by eliminating lower-confidence boxes," whereas WBF preserves all overlapping-box information in the merged output.
  3. Model ensembling. Combine SAM-style segmentation outputs with classical contour-detection outputs. The contour model is gated per retailer based on flyer density: "the decision whether or not to use contour detection models was based on how densely the flyer images were packed. This varied from retailer to retailer."
  4. Heuristic + ML filtering. Reject false-positive boxes using (a) hand-written heuristics on relative size and aspect ratio of bounding boxes and (b) ML-trained filters specifically classifying valid product boxes vs. noise. Both kinds of filter are retained — the heuristics are not subsumed by the ML filter.
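Step 2's merge-by-averaging can be sketched as a simplified single-pass variant of Weighted Boxes Fusion (the IoU clustering threshold is an assumed parameter; the post does not give one):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Merge overlapping boxes by confidence-weighted coordinate
    averaging. Unlike NMS, no box is discarded: every member of a
    cluster contributes its coordinates to the fused result.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    clusters = []  # each entry: [member_indices, fused_box]
    for i in order:
        for cluster in clusters:
            if iou(cluster[1], boxes[i]) >= iou_thr:
                cluster[0].append(i)
                # Re-fuse: confidence-weighted mean of each coordinate.
                total = sum(scores[j] for j in cluster[0])
                cluster[1] = [
                    sum(boxes[j][k] * scores[j] for j in cluster[0]) / total
                    for k in range(4)
                ]
                break
        else:
            clusters.append([[i], list(boxes[i])])
    # Return (fused_box, mean_confidence) per cluster.
    return [(c[1], sum(scores[j] for j in c[0]) / len(c[0])) for c in clusters]
```

Contrast with NMS: NMS would keep only the highest-scoring box in each cluster and drop the rest, which is exactly the "may discard valuable information" objection quoted above.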

Phase 2 — Product Identification

Phase 2 takes each Phase-1 bounding box and matches it to a concrete Instacart catalog SKU. The post names three components:

  • Optical Character Recognition (OCR) on the box's pixel content to extract brand names, product names, sizes, prices.
  • Large language models reasoning over the OCR text + the box image itself to identify product + attributes.
  • Instacart's existing search infrastructure to resolve the LLM's proposed product into a real catalog SKU.
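The three components compose into an OCR → LLM → catalog-search chain. Since the post names none of the underlying models or services, this sketch injects all three as callables:

```python
def identify_product(flyer_image, box, ocr, llm_extract, catalog_search):
    """Hypothetical Phase-2 flow: crop the Phase-1 box, OCR its text,
    let an LLM propose the product and attributes from text + pixels,
    then resolve the proposal against the real catalog via search.

    `flyer_image` is assumed to expose a PIL-style `crop(box)`;
    `ocr`, `llm_extract`, and `catalog_search` are stand-ins for the
    three components listed above.
    """
    crop = flyer_image.crop(box)                   # pixel content of the box
    text = ocr(crop)                               # brand, name, size, price
    proposal = llm_extract(text=text, image=crop)  # structured product guess
    hits = catalog_search(proposal)                # existing search infra
    return hits[0] if hits else None               # top SKU, or no match
```

For generic produce (no branded text to OCR), `text` comes back empty and the LLM must lean entirely on the image, which is the second Phase-2 challenge noted below.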

Stated Phase-2 challenges:

  • Multi-item deals. A single bounding box may correspond to a deal that bundles multiple catalog SKUs.
  • Generic produce. Items like "bananas" or "avocados" have no branded text to OCR and are matched by image-level reasoning against the produce catalog.

⚠️ Phase 2 captured-body truncation. The raw markdown on the wiki ends after Phase 1. The published article very likely contains additional Phase-2 architectural detail (LLM model choice, cache design, confidence thresholds, HITL fallback) that is not reflected here. Re-ingest when a longer capture is available.

Production numbers

  • Before: 3–4 hours per flyer manual work; "hundreds of hours each week" aggregated across retailers; manual workflow required retailers to submit flyers in advance.
  • After: <30 minutes end-to-end per flyer; no advance-submission requirement from retailers is disclosed as still binding.
  • Simple-flyer tier accuracy: ~90% on box extraction.
  • Complex-flyer tier accuracy: not disclosed in the captured body beyond qualitative "accurately extract most of our targeted bounding boxes."

Design lessons

  • Decompose localization from identification. The pipeline does not try to use a single multimodal model to output (box, product) jointly — the two phases are independently improvable and independently failure-domained. This is the canonical instance on the wiki of patterns/hybrid-cv-plus-llm-pipeline.
  • Route inputs to model stacks by complexity, not one-size-fits-all. Simple flyers get a cheap VLM-probing pipeline; complex flyers get the expensive SAM + ensemble pipeline. Per-retailer tuning is also a form of this: dense flyers get the contour ensemble, sparse flyers don't. See patterns/complexity-tiered-model-selection.
  • Foundation-model output is usable after post-processing, not out-of-the-box. SAM is the base, but four post-processing stages are load-bearing. Teams building on foundation models should budget engineering effort for domain-specific post-processing, not assume the foundation model's direct output is shippable.
  • Retain classical techniques alongside ML. Heuristic aspect-ratio + size filters sit next to trained ML filters, and contour detection sits next to SAM. ML doesn't subsume the classical primitives — it augments them where they fail.
  • Merge overlapping detections by averaging, not by suppression. WBF over NMS when you have multiple detectors that sometimes disagree: averaging retains low-confidence information that NMS would drop. See concepts/weighted-boxes-fusion + concepts/non-maximum-suppression.

Relationship to other Instacart systems

  • PIXEL (image generation platform) and PARSE (attribute extraction platform) are sibling Instacart visual-ML systems. The flyer pipeline is the third pillar on the wiki: visual layout extraction from retailer-supplied images. All three systems share the Instacart architectural stance of model-agnostic, multi-stage, LLM-in-the-loop visual pipelines, but this one has no explicit unified-platform framing — the post describes the pipeline, not a general platform.
  • concepts/vlm-as-image-judge (PIXEL's quality-gate pattern) and this pipeline's VLM usage are distinct: PIXEL uses a VLM as a judge after generation; this pipeline uses a VLM as a detector for simple-flyer bounding-box coordinates.
  • concepts/multi-modal-attribute-extraction (PARSE's text+image reasoning) and this pipeline's Phase-2 LLM usage are architecturally similar — OCR + image reasoning for product identification is a specialisation of PARSE-style multi-modal extraction, but scoped to catalog-matching rather than attribute extraction.
