PATTERN

Hybrid CV + LLM Pipeline

Definition

A hybrid computer-vision + LLM pipeline decomposes a visual understanding task into two (or more) phases: an early classical-CV / purpose-trained-model phase for geometric / localization work, and a later LLM (or multimodal-LLM) phase for semantic / identification / reasoning work. Each phase is implemented with the model class best suited to its sub-task, rather than trying to solve the full task end-to-end with a single multimodal model.

The canonical instance on the wiki is localization + product identification:

  • Phase 1 (CV): find where each object is on the image — bounding-box segmentation via SAM + post-processing.
  • Phase 2 (LLM): identify which object it is — OCR + LLM reasoning + catalog search.

Why it matters

End-to-end multimodal LLMs are tempting: "just ask the VLM for (box, product) pairs in one shot." In practice this tends to fail on non-trivial images because the two sub-tasks have different model-capability profiles:

  • Localization needs pixel-accurate spatial reasoning. VLMs are weaker at this than purpose-trained segmentation models, which have been optimised on mask datasets for years.
  • Identification needs broad world knowledge and reasoning over text (OCR) + visual features + a catalog. Classical CV systems are weak at this; it's the natural LLM regime.

Decomposing the task lets each phase use the best-in-class tool for its sub-problem, at the cost of an explicit interface between the phases (the bounding-box set).
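Concretely, the explicit interface can be as small as a box type plus an identification record. A minimal sketch (the type names and fields are illustrative, not Instacart's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Phase-1 output: one region of interest, in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

@dataclass(frozen=True)
class Identification:
    """Phase-2 output: the entity matched to one Phase-1 box."""
    box: Box
    entity_id: str
    confidence: float

def crop(image, box: Box):
    """Hand Phase 2 exactly the boxed pixels (image as rows of pixels)."""
    return [row[box.x1:box.x2] for row in image[box.y1:box.y2]]
```

Keeping the interface this narrow is what makes the two phases independently swappable: Phase 2 only ever sees crops and boxes, never Phase-1 internals.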

Mechanism

The general shape:

input image
    ↓
Phase 1: localization (where?)
  - purpose-trained segmentation/detection model
  - classical CV post-processing (weighted box fusion (WBF), heuristics, ensembles)
  - output: { box_i } — bounding boxes around regions of interest
    ↓
Phase 2: identification (what?)
  - OCR on each box's contents
  - LLM/VLM reasoning over OCR + image
  - retrieval/search against domain-specific index
  - output: { (box_i, entity_i, confidence_i) }
    ↓
downstream consumer
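The shape above maps onto roughly this skeleton. All helper callables (segment, run_ocr, llm_identify) are hypothetical stand-ins injected as parameters, not real APIs:

```python
def phase1_localize(image, segment):
    """Where? Purpose-trained model plus post-processing -> bounding boxes."""
    raw_boxes = segment(image)                       # e.g. SAM masks -> boxes
    return [b for b in raw_boxes if b is not None]   # stand-in post-processing

def phase2_identify(image, boxes, run_ocr, llm_identify):
    """What? OCR + LLM reasoning per box -> (box, entity, confidence)."""
    results = []
    for box in boxes:
        text = run_ocr(image, box)                   # OCR on the box contents
        entity, conf = llm_identify(text, image, box)  # LLM + catalog search
        results.append((box, entity, conf))
    return results

def pipeline(image, segment, run_ocr, llm_identify):
    boxes = phase1_localize(image, segment)
    return phase2_identify(image, boxes, run_ocr, llm_identify)
```

Note that the only value crossing the phase boundary is the box list, which is exactly the interface the tradeoffs section below worries about.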

Variants:

  • Single-phase fallback for simple inputs. See patterns/complexity-tiered-model-selection — for easy inputs, a single-phase VLM may suffice. The pattern is about the shape of the pipeline on the hard path, not a prescription for every input.
  • Iterative refinement across phases. Phase 2 can route low-confidence identifications back to Phase 1 for box re-cropping or re-segmentation.
  • HITL checkpoint between phases. Optional human review of Phase-1 outputs before Phase-2 spends LLM budget (not disclosed as present in Instacart's pipeline, but a natural extension).
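The iterative-refinement variant can be sketched as a confidence-gated loop; the threshold, round limit, and callables below are illustrative assumptions, not disclosed values:

```python
def identify_with_refinement(image, boxes, identify, recrop,
                             threshold=0.5, max_rounds=2):
    """Route low-confidence Phase-2 results back to Phase 1 for re-cropping."""
    results = []
    for box in boxes:
        entity, conf = identify(image, box)
        rounds = 0
        while conf < threshold and rounds < max_rounds:
            box = recrop(image, box)          # back to Phase 1: adjust the box
            entity, conf = identify(image, box)
            rounds += 1
        results.append((box, entity, conf))
    return results
```

The round limit matters: without it, a box the LLM can never identify confidently would loop forever and burn LLM budget.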

Why two phases, not one multimodal model

Three reasons the Instacart team — and teams generally — decompose:

  1. Sub-task accuracy. Purpose-trained segmentation/detection models are the state of the art for localization; VLMs are the state of the art for identification. Using each in its strong domain wins.
  2. Independent improvement surface. Phase-1 and Phase-2 can be upgraded, debugged, and evaluated independently. A Phase-2 LLM swap doesn't force a Phase-1 retraining, and vice versa.
  3. Cost control. LLM calls are typically far more expensive than a segmentation-model forward pass. Running the LLM only on pre-cropped boxes (instead of on the whole flyer) both reduces token/pixel cost and improves per-box LLM accuracy.
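Point 3 can be made concrete with back-of-envelope arithmetic. Every number below (flyer size, crop size, box count, per-megapixel cost) is invented purely for illustration:

```python
def vlm_pixel_cost(width, height, cost_per_megapixel):
    """Toy cost model: VLM input cost scales with megapixels sent."""
    return (width * height / 1e6) * cost_per_megapixel

# Assumed: a 4000x6000 flyer vs 40 product crops of 400x300 each.
full_flyer = vlm_pixel_cost(4000, 6000, cost_per_megapixel=0.01)
crops = 40 * vlm_pixel_cost(400, 300, cost_per_megapixel=0.01)

# The crops cover a fraction of the flyer's pixels, so the LLM sees
# less input overall even though it is called 40 times.
assert crops < full_flyer
```

This ignores per-call overhead, which pushes the other way; the point is only that cropping first changes which term dominates.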

Tradeoffs / gotchas

  • The Phase-1 → Phase-2 interface is itself a failure surface. A missed box in Phase 1 is invisible to Phase 2; a false-positive box in Phase 1 costs Phase-2 compute and risks a hallucinated product match. Phase-1 recall and precision are both load-bearing.
  • Confidence composition is non-trivial. Each phase emits its own confidence; combining them into a shippable end-to-end confidence requires care (not additive, typically multiplicative with calibration).
  • Harder to train end-to-end. If you eventually want a unified end-to-end model, a two-phase pipeline's training signals don't compose easily — the Phase-2 LLM sees Phase-1 outputs, not ground-truth boxes, so errors propagate.
  • Operational complexity. Two phases mean two models to version, two deployment stacks, two monitoring surfaces. The overhead is justified only when the sub-tasks' capability profiles differ enough.
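The multiplicative confidence composition mentioned above can be sketched as follows. The identity calibration defaults are placeholders for maps fit on held-out labeled data (e.g. Platt or isotonic scaling):

```python
def calibrate(raw_conf, calibration_map):
    """Map a model's raw confidence to an empirical precision estimate."""
    return calibration_map(raw_conf)

def end_to_end_confidence(p1_conf, p2_conf,
                          cal1=lambda p: p, cal2=lambda p: p):
    """The box must be right AND the identification must be right, so
    (assuming approximate independence) calibrated confidences multiply."""
    return calibrate(p1_conf, cal1) * calibrate(p2_conf, cal2)
```

Multiplying raw, uncalibrated scores is the classic mistake here: two overconfident 0.95s multiply into an overconfident 0.90, so calibration has to happen per phase, before composition.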
