PATTERN Cited by 1 source
Hybrid CV + LLM Pipeline¶
Definition¶
A hybrid computer-vision + LLM pipeline decomposes a visual understanding task into two (or more) phases: an early classical-CV / purpose-trained-model phase for geometric / localization work, and a later LLM (or multimodal-LLM) phase for semantic / identification / reasoning work. Each phase is implemented with the model class best suited to its sub-task, rather than trying to solve the full task end-to-end with a single multimodal model.
The canonical instance on the wiki is localization + product identification:
- Phase 1 (CV): find where each object is on the image — bounding-box segmentation via SAM + post-processing.
- Phase 2 (LLM): identify which object it is — OCR + LLM reasoning + catalog search.
Why it matters¶
End-to-end multimodal LLMs are tempting: "just ask the VLM for (box, product) pairs in one shot." In practice this tends to fail on non-trivial images because the two sub-tasks have different model-capability profiles:
- Localization needs pixel-accurate spatial reasoning. VLMs are weaker at this than purpose-trained segmentation models, which have been optimized on mask datasets for years.
- Identification needs broad world knowledge and reasoning over text (OCR) + visual features + a catalog. Classical CV systems are weak at this; it's the natural LLM regime.
Decomposing the task lets each phase use the best-in-class tool for its sub-problem, at the cost of an explicit interface between the phases (the bounding-box set).
Mechanism¶
The general shape:
input image
│
▼
Phase 1: localization (where?)
- purpose-trained segmentation/detection model
- classical CV post-processing (WBF, heuristics, ensembles)
- output: { box_i } — bounding boxes around regions of interest
│
▼
Phase 2: identification (what?)
- OCR on each box's contents
- LLM/VLM reasoning over OCR + image
- retrieval/search against domain-specific index
- output: { (box_i, entity_i, confidence_i) }
│
▼
downstream consumer
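The shape above can be sketched as a thin orchestration layer that only fixes the interface between the phases. Every name below (`Box`, `stub_localize`, `stub_identify`) is hypothetical, not Instacart's API; a real Phase 1 would wrap SAM plus WBF post-processing, a real Phase 2 would wrap OCR, an LLM call, and catalog search:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Box:
    x: int
    y: int
    w: int
    h: int

@dataclass(frozen=True)
class Identification:
    box: Box
    entity: str
    confidence: float

def run_pipeline(
    image,
    localize: Callable[[object], List[Box]],               # Phase 1: where?
    identify: Callable[[object, Box], Tuple[str, float]],  # Phase 2: what?
) -> List[Identification]:
    boxes = localize(image)            # segmentation + post-processing
    results = []
    for box in boxes:                  # LLM budget spent per box only
        entity, conf = identify(image, box)
        results.append(Identification(box, entity, conf))
    return results

# Stub phases so the skeleton runs end-to-end.
def stub_localize(image):
    return [Box(0, 0, 100, 50), Box(0, 60, 100, 50)]

def stub_identify(image, box):
    return ("unknown-product", 0.5)

results = run_pipeline(None, stub_localize, stub_identify)
```

Note that the only coupling between the phases is the `List[Box]` interface: either phase can be swapped out without touching the other, which is the independent-improvement property discussed below.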
Variants:
- Single-phase fallback for simple inputs. See patterns/complexity-tiered-model-selection — for easy inputs, a single-phase VLM may suffice. The pattern is about the shape of the pipeline on the hard path, not a prescription for every input.
- Iterative refinement across phases. Phase 2 can route low-confidence identifications back to Phase 1 for box re-cropping or re-segmentation.
- HITL checkpoint between phases. Optional human review of Phase-1 outputs before Phase-2 spends LLM budget (not disclosed as present in Instacart's pipeline, but a natural extension).
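The iterative-refinement variant amounts to a bounded feedback loop from Phase 2 back to Phase 1. The sketch below is an assumption about how such routing could work; `threshold`, `max_rounds`, and all function names are illustrative, not from the source:

```python
def identify_with_refinement(image, boxes, identify, refine,
                             threshold=0.6, max_rounds=2):
    """Route low-confidence Phase-2 results back to Phase 1 for re-cropping."""
    results = []
    for box in boxes:
        entity, conf = identify(image, box)
        rounds = 0
        # Below threshold: ask Phase 1 to re-segment this region, then retry.
        while conf < threshold and rounds < max_rounds:
            box = refine(image, box)
            entity, conf = identify(image, box)
            rounds += 1
        results.append((box, entity, conf))
    return results

# Stubs: identification succeeds only after one refinement pass.
def demo_identify(image, box):
    return ("item", 0.9) if box == "refined" else ("item", 0.3)

def demo_refine(image, box):
    return "refined"

out = identify_with_refinement(None, ["raw"], demo_identify, demo_refine)
```

Bounding the loop with `max_rounds` matters: without it, a genuinely unidentifiable region would burn LLM budget indefinitely.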
Why two phases, not one multimodal model¶
Three reasons the Instacart team — and teams generally — decompose:
- Sub-task accuracy. Purpose-trained segmentation/detection models are the state of the art for localization; VLMs are the state of the art for identification. Using each in its strong domain wins.
- Independent improvement surface. Phase-1 and Phase-2 can be upgraded, debugged, and evaluated independently. A Phase-2 LLM swap doesn't force a Phase-1 retraining, and vice versa.
- Cost control. LLM calls are typically far more expensive than a segmentation-model forward pass. Running the LLM only on pre-cropped boxes (instead of on the whole flyer) both reduces token/pixel cost and improves per-box LLM accuracy.
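The cost-control point can be made concrete with back-of-the-envelope arithmetic on how many pixels the expensive model is asked to look at. All numbers below are illustrative assumptions, not figures from the source:

```python
# Hypothetical per-flyer input sizes.
flyer_pixels = 4000 * 3000        # one full-resolution flyer page
n_boxes = 60                      # products localized by Phase 1
box_pixels = 300 * 300            # one pre-cropped product box

whole_image_load = flyer_pixels          # single-phase VLM sees everything
per_box_load = n_boxes * box_pixels      # two-phase VLM sees only the crops
savings = 1 - per_box_load / whole_image_load
```

Under these assumptions the two-phase path feeds the LLM roughly half the pixels, and each crop arrives at full per-object resolution, which is where the per-box accuracy gain comes from.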
Tradeoffs / gotchas¶
- The Phase-1 → Phase-2 interface is a failure mode. A missed box in Phase 1 is invisible to Phase 2; a false-positive box in Phase 1 costs Phase-2 compute and risks a hallucinated product match. Phase-1 precision is load-bearing.
- Confidence composition is non-trivial. Each phase emits its own confidence; combining them into a shippable end-to-end confidence requires care (not additive, typically multiplicative with calibration).
- Harder to train end-to-end. If you eventually want a unified end-to-end model, a two-phase pipeline's training signals don't compose easily — the Phase-2 LLM sees Phase-1 outputs, not ground-truth boxes, so errors propagate.
- Operational complexity. Two phases = two models to version, two deployment stacks, two monitoring surfaces. The overhead is justified only when the sub-tasks' capability profiles differ enough.
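The confidence-composition point can be made concrete. The helper below is a sketch of one reasonable scheme, not a prescribed formula: multiply the phase confidences (which assumes the phases' errors are roughly independent), then pass the raw product through a calibration map fitted on held-out labels (e.g. Platt scaling):

```python
def combine_confidence(p_loc, p_id, calibrate=None):
    """End-to-end confidence as a calibrated product of phase confidences.

    p_loc     -- Phase-1 confidence that the box is a real object
    p_id      -- Phase-2 confidence in the identification for that box
    calibrate -- optional map fitted on held-out labels to correct the
                 raw product's miscalibration
    """
    raw = p_loc * p_id
    return calibrate(raw) if calibrate else raw

# A strong box and a strong identification still compose to a noticeably
# weaker end-to-end confidence:
e2e = combine_confidence(0.9, 0.8)   # ~0.72
```

This is exactly why per-phase thresholds overstate end-to-end quality: two phases that each look "90%/80% good" ship a result that is right only about 72% of the time under the independence assumption.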
Seen in¶
- sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable — canonical wiki instance. Instacart's flyer-digitization pipeline explicitly separates Phase 1 (SAM-based segmentation + post-processing) from Phase 2 (OCR + LLM + catalog search), with manual workflow (3–4 h per flyer) collapsed to <30 min end-to-end. Phase-2 details are truncated in the captured body.
Related¶
- patterns/complexity-tiered-model-selection — sibling pattern of routing by input complexity (simple inputs may skip the multi-phase path)
- patterns/multi-stage-extraction-pipeline — adjacent wiki pattern (general multi-stage extraction, not CV-specific)
- patterns/vlm-evaluator-quality-gate — sibling pattern using a VLM as a judge instead of as an identifier
- systems/instacart-flyer-digitization-pipeline — canonical production instance
- systems/segment-anything-model-sam — the Phase-1 base model in the canonical instance
- concepts/weighted-boxes-fusion — Phase-1 post-processing technique
- companies/instacart