PATTERN

Hybrid CV + LLM Pipeline

Definition

A hybrid computer-vision + LLM pipeline decomposes a visual understanding task into two (or more) phases: an early classical-CV / purpose-trained-model phase for geometric / localization work, and a later LLM (or multimodal-LLM) phase for semantic / identification / reasoning work. Each phase is implemented with the model class best suited to its sub-task, rather than trying to solve the full task end-to-end with a single multimodal model.

The canonical instance on the wiki is localization + product identification:

  • Phase 1 (CV): find where each object is on the image — bounding-box segmentation via SAM + post-processing.
  • Phase 2 (LLM): identify which object it is — OCR + LLM reasoning + catalog search.

Why it matters

End-to-end multimodal LLMs are tempting: "just ask the VLM for (box, product) pairs in one shot." In practice this tends to fail on non-trivial images because the two sub-tasks have different model-capability profiles:

  • Localization needs pixel-accurate spatial reasoning. VLMs are weaker at this than purpose-trained segmentation models, which have been optimised on mask datasets for years.
  • Identification needs broad world knowledge and reasoning over text (OCR) + visual features + a catalog. Classical CV systems are weak at this; it's the natural LLM regime.

Decomposing the task lets each phase use the best-in-class tool for its sub-problem, at the cost of an explicit interface between the phases (the bounding-box set).
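Concretely, the explicit interface can be as small as a box type plus an identification record. A minimal sketch (the type names and fields are illustrative, not Instacart's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Phase-1 output: one region of interest, in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

@dataclass(frozen=True)
class Identification:
    """Phase-2 output: the entity matched to one Phase-1 box."""
    box: Box
    entity_id: str
    confidence: float

def crop(image, box: Box):
    """Hand Phase 2 exactly the boxed pixels (image as rows of pixels)."""
    return [row[box.x1:box.x2] for row in image[box.y1:box.y2]]
```

Keeping the interface this narrow is what makes the two phases independently swappable: Phase 2 only ever sees crops and boxes, never Phase-1 internals.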

Mechanism

The general shape:

input image
    ↓
Phase 1: localization (where?)
  - purpose-trained segmentation/detection model
  - classical CV post-processing (weighted box fusion (WBF), heuristics, ensembles)
  - output: { box_i } — bounding boxes around regions of interest
    ↓
Phase 2: identification (what?)
  - OCR on each box's contents
  - LLM/VLM reasoning over OCR + image
  - retrieval/search against domain-specific index
  - output: { (box_i, entity_i, confidence_i) }
    ↓
downstream consumer
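The shape above maps onto roughly this skeleton. All helper callables (segment, run_ocr, llm_identify) are hypothetical stand-ins injected as parameters, not real APIs:

```python
def phase1_localize(image, segment):
    """Where? Purpose-trained model plus post-processing -> bounding boxes."""
    raw_boxes = segment(image)                       # e.g. SAM masks -> boxes
    return [b for b in raw_boxes if b is not None]   # stand-in post-processing

def phase2_identify(image, boxes, run_ocr, llm_identify):
    """What? OCR + LLM reasoning per box -> (box, entity, confidence)."""
    results = []
    for box in boxes:
        text = run_ocr(image, box)                   # OCR on the box contents
        entity, conf = llm_identify(text, image, box)  # LLM + catalog search
        results.append((box, entity, conf))
    return results

def pipeline(image, segment, run_ocr, llm_identify):
    boxes = phase1_localize(image, segment)
    return phase2_identify(image, boxes, run_ocr, llm_identify)
```

Note that the only value crossing the phase boundary is the box list, which is exactly the interface the tradeoffs section below worries about.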

Variants:

  • Single-phase fallback for simple inputs. See patterns/complexity-tiered-model-selection — for easy inputs, a single-phase VLM may suffice. The pattern is about the shape of the pipeline on the hard path, not a prescription for every input.
  • Iterative refinement across phases. Phase 2 can route low-confidence identifications back to Phase 1 for box re-cropping or re-segmentation.
  • HITL checkpoint between phases. Optional human review of Phase-1 outputs before Phase-2 spends LLM budget (not disclosed as present in Instacart's pipeline, but a natural extension).
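The iterative-refinement variant can be sketched as a confidence-gated loop; the threshold, round limit, and callables below are illustrative assumptions, not disclosed values:

```python
def identify_with_refinement(image, boxes, identify, recrop,
                             threshold=0.5, max_rounds=2):
    """Route low-confidence Phase-2 results back to Phase 1 for re-cropping."""
    results = []
    for box in boxes:
        entity, conf = identify(image, box)
        rounds = 0
        while conf < threshold and rounds < max_rounds:
            box = recrop(image, box)          # back to Phase 1: adjust the box
            entity, conf = identify(image, box)
            rounds += 1
        results.append((box, entity, conf))
    return results
```

The round limit matters: without it, a box the LLM can never identify confidently would loop forever and burn LLM budget.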

Why two phases, not one multimodal model

Three reasons the Instacart team — and teams generally — decompose:

  1. Sub-task accuracy. Purpose-trained segmentation/detection models are the state of the art for localization; VLMs are the state of the art for identification. Using each in its strong domain wins.
  2. Independent improvement surface. Phase-1 and Phase-2 can be upgraded, debugged, and evaluated independently. A Phase-2 LLM swap doesn't force a Phase-1 retraining, and vice versa.
  3. Cost control. LLM calls are typically far more expensive than a segmentation-model forward pass. Running the LLM only on pre-cropped boxes (instead of on the whole flyer) both reduces token/pixel cost and improves per-box LLM accuracy.
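Point 3 can be made concrete with back-of-envelope arithmetic. Every number below (flyer size, crop size, box count, per-megapixel cost) is invented purely for illustration:

```python
def vlm_pixel_cost(width, height, cost_per_megapixel):
    """Toy cost model: VLM input cost scales with megapixels sent."""
    return (width * height / 1e6) * cost_per_megapixel

# Assumed: a 4000x6000 flyer vs 40 product crops of 400x300 each.
full_flyer = vlm_pixel_cost(4000, 6000, cost_per_megapixel=0.01)
crops = 40 * vlm_pixel_cost(400, 300, cost_per_megapixel=0.01)

# The crops cover a fraction of the flyer's pixels, so the LLM sees
# less input overall even though it is called 40 times.
assert crops < full_flyer
```

This ignores per-call overhead, which pushes the other way; the point is only that cropping first changes which term dominates.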

Tradeoffs / gotchas

  • The Phase-1 → Phase-2 interface is itself a failure surface. A missed box in Phase 1 is invisible to Phase 2; a false-positive box in Phase 1 costs Phase-2 compute and risks a hallucinated product match. Phase-1 recall and precision are both load-bearing.
  • Confidence composition is non-trivial. Each phase emits its own confidence; combining them into a shippable end-to-end confidence requires care (not additive, typically multiplicative with calibration).
  • Harder to train end-to-end. If you eventually want a unified end-to-end model, a two-phase pipeline's training signals don't compose easily — the Phase-2 LLM sees Phase-1 outputs, not ground-truth boxes, so errors propagate.
  • Operational complexity. Two phases mean two models to version, two deployment stacks, two monitoring surfaces. The overhead is justified only when the sub-tasks' capability profiles differ enough.
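The multiplicative confidence composition mentioned above can be sketched as follows. The identity calibration defaults are placeholders for maps fit on held-out labeled data (e.g. Platt or isotonic scaling):

```python
def calibrate(raw_conf, calibration_map):
    """Map a model's raw confidence to an empirical precision estimate."""
    return calibration_map(raw_conf)

def end_to_end_confidence(p1_conf, p2_conf,
                          cal1=lambda p: p, cal2=lambda p: p):
    """The box must be right AND the identification must be right, so
    (assuming approximate independence) calibrated confidences multiply."""
    return calibrate(p1_conf, cal1) * calibrate(p2_conf, cal2)
```

Multiplying raw, uncalibrated scores is the classic mistake here: two overconfident 0.95s multiply into an overconfident 0.90, so calibration has to happen per phase, before composition.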
