CONCEPT Cited by 1 source

Iterative Coordinate Grid Probing¶

Definition¶

Iterative coordinate grid probing is a technique for extracting precise bounding-box coordinates from a multimodal LLM (VLM) on images where the model cannot reliably return pixel-accurate boxes on its first try. The technique turns coordinate extraction into a recursive localization problem: overlay a uniform grid on the image, ask the VLM which grid cell a target region begins in, then subdivide that cell and repeat until you converge to a box.

This is distinct from promptable segmentation (where a model takes a prompt and returns a mask) and from direct box prediction (where a VLM is asked to "output (x1, y1, x2, y2) for each product"). Direct box prediction is the natural approach but frequently fails on non-trivial images; iterative grid probing spends more VLM calls to get more reliable outputs, provided the image is simple enough for the technique to converge.

Mechanism¶

The Instacart-described recipe:

Overlay a uniform grid on the image. Draw N×N grid lines across the flyer (N small, e.g. 4×4).
Ask the VLM for the first box's starting cell. Prompt: "where does the first product box begin (X, Y coordinates)?"
Subdivide the identified region. Within the cell the VLM named, draw a finer grid.
Ask for start + end coordinates of each box. Recursively ask the VLM for the starting and ending coordinates of every box within the subdivided region.
Stop when converged. Output the collected boxes.
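
The refinement loop in the steps above can be sketched as follows. This is a minimal illustration rather than Instacart's implementation: it localizes a single box corner, and `ask_cell` is a hypothetical callable standing in for "render the grid over the current region, send the image to the VLM, parse the cell it names". The tolerance `tol`, the round limit `max_rounds`, and the `make_oracle` stub (a fake VLM that always answers correctly) are assumptions added for demonstration.

```python
from typing import Callable, Tuple

# (left, top, right, bottom) in pixel units
Region = Tuple[float, float, float, float]


def cell_region(region: Region, n: int, col: int, row: int) -> Region:
    """Pixel bounds of cell (col, row) in an n x n grid laid over `region`."""
    left, top, right, bottom = region
    cell_w = (right - left) / n
    cell_h = (bottom - top) / n
    return (left + col * cell_w, top + row * cell_h,
            left + (col + 1) * cell_w, top + (row + 1) * cell_h)


def probe_point(ask_cell: Callable[[Region, int], Tuple[int, int]],
                width: int, height: int, n: int = 4,
                tol: float = 2.0, max_rounds: int = 8) -> Region:
    """Narrow down one box corner by repeated grid subdivision.

    `ask_cell(region, n)` is a hypothetical stand-in for one VLM call that
    returns the (col, row) of the grid cell containing the corner.
    """
    region: Region = (0.0, 0.0, float(width), float(height))
    for _ in range(max_rounds):
        left, top, right, bottom = region
        if right - left <= tol and bottom - top <= tol:
            break  # converged: the region is already small enough
        col, row = ask_cell(region, n)
        if not (0 <= col < n and 0 <= row < n):
            raise ValueError(f"cell ({col}, {row}) is outside the {n}x{n} grid")
        region = cell_region(region, n, col, row)
    return region


def make_oracle(x: float, y: float) -> Callable[[Region, int], Tuple[int, int]]:
    """Fake VLM for testing: always names the cell that contains (x, y)."""
    def ask_cell(region: Region, n: int) -> Tuple[int, int]:
        left, top, right, bottom = region
        col = min(n - 1, int((x - left) / ((right - left) / n)))
        row = min(n - 1, int((y - top) / ((bottom - top) / n)))
        return col, row
    return ask_cell
```

A full box would need this run twice (once for the starting corner, once for the ending corner), and a real `ask_cell` would also have to handle malformed or wrong answers from the VLM, which the oracle never produces.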

The drawing step is load-bearing: the VLM is shown an image that already has the grid lines rendered on it, giving the VLM a visual coordinate system to anchor its answer against, rather than asking it to reason in abstract pixel space.
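
To make the rendering step concrete, here is a dependency-free sketch that draws an N×N grid onto a copy of a grayscale image stored as rows of integers. The function names and the list-of-rows image representation are assumptions chosen to keep the example self-contained; a production pipeline would more likely use an imaging library, and may also label the cells so the VLM can name them.

```python
from typing import List


def grid_line_positions(extent: int, n: int) -> List[int]:
    """Pixel offsets of the n + 1 grid lines spanning `extent` pixels.

    The last line is clamped to the final pixel so it stays inside the image.
    """
    return [min(extent - 1, round(i * extent / n)) for i in range(n + 1)]


def draw_grid(pixels: List[List[int]], n: int, value: int = 255) -> List[List[int]]:
    """Return a copy of `pixels` with an n x n grid drawn in `value`."""
    height, width = len(pixels), len(pixels[0])
    out = [row[:] for row in pixels]  # copy, so the source image is untouched
    for y in grid_line_positions(height, n):
        out[y] = [value] * width      # horizontal line
    for x in grid_line_positions(width, n):
        for row in out:
            row[x] = value            # vertical line
    return out
```

Drawing onto a copy matters because each refinement round needs a fresh grid over a different region of the same underlying image.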

Why it matters¶

Multimodal LLMs are strong at answering "what is in this image?" questions and relatively weaker at "where is it, precisely?" questions. Direct pixel-coordinate prediction often produces plausible-looking but imprecise boxes. The iterative grid technique works around the weakness:

Grids give the VLM a discrete frame of reference. "Which cell?" is an easier question than "which pixel?" — the answer lives in a small enumerable set.
Recursion amplifies precision. Each subdivision narrows the candidate region by a factor of N along each axis (for a 4×4 grid, that is two bits of positional information per axis per round). A few rounds yield box coordinates accurate enough for downstream use without the VLM ever producing a numerical pixel count directly.
No model fine-tuning required. The technique is purely prompting + image manipulation on a frozen off-the-shelf VLM.
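
The precision argument can be checked with a little arithmetic. The sketch below assumes an idealized VLM that always names the correct cell, so after k rounds the candidate region spans extent / N^k pixels along an axis; the function names are illustrative, not from the source.

```python
import math


def rounds_needed(extent_px: float, n: int, tol_px: float) -> int:
    """Smallest number of rounds k such that extent_px / n**k <= tol_px."""
    if n < 2:
        raise ValueError("grid must have at least 2 cells per axis")
    if tol_px <= 0:
        raise ValueError("tolerance must be positive")
    k = 0
    size = extent_px
    while size > tol_px:
        size /= n
        k += 1
    return k


def bits_per_round(n: int) -> float:
    """Positional information gained per round over both axes, in bits."""
    return 2 * math.log2(n)
```

Under these assumptions, `rounds_needed(1024, 4, 4)` is 4: a 1024-pixel-wide image shrinks to 256, 64, 16, then 4 pixels. Real accuracy will be lower whenever the VLM picks a wrong cell, since later rounds cannot recover from an early mistake.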

Where it works — and where it doesn't¶

Instacart reports ~90% accuracy using iterative-grid probing on simple flyers (well-separated boxes, few in number). They explicitly do not use this technique for complex flyers:

"However, for more complex flyer images like [figure 5], multimodal LLMs produce imprecise bounding boxes." (Source: sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable)

For complex flyers the team falls back to a pipeline built on SAM with post-processing. This is the motivating example for patterns/complexity-tiered-model-selection: pick the detector by input complexity rather than using one detector globally.
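
The routing idea can be sketched as a small dispatcher. The source does not say how Instacart decides whether a flyer is simple, so `is_simple` here is a hypothetical predicate supplied by the caller, and both detectors are passed in as plain callables rather than tied to any real API.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2)
Detector = Callable[[Any], List[Box]]


def detect_boxes(image: Any,
                 is_simple: Callable[[Any], bool],
                 simple_detector: Detector,
                 complex_detector: Detector) -> List[Box]:
    """Route an image to the detector that matches its complexity tier.

    `is_simple` is a hypothetical classifier; the source does not describe
    how the tier is actually chosen.
    """
    detector = simple_detector if is_simple(image) else complex_detector
    return detector(image)
```

In the setup described above, `simple_detector` would wrap iterative grid probing and `complex_detector` would wrap the SAM-based pipeline.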

Iterative grid probing works best on simple inputs:

Few, well-separated boxes
High contrast between boxes and background
No overlapping or occluded items
Uniform grid layouts

It fails on:

Dense, overlapping detections (a single grid cell ends up containing multiple boxes)
Decorative elements the VLM mistakes for products
Irregular layouts where grid cells don't align with box boundaries

Tradeoffs / gotchas¶

Multiple VLM calls per image. Each grid round is a separate inference call, so cost scales with the product of box count and grid depth. For B boxes at k levels of subdivision, expect O(B × k) calls.
Rendering the grid is part of the input. The pipeline must render grid lines onto a copy of the image before each call; the VLM does not invent the grid.
Answer quality depends on VLM spatial-reasoning ability. VLMs vary widely in grid-cell-identification accuracy. Test empirically before committing.
Not a general replacement for a segmentation model. At Instacart, it ships only for the simple-flyer tier; the complex tier uses SAM + post-processing.
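
A rough call budget follows directly from the product of box count and subdivision depth. The helper below is a back-of-the-envelope sketch; `calls_per_level` is an assumed knob (for example 2 if the starting and ending corners are probed in separate calls), not a figure from the source.

```python
def estimate_vlm_calls(num_boxes: int, depth: int, calls_per_level: int = 1) -> int:
    """Upper-bound estimate of VLM calls for one image.

    Assumes every box is refined through `depth` grid levels and that each
    level costs `calls_per_level` inference calls (an assumed parameter).
    """
    if num_boxes < 0 or depth < 0 or calls_per_level < 1:
        raise ValueError("counts must be non-negative and calls_per_level >= 1")
    return num_boxes * depth * calls_per_level
```

For a flyer with 6 boxes refined over 4 levels, this gives 24 calls at one call per level and 48 if each corner needs its own call, which is the kind of number worth checking against latency and cost limits before adopting the technique.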

Seen in¶

sources/2026-02-09-instacart-from-print-to-digital-making-weekly-flyers-shoppable — canonical wiki instance. Instacart's flyer-digitization pipeline uses iterative grid probing as the simple-flyer detector in Phase 1, reporting ~90% accuracy on well-separated few-box flyers. The technique is explicitly not used for complex flyers, where SAM takes over.

concepts/vlm-as-image-judge — sibling VLM-in-pipeline usage at Instacart (quality gate, not coordinate extractor)
patterns/complexity-tiered-model-selection — the pattern iterative grid probing lives inside
systems/instacart-flyer-digitization-pipeline — canonical production use
companies/instacart