PATTERN
VLM-assisted pre-labeling¶
Intent¶
Replace human-from-scratch annotation with a two-stage pipeline where a Vision-Language Model + internal teacher models generate high-quality pre-labels, and humans only correct them. Cuts per-label time and cost sharply (Instacart Capsight projects >70% cost reduction; multi-day labelling tasks shrink to hours) without surrendering the accuracy guarantee that comes from a human in the loop.
When to use¶
- Training-data pipeline is bottlenecked on annotation throughput — labels can't keep up with collected data.
- The labelling task is well-defined (item recognition, barcode transcription, bounding-box placement) so a VLM's output is a credible draft, not a wild guess.
- A teacher model is already available — either an earlier production model, a frontier LLM/VLM accessed via API, or a distilled-from-larger-model ensemble ([[concepts/knowledge-distillation]]).
- The annotation budget is significant enough that a 50–70% reduction pays for the VLM inference costs several times over.
Mechanism¶
- Filter obvious negatives first. Before invoking the VLM, run a cheap filter to drop clearly uninteresting data — Capsight filters empty-background images before pre-labelling.
- Run VLM + teacher models to generate pre-labels. The VLM handles broad item/bbox proposals; internal teacher models (often prior-generation production models or ensembles) refine specific dimensions like barcode transcription. Output is a fully-formed candidate label set per image.
- Route to human annotators for correction, not creation. Annotators see an image with labels already on it and either accept, tweak, or reject. Cognitive load per image drops; throughput per annotator-hour rises.
- Feed corrections back into the teacher pool. Corrected labels become new training data for:
- the production model (the primary goal),
- and, optionally, the teacher model itself (closing a secondary flywheel on labelling quality).
- Reuse the same pipeline to clean historical ground truth. Capsight explicitly notes the VLM-assisted flow "allows us to efficiently clean errors from our historical ground truth data" — the same infrastructure that labels new data finds mistakes in old data.
Compared to adjacent patterns¶
- patterns/human-calibrated-llm-labeling — same structural idea for general LLM labelling; VLM-assisted pre-labeling is the computer-vision + multimodal specialisation, with the additional teacher model detail.
- patterns/low-confidence-to-human-review routes only low-confidence outputs to humans. VLM-assisted pre-labeling routes all outputs through humans for correction, but reduces per-image time to seconds. They can compose: pre-label everything, auto-accept above a confidence threshold, send only low-confidence pre-labels to humans.
- patterns/data-driven-annotation-curation decides which images get labelled at all; VLM-assisted pre-labeling decides how the labelling happens once an image is picked. They compose in the obvious way.
- concepts/vlm-as-image-judge / [[patterns/vlm-evaluator-quality-gate]] use a VLM to score a model's output — this pattern uses a VLM to seed training-label creation. Same underlying capability, different pipeline position.
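The composition with patterns/low-confidence-to-human-review is mechanically simple. A hedged sketch, assuming pre-labels arrive as (label, confidence) pairs — the shape and threshold are illustrative, not from the source:

```python
def route_prelabels(prelabels: list[tuple], threshold: float = 0.9):
    """Split pre-labels into auto-accepted labels and a human review queue.

    prelabels: list of (label, confidence) pairs (hypothetical shape).
    Confident drafts skip the annotator entirely; the rest get corrected.
    """
    accepted, review_queue = [], []
    for label, confidence in prelabels:
        if confidence >= threshold:
            accepted.append(label)
        else:
            review_queue.append(label)
    return accepted, review_queue
```

With the threshold at 1.0 this degenerates to pure pre-labeling (every draft is human-reviewed); lowering it trades annotator time against tolerance for uncaught VLM errors.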
Why it works¶
Vision-language models are now good enough at general object recognition, bounding-box hints, and short-string transcription (like barcodes) that their raw outputs exceed the accuracy floor humans need to start from. The inference economics are also favourable: a single VLM call on one image is orders of magnitude cheaper than a human annotator's time for the same image. The value-add of the human shrinks from "produce the whole label" to "catch residual errors" — a qualitatively easier, faster task.
Caveats¶
- Teacher-model quality drives ceiling quality. If the VLM confidently generates wrong labels and humans rubber-stamp them, the flywheel amplifies errors. Mitigate with audit sampling (random draw of pre-label + correction pairs reviewed by senior annotators) — patterns/human-in-the-loop-quality-sampling.
- Task fit matters. Pre-labelling works for structured, recognisable outputs. For fuzzy / subjective labels ("is this product image on-brand?"), pre-labels help less.
- Tooling matters. If the annotation UI doesn't make accepting a pre-label a single click, the claimed speedup doesn't materialise.
Seen in¶
- systems/capsight Depot — VLM + teacher models pre-label items + barcodes for Caper smart-cart training data. Projected >70% annotation cost reduction; multi-day tasks → hours; same pipeline cleans historical ground truth. (Source: sources/2026-02-17-instacart-turning-data-into-velocity-capers-edge-and-cloud-data-flywheel-with-capsight)