PATTERN Cited by 1 source

VLM-assisted pre-labeling

Intent

Replace human-from-scratch annotation with a two-stage pipeline where a Vision-Language Model + internal teacher models generate high-quality pre-labels, and humans only correct them. Cuts per-label time and cost sharply (Instacart Capsight projects >70% cost reduction; multi-day labelling tasks shrink to hours) without surrendering the accuracy guarantee that comes from a human in the loop.

When to use

  • Training-data pipeline is bottlenecked on annotation throughput — labels can't keep up with collected data.
  • The labelling task is well-defined (item recognition, barcode transcription, bounding-box placement) so a VLM's output is a credible draft, not a wild guess.
  • A teacher model is already available — either an earlier production model, a frontier LLM/VLM accessed via API, or a distilled-from-larger-model ensemble. ([[concepts/knowledge-distillation]].)
  • The annotation budget is significant enough that a 50–70% reduction pays for the VLM inference costs several times over.
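The last bullet is a simple break-even check. A sketch with illustrative numbers (all costs and the time-reduction figure below are assumptions, not figures from the pattern):

```python
# Break-even check for VLM-assisted pre-labelling.
# All numbers are illustrative assumptions.
human_cost_per_label = 0.50   # fully loaded annotator cost per label, USD
vlm_cost_per_label = 0.01     # one VLM call per image, USD
time_reduction = 0.60         # mid-range of the 50-70% reduction claim

cost_before = human_cost_per_label
# After pre-labelling: reduced human time plus the VLM inference cost.
cost_after = human_cost_per_label * (1 - time_reduction) + vlm_cost_per_label

# How many times over the savings cover the VLM spend.
savings_multiple = (cost_before - cost_after) / vlm_cost_per_label
```

With these numbers the per-label cost drops from $0.50 to about $0.21, so the savings cover the VLM spend roughly 29x over — "several times over" holds with a wide margin for error in the assumptions.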

Mechanism

  1. Filter obvious negatives first. Before invoking the VLM, run a cheap filter to drop clearly uninteresting data — Capsight filters empty-background images before pre-labelling.
  2. Run VLM + teacher models to generate pre-labels. The VLM handles broad item/bbox proposals; internal teacher models (often prior-generation production models or ensembles) refine specific dimensions like barcode transcription. Output is a fully-formed candidate label set per image.
  3. Route to human annotators for correction, not creation. Annotators see an image with labels already on it and either accept, tweak, or reject. Cognitive load per image drops; throughput per annotator-hour rises.
  4. Feed corrections back into the teacher pool. Corrected labels become new training data for:
    • the production model (the primary goal),
    • and, optionally, the teacher model itself (closing a secondary flywheel on labelling quality).
  5. Reuse the same pipeline to clean historical ground truth. Capsight explicitly notes the VLM-assisted flow "allows us to efficiently clean errors from our historical ground truth data" — the same infrastructure that labels new data finds mistakes in old data.
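The five steps above can be sketched as glue code. Everything here is hypothetical scaffolding — the function names (`is_background`, `vlm_propose`, `teacher_refine`, `human_correct`) are stand-ins for the filter, VLM, teacher models, and annotation UI, not a real API:

```python
from dataclasses import dataclass

@dataclass
class PreLabel:
    boxes: list              # VLM item/bbox proposals
    barcode: "str | None"    # dimension refined by an internal teacher model
    confidence: float

def pre_label_pipeline(images, is_background, vlm_propose, teacher_refine,
                       human_correct, training_pool):
    """Hypothetical glue: filter -> pre-label -> human correction -> feedback."""
    for img in images:
        if is_background(img):               # 1. cheap negative filter
            continue
        draft = vlm_propose(img)             # 2a. broad item/bbox proposals
        draft = teacher_refine(img, draft)   # 2b. teacher refines e.g. barcodes
        final = human_correct(img, draft)    # 3. accept / tweak / reject (None = reject)
        if final is not None:
            training_pool.append((img, final))  # 4. corrections feed back as training data
    return training_pool
```

Step 5 falls out for free: point `images` at the historical dataset and the same loop surfaces corrections to old ground truth.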

Compared to adjacent patterns

  • patterns/human-calibrated-llm-labeling — same structural idea for general LLM labelling; VLM-assisted pre-labeling is the computer-vision + multimodal specialisation, with the additional teacher model detail.
  • patterns/low-confidence-to-human-review routes only low-confidence outputs to humans. VLM-assisted pre-labeling routes all outputs through humans for correction, but reduces per-image time to seconds. They can compose: pre-label everything, auto-accept above a confidence threshold, send only low-confidence pre-labels to humans.
  • patterns/data-driven-annotation-curation decides which images get labelled at all; VLM-assisted pre-labeling decides how the labelling happens once an image is picked. They compose in the obvious way.
  • concepts/vlm-as-image-judge / [[patterns/vlm-evaluator-quality-gate]] use a VLM to score a model's output — this pattern uses a VLM to seed training-label creation. Same underlying capability, different pipeline position.
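The composition with low-confidence routing mentioned above is a one-function split. A minimal sketch, assuming each pre-label carries a scalar confidence and the threshold is tuned per task:

```python
def route_pre_labels(pre_labels, threshold=0.9):
    """Compose the two patterns: auto-accept confident pre-labels,
    send only the uncertain ones to human correction.
    `threshold` is an assumed, task-tuned parameter."""
    auto_accepted, needs_review = [], []
    for item in pre_labels:
        bucket = auto_accepted if item["confidence"] >= threshold else needs_review
        bucket.append(item)
    return auto_accepted, needs_review
```

The threshold trades annotation cost against the risk of auto-accepting confidently wrong labels, which is exactly the failure mode the caveats section warns about.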

Why it works

Vision-language models are now good enough at general object recognition, bounding-box hints, and short-string transcription (like barcodes) that their raw outputs exceed the accuracy floor humans need to start from. The inference economics are also favourable: a single VLM call on one image is orders of magnitude cheaper than a human annotator's time for the same image. The value-add of the human shrinks from "produce the whole label" to "catch residual errors" — a qualitatively easier, faster task.

Caveats

  • Teacher-model quality sets the quality ceiling. If the VLM confidently generates wrong labels and humans rubber-stamp them, the flywheel amplifies errors. Mitigate with audit sampling (random draw of pre-label + correction pairs reviewed by senior annotators) — patterns/human-in-the-loop-quality-sampling.
  • Task fit matters. Pre-labelling works for structured, recognisable outputs. For fuzzy / subjective labels ("is this product image on-brand?"), pre-labels help less.
  • Tooling matters. If the annotation UI doesn't make accepting a pre-label a single click, the claimed speedup doesn't materialise.
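The audit-sampling mitigation from the first caveat is a small amount of code. A sketch — the 2% rate and fixed seed are assumptions for illustration, not values from the pattern:

```python
import random

def audit_sample(correction_pairs, rate=0.02, seed=0):
    """Random draw of (pre_label, correction) pairs for senior-annotator review.
    `rate` and `seed` are illustrative assumptions; in practice the rate would
    be set from the error budget, and the draw re-seeded per batch."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible for re-audits
    k = min(len(correction_pairs), max(1, int(len(correction_pairs) * rate)))
    return rng.sample(correction_pairs, k)
```

Because the draw is random rather than confidence-weighted, it also catches the dangerous case: labels the VLM was confidently wrong about and a rushed annotator rubber-stamped.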

Seen in
