PATTERN Cited by 2 sources
VLM evaluator as quality gate¶
Intent¶
Interpose a vision-language-model-based evaluator between the image-generation model and the user-facing output, so that only images passing a project-specific rubric reach the downstream consumer. Failed images feed back into the prompt-generator LLM for refinement rather than being discarded and re-sampled.
The pattern raises effective output quality without changing the underlying generator: it replaces a regime where the generator's stochastic distribution sets the quality bar with a regime of rubric-driven refinement.
Mechanism¶
Four-step loop (the Instacart PIXEL reference implementation — Source: sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform):
- Prompt-generator LLM produces a first-pass prompt from the application context + user input.
- Evaluation-question-generator LLM produces a project-specific rubric of yes/no evaluation questions ("does the image contain X?", "is the background warm neutral?", "is there non-food content?", etc.).
- VLM judge scores the generated image against each question. A pass-threshold ("N of M questions passed") gates the decision.
- On fail: failed-question text feeds back into the prompt-generator LLM to produce a revised prompt. Loop to step 1 until pass or round-budget exhausted.
```
user input ──► prompt-LLM ──► generator ──► VLM judge
                   ▲                            │
                   │                            ▼
                   │           ┌─────────────────────────────┐
                   └───────────┤ passed? yes → ship          │
                     failed    │         no → feed fails back│
                               └─────────────────────────────┘
```
Why this beats discard-and-retry¶
Discard-and-retry samples the generator's distribution again with the same prompt — progress depends on stochastic luck. Feeding failed-question text into the prompt shifts the distribution toward outputs that address the missing dimension. The judge's failure signal becomes the optimiser's gradient; the prompt generator is the optimiser.
Why VLM-scoring beats hard-coded metrics¶
Image-quality signals like CLIP-similarity, FID, or pixel-level comparisons can't capture project-specific rubric dimensions: "does the background match the product's retailer branding?", "is the product shown in the correct quantity for the listing?", "does the lighting match the rest of the carousel?" VLMs score these in natural language, against a rubric the platform team can evolve as new failure modes surface.
See also concepts/vlm-as-image-judge for the primitive.
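One way to carry rubric dimensions like these is as natural-language yes/no questions paired with an expected answer, so that a "yes" reply can mean fail for negatively-phrased questions. A minimal sketch, assuming a VLM judge that replies in free text; the question list, expected-answer encoding, and parse logic are illustrative, not the PIXEL rubric.

```python
# Illustrative rubric: (question, expected answer) pairs. The platform
# team can add or retire questions as new failure modes surface, with
# no metric code to change.
RUBRIC = [
    ("Does the image contain the listed product?", "YES"),
    ("Is the background warm neutral?", "YES"),
    ("Is there any non-food content?", "NO"),   # "yes" here means fail
]

def passes(question: str, expected: str, vlm_reply: str) -> bool:
    """Map a free-text VLM reply for one rubric question to pass/fail."""
    return vlm_reply.strip().upper().startswith(expected)
```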
Reported outcome (Instacart PIXEL)¶
"Since its creation, PIXEL has utilized vision language models as a feedback loop to improve our human judges approval rate of images from 20% to 85%. […] This provided a significant improvement in image quality while decreasing manual review efforts and cost."
That is a 4.25× increase in human-judge approval rate. Note: absolute human-judge agreement with the VLM judge is not disclosed; the 85% is end-to-end after iterative refinement.
Tradeoffs / gotchas¶
- Cost compounds with iteration count. Every refinement round is a generator call + VLM call + LLM prompt-refinement call. A 3-round median is ~3× single-shot cost.
- Round budget is load-bearing. Without a hard cap, the loop can consume unbounded inference on hopeless edge cases. See concepts/refinement-round-budget for the DS-STAR analog (10-round cap).
- VLM-judge alignment with humans. A VLM-judge that diverges from human preference optimises for itself. The platform must periodically measure VLM-vs-human agreement.
- Rubric evolution. New failure modes not in the rubric won't be caught. Rubric curation is long-running operational work.
- Not a safety proof. Passing the VLM rubric ≠ safe to ship — trademark / legal / brand checks still need out-of-band gates.
- Judge drift. VLM model updates invalidate historical approval-rate comparisons. Snapshot the judge version with each run.
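One way to make the judge-version snapshot concrete is a per-run record that pins both the judge model and the rubric version alongside the scores. Field names and version strings here are illustrative assumptions, not PIXEL's schema.

```python
import dataclasses
import datetime
import json

@dataclasses.dataclass
class JudgeRunRecord:
    """Per-run metadata that keeps approval-rate comparisons valid
    across judge updates and rubric edits (illustrative schema)."""
    image_id: str
    judge_model: str           # pinned judge version, not "latest"
    rubric_version: str        # the rubric evolves; pin it too
    questions_passed: int
    questions_total: int
    timestamp: str = dataclasses.field(
        default_factory=lambda: datetime.datetime.now(
            datetime.timezone.utc).isoformat())

record = JudgeRunRecord("img-123", "vlm-judge-2025-07-01", "rubric-v12", 7, 8)
print(json.dumps(dataclasses.asdict(record)))
```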
Relationship to text-side sibling¶
concepts/llm-as-judge + concepts/iterative-plan-refinement (DS-STAR's Verifier → Router → plan-refinement loop) are the text-output siblings of this pattern. Both share the "rubric-scored-output + failed-dimension-fed-back-into-optimiser" structure; the difference is the output modality and the judge model class (VLM vs. LLM).
patterns/drafter-evaluator-refinement-loop is the direct text/structured-output sibling of this pattern at the same loop layer. Lyft's AI localization pipeline implements the identical shape for machine translation: Drafter (image generator ⇔ translation drafter) → multi-candidate → rubric judge (VLM judge ⇔ reasoning LLM judge) → critique fed back → bounded retries. The two are canonical instances at different modalities; cross-reference them when reasoning about the pattern family.
Seen in¶
- sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform — canonical wiki instance. Instacart PIXEL ships the four-step loop in production; raised human-judge approval rate 20% → 85%.
- sources/2026-02-19-lyft-scaling-localization-with-ai — text-modality sibling reference. Lyft's AI localization pipeline implements the same loop shape for machine translation — Drafter (N=3 candidates) + Evaluator (4-dim rubric) + critique-fed-refinement (3-attempt cap). Detailed wiki entry lives under the direct sibling pattern patterns/drafter-evaluator-refinement-loop; cross-listed here as the modality-sibling pattern family.
Related¶
- concepts/vlm-as-image-judge — scoring primitive
- concepts/iterative-prompt-refinement — loop primitive
- concepts/llm-as-judge — text-side sibling
- concepts/refinement-round-budget — bounding mechanism
- concepts/self-approval-bias — the generator-vs-judge separation rationale
- patterns/unified-image-generation-platform — the platform this pattern lives inside
- patterns/prompt-template-library — the pre-VLM-loop defaults layer
- patterns/planner-coder-verifier-router-loop — DS-STAR text sibling
- patterns/drafter-evaluator-refinement-loop — direct text / structured-output modality sibling (Lyft AI localization)
- patterns/multi-candidate-generation — N-candidate subroutine used in both the VLM and LLM versions
- systems/instacart-pixel — canonical production instance
- systems/lyft-ai-localization-pipeline — text-translation modality instance
- companies/instacart
- companies/lyft