SYSTEM
GLIGEN (Grounded Language-to-Image Generation)¶
GLIGEN (Grounded Language-to-Image Generation) is a diffusion-based generative model that extends conventional text-to-image generation with structured spatial grounding. Alongside a natural-language prompt, callers provide bounding-box coordinates + per-box class labels, and GLIGEN produces a photorealistic image in which the specified objects appear at the specified positions.
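A minimal sketch of the grounded-generation call, assuming the Hugging Face `diffusers` integration (`StableDiffusionGLIGENPipeline`, which takes `gligen_phrases` + `gligen_boxes` as `[xmin, ymin, xmax, ymax]` in normalized [0, 1] coordinates); the checkpoint id and parameter values are illustrative, not prescribed by this page:

```python
from typing import List, Sequence


def validate_boxes(boxes: Sequence[Sequence[float]]) -> None:
    """Check boxes are [xmin, ymin, xmax, ymax] in normalized [0, 1] coords."""
    for b in boxes:
        if len(b) != 4:
            raise ValueError(f"box must have 4 coords, got {b!r}")
        x0, y0, x1, y1 = b
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box out of range or degenerate: {b!r}")


def generate_grounded(prompt: str, phrases: List[str], boxes: List[List[float]]):
    """Generate one image with each phrase placed inside its box.

    Requires the `diffusers` + `torch` packages and a GPU; imported lazily
    so the validation helpers above stay dependency-free.
    """
    if len(phrases) != len(boxes):
        raise ValueError("need exactly one phrase per box")
    validate_boxes(boxes)

    import torch
    from diffusers import StableDiffusionGLIGENPipeline

    # Checkpoint id is an assumption (the one used in diffusers' GLIGEN docs).
    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(
        prompt=prompt,
        gligen_phrases=phrases,          # one label per box
        gligen_boxes=boxes,              # normalized xyxy
        gligen_scheduled_sampling_beta=1.0,
        num_inference_steps=50,
    )
    return out.images[0]
```

Note the caller keeps `boxes` around after generation: those same coordinates double as the ground-truth annotations for the image.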
This enables auto-annotated training-data generation: because the caller specified where each object goes, the ground-truth bounding boxes are known by construction — no human labelling step is needed to produce a training-ready dataset.
Stub page — expand as more GLIGEN-internals sources land on the wiki.
Seen in¶
- sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai — deployed as SageMaker Batch Transform jobs producing 75,000-image PPE dataset (3 classes: person / hard hat / safety vest) + 75,000-image Housekeeping dataset (7 classes: pallet jack / go-cart / step ladder / trash can / safety cone / tote / pallet) at 512×512 resolution. Paired with parallel Python workers that convert GLIGEN output + embedded bounding-box ground truth into YOLO annotation format, then upload to S3 as training-ready datasets. Resulting YOLOv8 models hit 99.5% mAP@50 for PPE + 94.3% mAP@50 for Housekeeping without a single manually-annotated real image. Canonical realisation of patterns/synthetic-data-generation.
Related¶
- patterns/synthetic-data-generation — the broader pattern GLIGEN instantiates.
- systems/aws-sagemaker-batch-transform — the batch runtime AWS customers use to produce large-scale GLIGEN-generated datasets.
- systems/yolo — the canonical downstream detector trained on GLIGEN-generated bounding-box datasets.