AWS Architecture Blog — Automate safety monitoring with computer vision and generative AI

Summary

AWS Architecture Blog retrospective on a serverless, event-driven computer-vision + generative-AI safety-monitoring solution that continuously analyses fixed-camera feeds across distribution-center floors to detect PPE violations and housekeeping hazards. Scale target: 10,000+ cameras (validated via simultaneous processing of 10,000 images), end-to-end image-capture → notification in up to 37 s. Core architectural content: a serverless driver-worker pattern with one-worker-per-use-case for independent scaling + fault isolation; a real SageMaker Serverless → Serverful inference pivot at production scale (GPU unsupported + 6 GB memory ceiling → out-of-memory errors → migration to ml.g6 endpoints with auto scaling); multi-account AWS isolation separating training / image-collection / web-app / analytics environments; a four-stage intelligent alarm detection pipeline (object detection → zone-based analysis against "digital tape" → loiter-time persistence → multilayered validation with confidence thresholds + RLE mask comparison); per-camera-per-use-case risk aggregation to avoid alert fatigue; data-driven ground-truth curation at scale via Athena aggregating false-positive rates + Claude multimodal on Bedrock analysing misclassified samples; and a GLIGEN-based synthetic-data generation pipeline on SageMaker Batch Transform producing 75,000-image PPE + 75,000-image Housekeeping datasets with auto-embedded ground-truth annotations, reaching 99.5% mAP@50 for PPE + 94.3% mAP@50 for Housekeeping without a single manually-annotated real image.

Key takeaways

  1. Serverless driver-worker pattern is the scaling substrate. The system implements a driver-worker decomposition where the driver orchestrates work distribution while workers process images concurrently, "enabling independent scaling of different components while providing fault isolation. If one worker fails, it doesn't impact the entire pipeline." Each use case (PPE detection, Housekeeping) runs its own worker pipeline through its own SNS topic + SQS queue + SageMaker endpoint — allowing per-use-case capacity + release cadence. Canonical instance of patterns/serverless-driver-worker. The ML inference layer acts as an intelligent gatekeeper: "Rather than flooding downstream components with every captured image, the inference layer only surfaces images where safety issues have been detected. This filtering is essential because it prevents components interacting with Amazon Aurora PostgreSQL from being overwhelmed by the raw volume of image data from hundreds of sites."
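
A minimal sketch of the driver-worker shape described above. All names (`Frame`, `driver`, `worker`, `run_inference`) are illustrative assumptions, not identifiers from the post; the publish callback stands in for the per-use-case SNS fan-out:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    camera_id: str
    image_key: str  # S3 key of the captured frame (hypothetical field)

def driver(frames, use_cases, publish):
    """Driver: fan each captured frame out to one SNS topic per use case,
    so PPE and Housekeeping workers scale and fail independently."""
    for uc in use_cases:
        for frame in frames:
            publish(topic=f"safety-{uc}", message=frame)

def worker(batch, run_inference):
    """Worker: invoke the use case's SageMaker endpoint, then act as the
    gatekeeper -- only frames with detected violations flow downstream,
    shielding the Aurora-facing components from the raw image volume."""
    flagged = []
    for frame in batch:
        detections = run_inference(frame)
        if detections:  # clean frames are dropped right here
            flagged.append((frame, detections))
    return flagged
```

The gatekeeper filter in `worker` is the load-shedding point: downstream fan-out scales with violations found, not with cameras deployed.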

  2. SageMaker Serverless → Serverful pivot under production load. "Initially, we deployed SageMaker Serverless inference endpoints with approximately 50 cameras. However, as we scaled to processing images from hundreds of sites, we encountered critical limitations: SageMaker Serverless inference lacked GPU support and imposed a 6 GB maximum memory configuration, leading to out-of-memory errors." Fix: pivot to SageMaker Serverful inference endpoints configured with ml.g6 family instances + auto-scaling policies + AWS service-team collaboration to raise limits for thousands of concurrent Lambda executions + memory-allocation + multithreading + tuned SQS batch sizes.
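
The pivot could be expressed as kwargs for the relevant AWS APIs. This is a hedged sketch: instance size, counts, and the 70% target are illustrative choices, not values from the post; the dicts feed `sagemaker.create_endpoint_config` and `application-autoscaling.put_scaling_policy` respectively:

```python
def endpoint_config(model_name: str, variant: str = "AllTraffic") -> dict:
    """Kwargs for sagemaker.create_endpoint_config: GPU-backed ml.g6
    instances replace the CPU-only, 6 GB-capped Serverless setup."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": variant,
            "ModelName": model_name,
            "InstanceType": "ml.g6.xlarge",   # ml.g6 family, per the post
            "InitialInstanceCount": 2,        # illustrative
        }],
    }

def scaling_policy(endpoint: str, variant: str = "AllTraffic") -> dict:
    """Kwargs for application-autoscaling.put_scaling_policy: track
    invocations-per-instance so the fleet grows with camera volume."""
    return {
        "PolicyName": f"{endpoint}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 70.0,  # illustrative target, not from the post
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
```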

  3. Multi-account AWS isolation as the security / blast-radius primitive. Training pipeline, image-collection infrastructure, end-user web application, and BI analytics account each run in distinct AWS accounts with appropriate access controls + data isolation. Raw images land in an access-restricted account and are purged within days after Rekognition face detection + custom Python overlay blur anonymises them; anonymised copies are replicated across accounts for their specific downstream purpose (patterns/multi-account-isolation).

  4. Four-stage intelligent alarm detection pipeline. Stage 1: object detection (YOLO-based for PPE, custom CV model for equipment / materials; the model distinguishes visible outline from floor footprint). Stage 2: zone-based analysis over "digital tape" — predefined zones marked by floor tapes; the system calculates percentage overlap between each detected object's footprint and these restricted zones against a configurable threshold (typically 50%) to filter edge cases. Stage 3: loiter-time algorithm — tracks same object across consecutive minute-by-minute intervals via mask similarity algorithms, building a "replication count" showing how many consecutive minutes an object has persisted in violation; different object types + risk zones have distinct acceptable loiter times (high-risk areas enforce shorter thresholds). Stage 4: multilayered validation — confidence thresholds filter low-certainty detections based on object-type complexity + Run-Length Encoding (RLE) mask comparison verifies tracked-object consistency across intervals rather than different objects appearing in similar positions. Composed into patterns/multilayered-alarm-validation as a reusable shape.
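
Stages 2–4 compose mechanically. A hedged sketch under simplifying assumptions (axis-aligned boxes for footprints, RLE as `(start, length)` runs over a flattened image, thresholds picked for illustration):

```python
def overlap_pct(box, zone):
    """Stage 2: fraction of an object's floor-footprint box lying inside
    a taped zone. Boxes/zones are (x1, y1, x2, y2)."""
    ix = max(0, min(box[2], zone[2]) - max(box[0], zone[0]))
    iy = max(0, min(box[3], zone[3]) - max(box[1], zone[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (ix * iy) / area if area else 0.0

def rle_decode(rle):
    """Stage 4 helper: expand [(start, length), ...] runs into the set
    of covered pixel indices."""
    pixels = set()
    for start, length in rle:
        pixels.update(range(start, start + length))
    return pixels

def mask_iou(rle_a, rle_b):
    """Stage 4: IoU of two RLE masks -- high IoU across intervals means
    the *same* object persisted, not a different object nearby."""
    a, b = rle_decode(rle_a), rle_decode(rle_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def loiter_alarm(track, zone, zone_threshold=0.5, iou_threshold=0.8,
                 max_minutes=3):
    """Stages 2-4 over minute-by-minute observations (box, rle_mask):
    raise only when the RLE-verified same object stays in violation
    beyond the zone's acceptable loiter time (the replication count)."""
    replication, prev_mask = 0, None
    for box, mask in track:
        in_zone = overlap_pct(box, zone) >= zone_threshold
        same_obj = prev_mask is None or mask_iou(mask, prev_mask) >= iou_threshold
        replication = replication + 1 if (in_zone and same_obj) else 0
        prev_mask = mask
    return replication > max_minutes
```

Per the post, `zone_threshold` (typically 50%) and `max_minutes` would be configured per object type and risk zone, with high-risk areas getting shorter loiter allowances.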

  5. Per-camera-per-use-case risk aggregation. "This function intelligently aggregates risks per camera per use case to avoid alert fatigue. Instead of bombarding safety teams with duplicate notifications, the system appends new occurrences to existing open risks." Every minute, a scheduled Lambda checks whether risks still appear in the latest camera images; resolved violations auto-close; another scheduled function checks SLA exhaustion + escalates through preferred channels (Slack, email, ticket management). Introduces concepts/alert-fatigue + patterns/alarm-aggregation-per-entity.
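
The roll-up plus the minute-cadence sweep might be sketched as follows; the dict-backed store and field names are assumptions (the real system persists risks in DynamoDB/Aurora):

```python
def upsert_risk(open_risks, camera_id, use_case, occurrence):
    """Append to an existing open risk for (camera, use case) instead of
    opening a duplicate -- the anti-alert-fatigue roll-up."""
    key = (camera_id, use_case)
    risk = open_risks.get(key)
    if risk is None:
        risk = {"occurrences": [], "status": "open"}
        open_risks[key] = risk
    risk["occurrences"].append(occurrence)
    return risk

def sweep(open_risks, still_present):
    """Scheduled every minute: auto-close risks whose violation no longer
    appears in the latest frame; still_present(key) consults the newest
    camera image for that risk."""
    for key, risk in open_risks.items():
        if risk["status"] == "open" and not still_present(key):
            risk["status"] = "resolved"
```

A second scheduled function (not shown) would walk the still-open risks, compare age against SLA, and escalate through Slack / email / ticketing.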

  6. Data-driven ground-truth curation at scale. Original approach — daily annotation jobs per site — became untenable at hundreds of geographically distributed sites. Fix: "We fundamentally reimagined our workflow by using Amazon Athena to query and analyze massive volumes of inference results combined with customer feedback data at scale. We identified underperforming segments by aggregating false positive rates across camera types and deployment conditions, prioritizing retraining on image sources with elevated error rates. We also surfaced inferences where model confidence scores fell below established thresholds, flagging these uncertain predictions for targeted annotation and review. We further augmented this analysis with Claude multi-modal LLMs on Amazon Bedrock to analyze misclassified samples and detect underrepresented object classes in our existing training distribution." Canonical patterns/data-driven-annotation-curation — "intelligent, performance-driven curation" replacing "blanket sampling."
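
A hypothetical shape for the Athena side of this curation. Table and column names are invented for illustration (the post does not disclose the schema); the SQL string would be submitted via Athena's `StartQueryExecution`:

```python
# Hypothetical Athena (Presto) SQL: aggregate false-positive rates per
# camera type / deployment condition and count low-confidence inferences.
CURATION_QUERY = """
SELECT camera_type,
       deployment_condition,
       COUNT_IF(feedback = 'false_positive') * 1.0 / COUNT(*) AS fp_rate,
       COUNT_IF(confidence < 0.5) AS low_confidence_count
FROM inference_results
JOIN customer_feedback USING (inference_id)
GROUP BY camera_type, deployment_condition
ORDER BY fp_rate DESC
"""

def prioritize(rows, fp_cutoff=0.10):
    """Keep only segments whose false-positive rate exceeds the cutoff;
    these image sources are queued for targeted annotation/retraining
    instead of blanket per-site sampling."""
    return [r for r in rows if r["fp_rate"] > fp_cutoff]
```

The misclassified samples surfaced this way are what the Claude multimodal pass on Bedrock then analyses for underrepresented classes.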

  7. GLIGEN-based synthetic data generation via SageMaker Batch Transform. Motivations: floor-spill detection where "despite examining and annotating over half a million images, only a few hundred examples of liquid spills or debris on walkways were identified" (rare-event class imbalance) + PPE color diversity (training data dominated by single color; workers may wear acceptable PPE in different colors → detection blind spots). GLIGEN (Grounded Language-to-Image Generation) deployed as SageMaker Batch Transform jobs receiving structured bounding-box inputs + generating photorealistic 512×512 facility scenes with ground-truth annotations automatically embedded — converted to YOLO annotation format via parallel Python workers. Produced 75,000-image PPE dataset (person / hard hat / safety vest) + 75,000-image Housekeeping dataset (pallet jack / go-cart / step ladder / trash can / safety cone / tote / pallet). YOLOv8 trained on SageMaker AI with PyTorch 2.1 + cosine learning-rate scheduling + AdamW optimization ("critical for stabilizing the larger YOLOv8l model variant and preventing gradient divergence during training"). Introduces patterns/synthetic-data-generation + systems/gligen.
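
The annotation-conversion step is the most mechanical piece: GLIGEN's grounding boxes come back in pixel coordinates on the 512×512 canvas, and YOLO wants normalized `class cx cy w h` lines. A minimal sketch of what one of the parallel Python workers would do (box layout assumed as `(class_id, (x1, y1, x2, y2))`):

```python
def to_yolo(boxes, size=512):
    """Convert pixel-space (x1, y1, x2, y2) grounding boxes into YOLO
    annotation lines: class-id, then center-x, center-y, width, height,
    all normalized to [0, 1] by the 512x512 canvas size."""
    lines = []
    for cls, (x1, y1, x2, y2) in boxes:
        cx = (x1 + x2) / 2 / size
        cy = (y1 + y2) / 2 / size
        w = (x2 - x1) / size
        h = (y2 - y1) / size
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return "\n".join(lines)
```

Because the boxes are inputs to GLIGEN rather than outputs of a detector, the ground truth is exact by construction, which is what lets the 75,000-image datasets ship with zero manual annotation.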

  8. Training-pipeline / model-promotion decoupling. A GT Job Step Functions workflow triggered by EventBridge on configurable cadence creates SageMaker Ground Truth labeling jobs; post-processing Lambda transforms output + stores job metadata in DynamoDB. Data scientists then trigger SageMaker AI Pipelines (7 steps: checkpoint loading → data prep+split → training → drift baseline → evaluation → packaging → registration). Model approval fires an EventBridge event that triggers a model-promotion Lambda which opens a code review against the application infrastructure repo to update the S3 URI of the model used for the SageMaker endpoint. Decouples science + application updates cleanly: scientists approve when metrics meet criteria, software engineers merge + manage endpoint updates via normal CI/CD. Approved checkpoints seed future retraining runs, enabling incremental improvements without frequent long-running training jobs.
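
The promotion Lambda's core logic could look like this sketch. The event shape follows SageMaker's "Model Package State Change" EventBridge events; the repo-update side (`open_pull_request`) is a stub, since the post only says a code review is opened against the infrastructure repo:

```python
def handle_approval(event, open_pull_request):
    """Model-promotion Lambda: on an Approved model package, extract the
    model artifact S3 URI and open a code review that repoints the
    SageMaker endpoint's model -- engineers merge via normal CI/CD."""
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        return None  # only approved packages are promoted
    model_uri = detail["InferenceSpecification"]["Containers"][0]["ModelDataUrl"]
    return open_pull_request(
        title=f"Promote {detail['ModelPackageGroupName']} model",
        change={"endpoint_model_s3_uri": model_uri},  # illustrative key
    )
```

This is the decoupling point: scientists gate on metrics by flipping approval status; nothing touches the serving stack until the review merges.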

  9. Tape-labeling synthetic-composite preparation. For Housekeeping the system must understand spatial relationships between detected objects + 5S floor tapes — but "floor tapes frequently become obscured by equipment, materials, and personnel movement throughout the day. This occlusion makes it difficult for human annotators to accurately identify and label the tape boundaries when onboarding new cameras." Fix: hourly Step Functions workflow analyses multiple camera frames captured at different times + their corresponding object-detection predictions; a voting mechanism identifies pixel regions with no detected objects and stitches these clear portions together into a composite image where tapes are fully visible. Tape-overlay UI illustrations generated via Amazon Nova.

  10. Reported accuracy + latency + scale. PPE model 99.5% mAP@50 with 100% precision + 100% recall across all three classes, trained entirely on GLIGEN-synthetic data. Housekeeping model 94.3% mAP@50 / 91.4% precision / 86.9% recall across seven facility-object classes, again without a single manually-annotated real image. End-to-end latency up to 37 seconds from image capture to Zone Operator notification. Scale validated via simultaneous processing of 10,000 images (one frame per camera) across a 10,000+ camera fleet.
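
The voting mechanism in the tape-composite preparation (item 9) can be sketched at grid-cell granularity; the cell representation and names here are illustrative assumptions (the real pipeline works over pixel regions derived from object-detection masks):

```python
def build_composite(frames, cells):
    """frames: {frame_id: set of occluded grid cells for that capture};
    cells: every grid cell on the camera's floor plane.
    Vote per cell: pick any frame in which that cell had no detected
    object, then stitch those clear cells into one composite where the
    floor tapes are fully visible for annotators."""
    composite = {}
    for cell in cells:
        for frame_id, occluded in frames.items():
            if cell not in occluded:
                composite[cell] = frame_id  # take this frame's pixels
                break
    return composite  # cells absent here were occluded in every frame
```

Hourly captures make it likely every cell is clear in at least one frame, so the composite converges even on busy floors.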

Systems named

  • systems/aws-lambda — post-processing, model promotion, risk aggregation, scheduled risk-resolution + SLA checks, tape-labeling preparation. Named failure mode: SageMaker Serverless inference's 6 GB memory + no-GPU constraint forcing a pivot to ml.g6 Serverful endpoints.
  • systems/aws-sns + systems/aws-sqs — per-use-case queue isolation. Each SageMaker endpoint is invoked by a dedicated SQS queue fed by SNS; failures go to use-case-specific Dead Letter Queues for later analysis / re-drive.
  • systems/aws-s3 — raw-image ingest (access-restricted account, auto-purged within days), anonymised-image replication across accounts, annotation storage, violation artifacts, synthetic-dataset staging, Redshift Spectrum query substrate.
  • systems/amazon-rekognition — face detection for automatic anonymisation; custom Python overlay blurs detected faces.
  • systems/aws-sagemaker-ai — training pipeline (7-step SageMaker AI Pipelines), inference endpoints (Serverful ml.g6), Batch Transform for GLIGEN synthetic-data generation, Ground Truth for labelling jobs.
  • systems/aws-sagemaker-endpoint — Serverful ml.g6 family endpoints with auto-scaling policies for per-use-case inference.
  • systems/aws-sagemaker-ground-truth — labelling job substrate for manual + expert annotation.
  • systems/aws-sagemaker-batch-transform — synthetic-data generation runtime for GLIGEN + YOLO annotation format conversion.
  • systems/aws-step-functions — GT job creation workflow + hourly tape-labeling preparation workflow.
  • systems/amazon-eventbridge — cadence triggers for GT job creation + model-approval → promotion Lambda; EventBridge schedule triggers risk-resolution + SLA checks every minute.
  • systems/dynamodb — GT job metadata, structured violation records for fast queries, inference-processing state management.
  • systems/amazon-aurora — PostgreSQL-backed application state; inference gatekeeper filters image volume to prevent Aurora overload.
  • systems/amazon-cloudfront + systems/aws-appsync + custom resolvers — React web-app distribution + API backed by Lambda resolvers embedding Amazon QuickSight analytics + CRUD operations.
  • systems/amazon-route53 + systems/aws-waf — DNS + web vulnerability protection.
  • systems/amazon-quicksight — safety-manager + operations-lead dashboards over risk data; compare performance across facilities.
  • systems/amazon-redshift-spectrum — BI team queries risk data in S3 without ETL movement.
  • systems/amazon-athena — data-driven annotation curation: aggregate false-positive rates across cameras + deployment conditions, surface below-threshold-confidence inferences for targeted annotation.
  • systems/amazon-bedrock — Claude multi-modal LLMs analyse misclassified samples + detect underrepresented object classes; Nova generates tape-labelling UI illustrations.
  • systems/gligen — diffusion-based Grounded Language-to-Image Generation model; deployed on SageMaker Batch Transform; produces photorealistic 512×512 facility scenes with ground-truth bounding-box annotations embedded in output.
  • systems/yolo — computer-vision model family (YOLOv8 / YOLOv8l) used for PPE detection + Housekeeping; trained with PyTorch 2.1 + cosine LR scheduling + AdamW.

Concepts surfaced

Patterns

  • patterns/serverless-driver-worker — canonical instance; driver orchestrates work distribution, workers process images concurrently, independent scaling + fault isolation per component, one worker failure doesn't impact the pipeline; inference layer filters downstream fan-out.
  • patterns/multilayered-alarm-validation — four-stage composition (object detection → zone overlap → loiter persistence → confidence + RLE-mask validation) that turns per-frame detections into auditable alerts.
  • patterns/alarm-aggregation-per-entity — per-camera-per-use-case roll-up; auto-close on resolution; SLA escalation; flexible channel routing (Slack / email / ticket).
  • patterns/data-driven-annotation-curation — replace blanket per-site daily annotation with Athena-driven FP-rate aggregation + below-threshold-confidence sampling + Claude multi-modal analysis of misclassified samples for class imbalance.
  • patterns/synthetic-data-generation — GLIGEN + SageMaker Batch Transform + structured bounding-box inputs + auto-embedded annotations; produces 75,000-image datasets per use case without manual labelling; critical for rare events + class diversity.
  • patterns/multi-account-isolation — training / image-collection / web-app / analytics each in separate accounts with appropriate access controls + data isolation; image auto-purge within days in the restricted-access ingest account.

Operational numbers

  • 10,000+ cameras validated via simultaneous processing of 10,000 images (one frame per camera at a time).
  • Up to 37 s end-to-end from image capture to Zone Operator notification.
  • PPE model: 99.5% mAP@50 / 100% precision / 100% recall across three classes (person / hard hat / safety vest) — trained entirely on GLIGEN synthetic data.
  • Housekeeping model: 94.3% mAP@50 / 91.4% precision / 86.9% recall across seven classes (pallet jack / go-cart / step ladder / trash can / safety cone / tote / pallet) — also no manually-annotated real images.
  • 75,000-image PPE synthetic dataset + 75,000-image Housekeeping synthetic dataset, 512×512 resolution.
  • ~500,000 real images annotated for floor-spill class → only a few hundred real spill examples → forcing function for synthetic data.
  • 50-camera PoC scale broke SageMaker Serverless; hundreds of sites exposed the 6 GB memory + no-GPU ceilings.
  • Typical zone-overlap threshold: 50% for violation flagging.
  • Loiter tracking: minute-by-minute intervals.
  • Risk-resolution check cadence: every minute (scheduled Lambda).
  • BI export cadence: every hour (flexible).
  • Image retention: raw images auto-purged within days after anonymisation.

Caveats

  • Warehouse / distribution-center deployment examples only — the authors explicitly frame the architecture as industry-agnostic (manufacturing / clean rooms / construction) but those extensions are future work, not shipped.
  • Accuracy numbers are on synthetic-only training + synthetic-image testing in the reported headline; the authors acknowledge accuracy "can be further improved by increasing the volume of training images used to build and train the custom model" and that manual annotation remains important for unique site conditions + new use cases.
  • No disclosure of Aurora cluster size, Lambda concurrent-execution quota ceilings actually raised to, SageMaker ml.g6 fleet count, per-endpoint QPS, SageMaker AI Pipelines per-step runtime, or GLIGEN inference cost per image on Batch Transform.
  • No incident retrospective or failure-mode taxonomy for the DLQs beyond the note that failed messages can later be analyzed or re-driven.
  • Agentic / near-real-time extensions beyond "up to 37 s" not explored; the architecture is near-real-time not real-time.
  • Marketing-leaning AWS Architecture Blog format — architecture is well-described but production incident baselines, cost per prediction, and comparison against alternatives (non-GLIGEN synthetic, manual-only ground truth) are not disclosed.