PATTERN
Data-driven annotation curation¶
Replace blanket per-site daily annotation with intelligent, performance-driven curation that directs human labelling effort only where it would most improve the model. A step-change once blanket sampling scales past the annotation team's capacity.
Shape¶
Three signals composed into the curation pipeline:
- False-positive-rate aggregation across conditions. Query inference results + customer feedback via Amazon Athena over S3-backed logs; bucket by camera type + deployment conditions + other dimensions; prioritise retraining on image sources with elevated error rates.
- Low-confidence sampling. Surface inferences whose model confidence scores fall below established thresholds and flag these uncertain predictions for targeted annotation — directs human time toward cases near the decision boundary, which teach the model the most per label.
- Multi-modal LLM analysis of misclassified samples. Use Claude (or similar multi-modal LLM) on Amazon Bedrock to analyse misclassified examples + detect underrepresented object classes in the existing training distribution. Output: a class-imbalance map that directly informs the next data-collection / synthetic-data priorities.
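In spirit, the Athena aggregation behind the first signal reduces to a grouped false-positive rate with a volume floor. A minimal pure-Python sketch (the record schema, field names, and thresholds are illustrative assumptions, not from the source):

```python
from collections import defaultdict

def fp_rate_by_bucket(records, min_volume=50):
    """Aggregate false-positive rates across (camera_type, conditions) buckets.

    `records` stand in for joined inference results + customer feedback rows:
    dicts with camera_type, conditions, and is_false_positive fields.
    Buckets with fewer than `min_volume` rows are dropped as noise.
    """
    totals = defaultdict(int)
    fps = defaultdict(int)
    for r in records:
        key = (r["camera_type"], r["conditions"])
        totals[key] += 1
        fps[key] += int(r["is_false_positive"])
    return {
        key: fps[key] / totals[key]
        for key in totals
        if totals[key] >= min_volume
    }

def underperforming_segments(rates, threshold=0.10):
    """Return buckets whose FP rate exceeds the retraining threshold,
    worst first — these image sources get prioritised for retraining."""
    return sorted(
        (k for k, v in rates.items() if v > threshold),
        key=lambda k: rates[k],
        reverse=True,
    )
```

In production this grouping would run as SQL in Athena over the S3-backed logs rather than in Python; the sketch only shows the shape of the aggregation.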
Output feeds a SageMaker Ground Truth labelling job-generation workflow that now creates targeted jobs rather than a blanket one-job-per-site-per-day schedule.
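Composed, the signals could feed job generation along these lines. A sketch only — the function names, the confidence floor, and the queue shape are hypothetical, and this is not the SageMaker Ground Truth API:

```python
def select_for_annotation(inferences, confidence_floor=0.6, hot_buckets=()):
    """Build a targeted annotation queue from inference logs.

    Two of the three signals are applied here: low-confidence sampling
    (scores below `confidence_floor` sit near the decision boundary) and
    membership in an underperforming bucket from the FP-rate analysis.
    Returns (image_id, reason) pairs, de-duplicated, boundary cases first.
    """
    queue = {}
    for inf in inferences:
        bucket = (inf["camera_type"], inf["conditions"])
        if inf["confidence"] < confidence_floor:
            queue[inf["image_id"]] = "low_confidence"
        elif bucket in hot_buckets:
            queue.setdefault(inf["image_id"], "underperforming_segment")
    return sorted(queue.items(), key=lambda kv: kv[1])

def to_labelling_jobs(queue, batch_size=100):
    """Chunk the queue into batches — one targeted labelling job per batch,
    replacing the old one-job-per-site-per-day blanket schedule."""
    items = [image_id for image_id, _ in queue]
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each batch would then be submitted as a Ground Truth labelling job; the point of the sketch is that job creation is driven by the queue, not by the calendar.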
Why it works¶
- Sustainability: the labelling team's capacity stops being the scaling constraint; per-site growth doesn't linearly grow annotation headcount.
- Training efficiency: labels at the decision boundary + on underperforming segments + on underrepresented classes carry more gradient per label than a random site sample.
- Compounding with synthetic data: the class-imbalance map from LLM analysis directly informs synthetic-data generation priorities (patterns/synthetic-data-generation); rare classes get synthetic augmentation, annotation budget goes to real-world edge cases.
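The class-imbalance map and the routing decision in the last bullet can be sketched as follows — a hypothetical post-processing step on the LLM analysis output, with made-up class names and cutoff:

```python
def class_imbalance_map(training_counts, observed_counts):
    """Compare class frequency in the training set against class frequency
    the LLM analysis observes among misclassified production samples.

    Returns a per-class representation ratio (training share / observed
    share); values well below 1.0 mark underrepresented classes.
    """
    train_total = sum(training_counts.values())
    obs_total = sum(observed_counts.values())
    return {
        cls: (training_counts.get(cls, 0) / train_total)
             / (observed_counts[cls] / obs_total)
        for cls in observed_counts
    }

def route_collection(imbalance, synthetic_cutoff=0.5):
    """Split the map into the two downstream levers: badly underrepresented
    classes go to synthetic-data generation; the annotation budget stays
    on real-world edge cases for the rest."""
    synthetic = [c for c, r in imbalance.items() if r < synthetic_cutoff]
    real = [c for c in imbalance if c not in synthetic]
    return {"synthetic": sorted(synthetic), "real_annotation": sorted(real)}
```

The ratio is one plausible way to quantify "underrepresented"; the source only says the LLM analysis detects such classes, not how the map is scored.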
When to apply¶
- Inference results + customer feedback signals exist at scale (FP marking, missed-detection reports).
- Model is already serving traffic, so confidence scores + labels are joinable.
- Annotation team has become the bottleneck on model-improvement velocity.
Seen in¶
- sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai — canonical wiki instance. Original approach (daily annotation jobs per site) became "untenable" at hundreds of geographically distributed sites. "We fundamentally reimagined our workflow by using Amazon Athena to query and analyze massive volumes of inference results combined with customer feedback data at scale. We identified underperforming segments by aggregating false positive rates across camera types and deployment conditions… We also surfaced inferences where model confidence scores fell below established thresholds… We further augmented this analysis with Claude multi-modal LLMs on Amazon Bedrock to analyze misclassified samples and detect underrepresented object classes in our existing training distribution."
Related¶
- concepts/relevance-labeling — the label-provenance framing.
- patterns/human-calibrated-llm-labeling + patterns/behavior-discrepancy-sampling — Dropbox Dash sibling patterns for retrieval-quality labelling.
- patterns/synthetic-data-generation — the downstream lever once class imbalance is identified.