Capsight¶
Capsight is Instacart's end-to-end ML data platform for the Caper smart-cart fleet. It turns the deployed fleet into a distributed data-collection system that improves its own computer-vision and multi-sensor models over time: an edge→cloud data flywheel (Collect → Manage → Label → Train → Deploy) that closes on itself.
Why it exists¶
Caper is a stability-critical hardware product (a crash or misread leads to cart abandonment — the same engineering stance documented in the Caper Android migration on companies/instacart). Its accuracy depends on models trained to handle real-world in-store conditions: variable lighting, partial occlusion, damaged packaging, motion blur, unusual camera angles, and store-specific SKUs. Pre-Capsight, models were trained primarily on manually collected data that did not reflect the production distribution ([[concepts/production-data-diversity]]); the team had little observability into what carts actually experienced; and the full collect → label → train → release loop took about a month, so iteration cost scaled linearly with deployment size — explicitly called out as unacceptable.
Three components¶
Capsight Collector (edge)¶
An on-device agent that runs on every cart. Captures a synchronised multi-modal view (camera + weight scale + location) of every interaction. Engineered for zero performance regression on the cart's primary AI tasks and no impact on retailer store networks.
Key design choices:
- Trigger-based capture. Does not stream all data to the cloud. Captures only when a composite signal fires — currently activity signal (e.g. hand motion) + recognised barcode. More triggers in development. Canonical instance of patterns/trigger-based-edge-capture.
- Dedicated hardware video encoder. The cart's AI inference and the collector share the device; video encoding is offloaded to dedicated hardware so capture contributes zero cycles to the AI inference path. See concepts/hardware-offload.
- Dedicated protocol for weight + location. Designed to avoid performance degradation in the non-video sensor streams.
- Local-disk buffer + resilient uploader. Captured data is stored locally first; the uploader "carefully manages upload timing and bandwidth to avoid any impact on retailer operations or network performance"; includes a storage-threshold check that pauses collection if disk fills, and an auto-cleanup that drops oldest files if upload fails. Canonical instance of patterns/resilient-edge-uploader.
- Multi-modal platform by design. Phase-1 prioritises high-value CV data; the Collector is already integrated with scale and location streams and will expand.
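The trigger and local-buffer behaviour described above can be sketched in a few dozen lines. This is a minimal illustration, not Instacart's implementation: the class names, the 2-second co-occurrence window, and the 90% storage threshold are all assumptions.

```python
import collections


class TriggerGate:
    """Composite capture trigger: fire only when an activity signal
    (e.g. hand motion) and a recognised barcode co-occur within a
    short window. The window length is illustrative."""

    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.last_activity = None  # timestamp of last activity signal
        self.last_barcode = None   # timestamp of last barcode read

    def on_activity(self, now):
        self.last_activity = now
        return self._fires(now)

    def on_barcode(self, now):
        self.last_barcode = now
        return self._fires(now)

    def _fires(self, now):
        # Both signals must have fired recently for capture to start.
        return (
            self.last_activity is not None
            and self.last_barcode is not None
            and now - self.last_activity <= self.window_s
            and now - self.last_barcode <= self.window_s
        )


class LocalBuffer:
    """Local-disk buffer with storage self-protection: pause collection
    when usage crosses a threshold, and drop the oldest files when
    uploads keep failing. Sizes are tracked in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = collections.deque()  # (name, size), oldest first

    def can_collect(self):
        # Illustrative 90% threshold: stop capturing before disk fills.
        return self.used < 0.9 * self.capacity

    def add(self, name, size):
        self.files.append((name, size))
        self.used += size

    def evict_oldest(self):
        # Auto-cleanup path: on persistent upload failure, free space
        # by dropping the oldest captured file first.
        if not self.files:
            return None
        name, size = self.files.popleft()
        self.used -= size
        return name
```

A real agent would additionally pace uploads against network conditions (the resilient-uploader half of the pattern); that policy is retailer-network-specific and not published.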
Capsight Depot (cloud)¶
A centralised cloud data platform that ingests raw sensor packages from every Collector and produces curated, labelled training datasets.
Stages:
- Ingestion + processing. Distributed pipeline ingests raw files, extracts metadata, runs quality checks.
- Indexing + search + visualisation. All data + metadata securely stored, indexed, semantically enriched; a web UI lets engineers filter by metadata and jump directly to the matching video clips and logs. This is the observability substrate for the fleet — engineers can reproduce anything a cart experienced.
- AI-accelerated annotation. The load-bearing innovation:
    - Filter out empty-background images automatically.
    - Run a Vision-Language Model plus Instacart's internal teacher models to generate high-quality pre-labels for items + barcodes.
    - Send pre-labelled images to human annotators for correction, not from-scratch creation.
    - Projected >70% annotation cost reduction; multi-day tasks → hours.
    - The same pipeline is used to clean errors from historical ground-truth data. Canonical instance of patterns/vlm-assisted-pre-labeling. Uses the teacher-model idiom from [[concepts/knowledge-distillation]] applied to labelling rather than online inference.
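The annotation flow above reduces to a three-stage pipeline: filter, pre-label, human-correct. A minimal sketch, assuming hypothetical `vlm`, `teacher`, and `annotate` callables (the post does not name the actual models or interfaces):

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A captured image plus model-proposed labels (names hypothetical)."""
    id: str
    is_empty_background: bool = False
    prelabels: list = field(default_factory=list)


def prelabel_pipeline(frames, vlm, teacher, annotate):
    """Sketch of the Depot annotation flow:
    1. drop empty-background frames automatically;
    2. generate pre-labels with a VLM plus internal teacher models;
    3. hand pre-labelled frames to humans for *correction* only.
    """
    kept = [f for f in frames if not f.is_empty_background]
    for f in kept:
        # Union of VLM and teacher-model proposals (boxes/classes).
        f.prelabels = vlm(f) + teacher(f)
    # Humans verify or fix proposals instead of labelling from scratch —
    # this is where the projected >70% cost reduction comes from.
    return [annotate(f) for f in kept]
```

The cost win is structural: human time is spent only on correction of the frames that survive filtering, not on drawing every label.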
Capsight Learner (cloud)¶
A distributed, Ray-based training platform that consumes curated Depot datasets to train new model versions. Includes an automated evaluation pipeline that benchmarks each candidate against standardised test sets — only validated improvements ship to production carts.
Measured impact of Learner specifically: model training stage dropped from one week to two days, with the rest of the speedup coming from Depot's faster labelling.
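The promotion gate implied above ("only validated improvements ship") can be stated concretely. Production uses Ray for distributed training; this plain-Python sketch shows only the evaluation gate, with an acceptance rule (beat-or-match the baseline on every benchmark) that is an assumption — the real criteria are not published:

```python
def passes_gate(candidate, baseline, benchmarks):
    """Ship a candidate only if it matches or beats the current
    production model on every standardised benchmark (illustrative rule).
    `candidate` and `baseline` map benchmark name -> score."""
    return all(candidate[b] >= baseline[b] for b in benchmarks)


def select_for_deploy(candidates, baseline, benchmarks):
    """Return names of validated improvements; reject everything else."""
    return [
        name
        for name, metrics in candidates.items()
        if passes_gate(metrics, baseline, benchmarks)
    ]
```

For example, a candidate that improves one benchmark while regressing another is rejected — no partial wins ship to carts.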
Operational outcomes¶
Against the pre-Capsight baseline:
| Metric | Before | After |
|---|---|---|
| Annotation cost | fully manual, from-scratch labelling | >70% lower (VLM pre-label + human correct) |
| Labelling task time | multi-day | hours |
| Model training stage | 1 week | 2 days |
| End-to-end iteration cycle | ~1 month | 1 week |
| Model accuracy improvement | — | >5% within weeks of Capsight deployment |
Design ideas distilled¶
- concepts/edge-cloud-data-flywheel — the fleet improves its own models; iteration cost should not grow linearly with fleet size.
- concepts/production-data-diversity — only production data captures the real input distribution.
- patterns/distributed-fleet-as-data-pipeline — every deployed cart is also a data-collection node.
- patterns/trigger-based-edge-capture — only capture when a meaningful event is detected.
- patterns/resilient-edge-uploader — local buffer + bandwidth shaping + storage self-protection + upload-failure cleanup.
- patterns/vlm-assisted-pre-labeling — VLM + teacher models generate pre-labels; humans correct.
- concepts/hardware-offload — dedicated video encoder + dedicated sensor protocol ⇒ zero AI-task regression.
- concepts/knowledge-distillation — teacher models in the pre-labelling pipeline (not in online inference, as in patterns/offline-teacher-online-student-distillation, but as a labelling-time substitute for scarce human oracles).
- concepts/multi-modal-attribute-extraction — future-work foundation model over camera + weight + motion + location.
Future work (stated in post)¶
- Full multi-modal sensor fusion — camera + weight + motion + location combined into a foundation model of the in-store environment.
- Detecting complex multi-item interactions and intent.
- Location-based experience improvements.
- Automatic surfacing of "the most valuable data for model improvements" — i.e. a trained selector on top of the Collector's trigger.
- Cost optimisation: multi-attribute extraction in a single pass (same idiom as [[patterns/multi-attribute-multi-product-prompt-batching]] on Instacart's PARSE platform) and efficient VLM inference.
Seen in¶
- Instacart — Turning Data into Velocity (2026-02-17) — architecture introduction post.
Related¶
- systems/instacart-caper — the edge device.
- systems/ray — Learner's training substrate.
- Sibling Instacart ML platforms with the same model-agnostic platform posture: PIXEL (image generation), PARSE (structured attribute extraction), Maple (batch LLM), Intent Engine (query understanding).
- companies/instacart