Capsight¶
Capsight is Instacart's end-to-end ML data platform for the Caper smart-cart fleet. It turns the deployed fleet into a distributed data-collection system that improves its own computer-vision and multi-sensor models over time: an edge→cloud data flywheel (Collect → Manage → Label → Train → Deploy) that closes on itself.
Why it exists¶
Caper is a stability-critical hardware product (a crash or misread leads to cart abandonment — the same engineering stance documented in the Caper Android migration on companies/instacart). Its accuracy depends on models trained to handle real-world in-store conditions: variable lighting, partial occlusion, damaged packaging, motion blur, unusual camera angles, and store-specific SKUs. Pre-Capsight, models were trained primarily on manually collected data that did not reflect the production distribution ([[concepts/production-data-diversity]]); the team had little observability into what carts actually experienced; and the full collect → label → train → release loop took about a month, so iteration cost scaled linearly with deployment size — explicitly called out as unacceptable.
Three components¶
Capsight Collector (edge)¶
An on-device agent that runs on every cart. Captures a synchronised multi-modal view (camera + weight scale + location) of every interaction. Engineered for zero performance regression on the cart's primary AI tasks and no impact on retailer store networks.
Key design choices:
- Trigger-based capture. Does not stream all data to the cloud. Captures only when a composite signal fires — currently activity signal (e.g. hand motion) + recognised barcode. More triggers in development. Canonical instance of patterns/trigger-based-edge-capture.
- Dedicated hardware video encoder. The cart's AI inference and the collector share the device; video encoding is offloaded to dedicated hardware so capture contributes zero cycles to the AI inference path. See concepts/hardware-offload.
- Dedicated protocol for weight + location. Designed to avoid performance degradation in the non-video sensor streams.
- Local-disk buffer + resilient uploader. Captured data is stored locally first; the uploader "carefully manages upload timing and bandwidth to avoid any impact on retailer operations or network performance"; includes a storage-threshold check that pauses collection if disk fills, and an auto-cleanup that drops oldest files if upload fails. Canonical instance of patterns/resilient-edge-uploader.
- Multi-modal platform by design. Phase-1 prioritises high-value CV data; the Collector is already integrated with scale and location streams and will expand.
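The trigger and local-buffer behaviour described above can be sketched in a few dozen lines. This is a minimal illustration, not Instacart's implementation: the class names, the 2-second co-occurrence window, and the 90% storage threshold are all assumptions.

```python
import collections


class TriggerGate:
    """Composite capture trigger: fire only when an activity signal
    (e.g. hand motion) and a recognised barcode co-occur within a
    short window. The window length is illustrative."""

    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.last_activity = None  # timestamp of last activity signal
        self.last_barcode = None   # timestamp of last barcode read

    def on_activity(self, now):
        self.last_activity = now
        return self._fires(now)

    def on_barcode(self, now):
        self.last_barcode = now
        return self._fires(now)

    def _fires(self, now):
        # Both signals must have fired recently for capture to start.
        return (
            self.last_activity is not None
            and self.last_barcode is not None
            and now - self.last_activity <= self.window_s
            and now - self.last_barcode <= self.window_s
        )


class LocalBuffer:
    """Local-disk buffer with storage self-protection: pause collection
    when usage crosses a threshold, and drop the oldest files when
    uploads keep failing. Sizes are tracked in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = collections.deque()  # (name, size), oldest first

    def can_collect(self):
        # Illustrative 90% threshold: stop capturing before disk fills.
        return self.used < 0.9 * self.capacity

    def add(self, name, size):
        self.files.append((name, size))
        self.used += size

    def evict_oldest(self):
        # Auto-cleanup path: on persistent upload failure, free space
        # by dropping the oldest captured file first.
        if not self.files:
            return None
        name, size = self.files.popleft()
        self.used -= size
        return name
```

A real agent would additionally pace uploads against network conditions (the resilient-uploader half of the pattern); that policy is retailer-network-specific and not published.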
Capsight Depot (cloud)¶
A centralised cloud data platform that ingests raw sensor packages from every Collector and produces curated, labelled training datasets.
Stages:
- Ingestion + processing. Distributed pipeline ingests raw files, extracts metadata, runs quality checks.
- Indexing + search + visualisation. All data + metadata securely stored, indexed, semantically enriched; a web UI lets engineers filter by metadata and jump directly to the matching video clips and logs. This is the observability substrate for the fleet — engineers can reproduce anything a cart experienced.
- AI-accelerated annotation. The load-bearing innovation:
    - Filter out empty-background images automatically.
    - Run a Vision-Language Model plus Instacart's internal teacher models to generate high-quality pre-labels for items + barcodes.
    - Send pre-labelled images to human annotators for correction, not from-scratch creation.
    - Projected >70% annotation cost reduction; multi-day tasks → hours.
    - The same pipeline is used to clean errors from historical ground-truth data. Canonical instance of patterns/vlm-assisted-pre-labeling. Uses the teacher-model idiom from [[concepts/knowledge-distillation]] applied to labelling rather than online inference.
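The annotation flow above reduces to a three-stage pipeline: filter, pre-label, human-correct. A minimal sketch, assuming hypothetical `vlm`, `teacher`, and `annotate` callables (the post does not name the actual models or interfaces):

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A captured image plus model-proposed labels (names hypothetical)."""
    id: str
    is_empty_background: bool = False
    prelabels: list = field(default_factory=list)


def prelabel_pipeline(frames, vlm, teacher, annotate):
    """Sketch of the Depot annotation flow:
    1. drop empty-background frames automatically;
    2. generate pre-labels with a VLM plus internal teacher models;
    3. hand pre-labelled frames to humans for *correction* only.
    """
    kept = [f for f in frames if not f.is_empty_background]
    for f in kept:
        # Union of VLM and teacher-model proposals (boxes/classes).
        f.prelabels = vlm(f) + teacher(f)
    # Humans verify or fix proposals instead of labelling from scratch —
    # this is where the projected >70% cost reduction comes from.
    return [annotate(f) for f in kept]
```

The cost win is structural: human time is spent only on correction of the frames that survive filtering, not on drawing every label.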
Capsight Learner (cloud)¶
A distributed, Ray-based training platform that consumes curated Depot datasets to train new model versions. Includes an automated evaluation pipeline that benchmarks each candidate against standardised test sets — only validated improvements ship to production carts.
Measured impact of Learner specifically: model training stage dropped from one week to two days, with the rest of the speedup coming from Depot's faster labelling.
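The promotion gate implied above ("only validated improvements ship") can be stated concretely. Production uses Ray for distributed training; this plain-Python sketch shows only the evaluation gate, with an acceptance rule (beat-or-match the baseline on every benchmark) that is an assumption — the real criteria are not published:

```python
def passes_gate(candidate, baseline, benchmarks):
    """Ship a candidate only if it matches or beats the current
    production model on every standardised benchmark (illustrative rule).
    `candidate` and `baseline` map benchmark name -> score."""
    return all(candidate[b] >= baseline[b] for b in benchmarks)


def select_for_deploy(candidates, baseline, benchmarks):
    """Return names of validated improvements; reject everything else."""
    return [
        name
        for name, metrics in candidates.items()
        if passes_gate(metrics, baseline, benchmarks)
    ]
```

For example, a candidate that improves one benchmark while regressing another is rejected — no partial wins ship to carts.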
Operational outcomes¶
Against the pre-Capsight baseline:
| Metric | Before | After |
|---|---|---|
| Annotation cost | fully manual, from-scratch labelling | >70% lower (VLM pre-label + human correct) |
| Labelling task time | multi-day | hours |
| Model training stage | 1 week | 2 days |
| End-to-end iteration cycle | ~1 month | 1 week |
| Model accuracy improvement | — | >5% within weeks of Capsight deployment |
Design ideas distilled¶
- concepts/edge-cloud-data-flywheel — the fleet improves its own models; iteration cost should not grow linearly with fleet size.
- concepts/production-data-diversity — only production data captures the real input distribution.
- patterns/distributed-fleet-as-data-pipeline — every deployed cart is also a data-collection node.
- patterns/trigger-based-edge-capture — only capture when a meaningful event is detected.
- patterns/resilient-edge-uploader — local buffer + bandwidth shaping + storage self-protection + upload-failure cleanup.
- patterns/vlm-assisted-pre-labeling — VLM + teacher models generate pre-labels; humans correct.
- concepts/hardware-offload — dedicated video encoder + dedicated sensor protocol ⇒ zero AI-task regression.
- concepts/knowledge-distillation — teacher models in the pre-labelling pipeline (not in online inference, as in patterns/offline-teacher-online-student-distillation, but as a labelling-time substitute for scarce human oracles).
- concepts/multi-modal-attribute-extraction — future-work foundation model over camera + weight + motion + location.
Future work (stated in post)¶
- Full multi-modal sensor fusion — camera + weight + motion + location combined into a foundation model of the in-store environment.
- Detecting complex multi-item interactions and intent.
- Location-based experience improvements.
- Automatic surfacing of "the most valuable data for model improvements" — i.e. a trained selector on top of the Collector's trigger.
- Cost optimisation: multi-attribute extraction in a single pass (same idiom as [[patterns/multi-attribute-multi-product-prompt-batching]] on Instacart's PARSE platform) and efficient VLM inference.
Seen in¶
- Instacart — Turning Data into Velocity (2026-02-17) — architecture introduction post.
Related¶
- systems/instacart-caper — the edge device.
- systems/ray — Learner's training substrate.
- Sibling Instacart ML platforms with the same model-agnostic platform posture: PIXEL (image generation), PARSE (structured attribute extraction), Maple (batch LLM), Intent Engine (query understanding).
- companies/instacart