Capsight

Capsight is Instacart's end-to-end ML data platform for the Caper smart-cart fleet. It turns the deployed fleet into a distributed data-collection system that improves its own computer-vision and multi-sensor models over time — an edge→cloud data flywheel (Collect → Manage → Label → Train → Deploy) that closes on itself.

Why it exists

Caper is a stability-critical hardware product (a crash or misread leads to cart abandonment — the same engineering stance documented in the Caper Android migration on companies/instacart). Its accuracy depends on models trained to handle real-world in-store conditions: variable lighting, partial occlusion, damaged packaging, motion blur, unusual camera angles, and store-specific SKUs. Pre-Capsight, models were trained primarily on manually collected data that did not reflect the production distribution ([[concepts/production-data-diversity]]); the team had little observability into what carts actually experienced; and the full collect → label → train → release loop took a month, so iteration cost scaled linearly with deployment size — explicitly called out as unacceptable.

Three components

Capsight Collector (edge)

An on-device agent that runs on every cart. Captures a synchronised multi-modal view (camera + weight scale + location) of every interaction. Engineered for zero performance regression on the cart's primary AI tasks and no impact on retailer store networks.

Key design choices:

  • Trigger-based capture. Does not stream all data to the cloud. Captures only when a composite signal fires — currently activity signal (e.g. hand motion) + recognised barcode. More triggers in development. Canonical instance of patterns/trigger-based-edge-capture.
  • Dedicated hardware video encoder. The cart's AI inference and the collector share the device; video encoding is offloaded to dedicated hardware so capture contributes zero cycles to the AI inference path. concepts/hardware-offload.
  • Dedicated protocol for weight + location. Designed to avoid performance degradation in the non-video sensor streams.
  • Local-disk buffer + resilient uploader. Captured data is stored locally first; the uploader "carefully manages upload timing and bandwidth to avoid any impact on retailer operations or network performance"; includes a storage-threshold check that pauses collection if disk fills, and an auto-cleanup that drops oldest files if upload fails. Canonical instance of patterns/resilient-edge-uploader.
  • Multi-modal platform by design. Phase-1 prioritises high-value CV data; the Collector is already integrated with scale and location streams and will expand.
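The trigger, threshold-pause, and oldest-first-eviction behaviours above can be sketched together. This is a minimal illustration with hypothetical names (`should_capture`, `CaptureBuffer`) and an in-memory deque standing in for the cart's local-disk buffer — not the Collector's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field

def should_capture(activity_signal: bool, barcode_recognised: bool) -> bool:
    """Composite trigger: capture only when BOTH signals fire
    (e.g. hand motion AND a recognised barcode)."""
    return activity_signal and barcode_recognised

@dataclass
class CaptureBuffer:
    """Local-buffer sketch: a storage threshold pauses collection,
    and oldest files are dropped if uploads keep failing."""
    max_items: int = 100                      # stand-in for a disk-space threshold
    items: deque = field(default_factory=deque)

    def store(self, package: str) -> bool:
        if len(self.items) >= self.max_items:
            return False                      # threshold hit: pause collection
        self.items.append(package)
        return True

    def evict_oldest(self) -> None:
        if self.items:
            self.items.popleft()              # auto-cleanup on persistent upload failure
```

The key property is that capture decisions and cleanup are both local, so the cart never depends on cloud availability to stay healthy.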

Capsight Depot (cloud)

A centralised cloud data platform that ingests raw sensor packages from every Collector and produces curated, labelled training datasets.

Stages:

  1. Ingestion + processing. Distributed pipeline ingests raw files, extracts metadata, runs quality checks.
  2. Indexing + search + visualisation. All data + metadata securely stored, indexed, semantically enriched; a web UI lets engineers filter by metadata and jump directly to the matching video clips and logs. This is the observability substrate for the fleet — engineers can reproduce anything a cart experienced.
  3. AI-accelerated annotation. The load-bearing innovation:
    • Filter out empty-background images automatically.
    • Run a Vision-Language Model plus Instacart's internal teacher models to generate high-quality pre-labels for items + barcodes.
    • Send pre-labelled images to human annotators for correction, not from-scratch creation.
    • Projected >70% annotation cost reduction; multi-day tasks → hours.
    • Same pipeline used to clean errors from historical ground-truth data. Canonical instance of patterns/vlm-assisted-pre-labeling. Uses the teacher-model idiom from [[concepts/knowledge-distillation]] applied to labelling rather than online inference.
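The annotation stages above — filter empty frames, pre-label with a VLM plus teacher models, route to humans for correction — can be sketched as a small pipeline. All names here (`is_empty_background`, `pre_label`, the `detections` metadata field) are hypothetical stand-ins, not Depot's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreLabel:
    image_id: str
    boxes: list          # [(label, bbox), ...] proposed by the VLM / teacher models
    needs_review: bool   # routed to human correction, not from-scratch labelling

def is_empty_background(image_meta: dict) -> bool:
    """Hypothetical filter: drop frames with nothing detected in them."""
    return image_meta.get("detections", 0) == 0

def pre_label(image_meta: dict, vlm_propose: Callable) -> Optional[PreLabel]:
    """Stage the annotation work: filter, pre-label, then queue for correction."""
    if is_empty_background(image_meta):
        return None                      # never reaches a human annotator
    boxes = vlm_propose(image_meta)      # VLM + internal teacher models
    return PreLabel(image_meta["id"], boxes, needs_review=True)
```

The cost saving comes from the last step: humans verify and correct machine proposals instead of drawing every box themselves.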

Capsight Learner (cloud)

A distributed, Ray-based training platform that consumes curated Depot datasets to train new model versions. Includes an automated evaluation pipeline that benchmarks each candidate against standardised test sets — only validated improvements ship to production carts.
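The release gate can be expressed as a simple policy: a candidate ships only if it matches or beats the current production model on every standardised benchmark. This is a sketch of that gating logic under assumed names (`should_release`, metric keys); the post does not specify the actual comparison rule.

```python
def should_release(candidate: dict, baseline: dict, min_delta: float = 0.0) -> bool:
    """Ship a candidate model only if it beats the baseline on every
    standardised test set by at least min_delta (hypothetical policy)."""
    return all(
        candidate[name] >= baseline[name] + min_delta
        for name in baseline
    )
```

A regression on any single benchmark blocks the release, which is what keeps "only validated improvements ship" true for a stability-critical fleet.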

Measured impact of Learner specifically: the model-training stage dropped from one week to two days; the rest of the end-to-end speedup comes from Depot's faster labelling.

Operational outcomes

Against the pre-Capsight baseline:

| Metric | Before | After |
| --- | --- | --- |
| Annotation cost | human from-scratch, $$$ | >70% lower (VLM pre-label + human correction) |
| Labelling task time | multi-day | hours |
| Model training stage | 1 week | 2 days |
| End-to-end iteration cycle | ~1 month | 1 week |
| Model accuracy | — | >5% improvement within weeks of Capsight deployment |

Future work (stated in post)

  • Full multi-modal sensor fusion — camera + weight + motion + location combined into a foundation model of the in-store environment.
  • Detecting complex multi-item interactions and intent.
  • Location-based experience improvements.
  • Automatic surfacing of "the most valuable data for model improvements" — i.e. a trained selector on top of the Collector's trigger.
  • Cost optimisation: multi-attribute extraction in a single pass (same idiom as [[patterns/multi-attribute-multi-product-prompt-batching]] on Instacart's PARSE platform) and efficient VLM inference.
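The single-pass idea is that one VLM call extracts every attribute of interest instead of one call per attribute. A minimal sketch, with an entirely hypothetical attribute set and prompt shape (the post gives neither):

```python
import json

ATTRIBUTES = ["item_name", "brand", "size", "barcode_visible"]  # hypothetical set

def build_batched_prompt(image_ref: str, attributes=ATTRIBUTES) -> str:
    """One prompt asks for all attributes at once, amortising the VLM call."""
    fields = ", ".join(f'"{a}": ...' for a in attributes)
    return (
        f"For image {image_ref}, return a single JSON object "
        f"with all of these fields: {{{fields}}}"
    )

def parse_response(raw: str) -> dict:
    """One JSON parse recovers every attribute from the single response."""
    return json.loads(raw)
```

With N attributes this turns N inference calls into one, which is where the cost saving comes from.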
