Production data diversity¶
Definition¶
Production data diversity is the observation that the distribution of inputs an ML model sees in production is systematically broader and messier than the distribution it sees during training on manually collected or purchased datasets — and that the resulting training-time data gaps are the dominant source of production model regressions.
Restated as a claim: the only reliable source of "what the real input distribution looks like" is the deployed system itself; everything else is a proxy.
The gap, concretely¶
For a CV model in a grocery smart cart (Instacart Caper), training data gathered manually (staged shots, clean products, controlled lighting) misses:
- Lighting variation — different store HVAC, time of day, ceiling height, camera position.
- Occlusion — partial views as items are picked up, neighbouring items blocking the target, the user's hand.
- Damaged packaging — crushed boxes, torn labels, faded print — real stock is not studio stock.
- Motion blur — hand motion, cart motion, shake.
- Unusual angles — items placed sideways, upside-down, backwards.
- Store-specific SKUs — regional products, seasonal lines, store-brand items that were never in the training corpus.
Analogous gaps exist across domains:
- Search — real user queries have typos, multi-intent phrasing, and long tails not captured by engineer-written test queries. See concepts/long-tail-query.
- LLM chat — production users prompt in ways synthetic eval sets never cover.
- Autonomous vehicles — weather, construction zones, jaywalkers, road debris; the canonical domain where production-data-diversity is named as the hard problem.
Implications for architecture¶
- Production data has to enter the training pipeline. If it doesn't, models underfit the tail by construction. See concepts/edge-cloud-data-flywheel.
- Random sampling of production data is not enough — the interesting examples (failures, edge cases) are rare by definition. Either capture at trigger-defined moments (patterns/trigger-based-edge-capture), or curate afterwards by signal ([[patterns/data-driven-annotation-curation]]).
- Label the tail, not the mean. Blanket labelling of all production data burns annotation budget on easy cases. VLM pre-labelling + human correction (patterns/vlm-assisted-pre-labeling) or low-confidence-routed review ([[patterns/low-confidence-to-human-review]]) concentrates human time where it matters.
- Update models on a cadence that matches the drift. Production distributions evolve (new products, new stores, new user behaviours). Slow retraining = stale models.
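The capture-and-route logic above can be sketched in a few lines. This is a minimal, hypothetical illustration — the names (`Frame`, `should_capture`, `route_for_labeling`) and the thresholds are assumptions, not anything from the Instacart post:

```python
# Hypothetical sketch: trigger-based capture + low-confidence routing.
# Thresholds and signal names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float  # model score in [0, 1]

@dataclass
class Frame:
    frame_id: str
    prediction: Prediction
    user_corrected: bool      # e.g. shopper fixed the item on-screen
    motion_blur_score: float  # 0 = sharp, 1 = heavy blur

CONF_THRESHOLD = 0.6  # assumed; tune per model
BLUR_THRESHOLD = 0.7  # assumed

def should_capture(frame: Frame) -> bool:
    """Trigger-based capture: keep only frames likely to be informative."""
    return (
        frame.user_corrected                       # ground-truth disagreement
        or frame.prediction.confidence < CONF_THRESHOLD
        or frame.motion_blur_score > BLUR_THRESHOLD
    )

def route_for_labeling(frame: Frame) -> str:
    """Concentrate human annotation time on the tail, not the mean."""
    if not should_capture(frame):
        return "discard"       # easy case: don't burn annotation budget
    if frame.prediction.confidence < CONF_THRESHOLD:
        return "human_review"  # low-confidence → human-in-the-loop
    return "auto_label"        # trigger fired but model is confident:
                               # pre-label, human spot-check
```

The point of the sketch is the asymmetry: most frames are discarded unlabelled, and human review is reserved for the cases the model itself flags as hard.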
Cited evidence¶
From the Instacart Capsight post (2026-02-17):
Our models were primarily trained on manually-collected data that didn't fully reflect the complexity of real-world stores, including lighting changes, occlusions, damaged packaging, motion blur, unusual angles, and store-specific products.
After the Capsight flywheel was deployed:
Within just weeks of deployment, we collected enough diverse, real-world data to train improved models that showed more than 5% improvement in accuracy, with continued gains as the deployment scales. Our dataset is now richer and systematically captures edge cases, lighting variations, and store-specific products that make our AI models more robust.
Seen in¶
- systems/capsight — Caper CV model accuracy improved by more than 5% within weeks of Capsight data entering the loop, with further gains as the fleet scales. (Source: sources/2026-02-17-instacart-turning-data-into-velocity-capers-edge-and-cloud-data-flywheel-with-capsight)