
PATTERN

Distributed fleet as data pipeline

Intent

Treat a deployed fleet of edge devices — smart carts, cameras, vehicles, robots, phones, sensors — as a distributed data-collection pipeline feeding ML model training. The fleet's primary job is still the customer-facing workload (checkout, navigation, capture), but the fleet is also the upstream of the ML data pipeline, and the platform around it has to be designed for that dual role.

When to use

  • A large-enough deployed fleet exists that it materially covers the input distribution your models care about.
  • Production data is the only viable source of real-world diversity (concepts/production-data-diversity) — manually collected or purchased data under-represents the real distribution, so models trained on it underfit.
  • The org can commit to the infrastructure cost of the cloud side of the pipeline (ingest, index, annotate, train) — the pattern is useless if the edge uploads data but nothing consumes it.

Shape

The pipeline has four named stages that correspond 1:1 with the dataflow in concepts/edge-cloud-data-flywheel:

  1. On-device capture with triggers. Devices capture only meaningful events. See patterns/trigger-based-edge-capture.
  2. Edge-to-cloud transport. Resilient uploader with local buffer, bandwidth shaping, storage self-protection. See patterns/resilient-edge-uploader.
  3. Cloud data management. Ingestion, indexing, search, visualisation, annotation. Becomes the team's observability-of-the-fleet substrate as well as the training-data source. Automated curation and labelling (patterns/vlm-assisted-pre-labeling, patterns/data-driven-annotation-curation) are the load-bearing cost reducers at scale.
  4. Training + deployment. Distributed training platform (Ray, SageMaker, Kubeflow) consumes curated datasets; automated evaluation gates release; new weights flow back to the fleet.
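The four stages can be sketched end-to-end in a few dozen lines. This is a hypothetical illustration, not any real Capsight API — the names (`Device`, `transport`, `curate`) and the trivial trigger/curation predicates are stand-ins for the real components named above.

```python
# Minimal sketch of the four pipeline stages. All names are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Event:
    device_id: str
    payload: bytes

@dataclass
class Device:
    device_id: str
    trigger: Callable[[bytes], bool]           # stage 1: capture only on trigger
    buffer: list = field(default_factory=list)

    def observe(self, frame: bytes) -> None:
        if self.trigger(frame):                # meaningful event? capture it
            self.buffer.append(Event(self.device_id, frame))

def transport(device: Device, cloud: list) -> None:
    # stage 2: drain the local buffer to the cloud (resilience elided here;
    # see patterns/resilient-edge-uploader)
    while device.buffer:
        cloud.append(device.buffer.pop(0))

def curate(cloud: list) -> list:
    # stage 3: index and filter; here, simply drop empty payloads
    return [e for e in cloud if e.payload]

# Stage 4 (training) would consume curate(cloud) as a dataset.
cam = Device("cart-17", trigger=lambda f: len(f) > 3)
for frame in [b"..", b"item-scan", b"....", b""]:
    cam.observe(frame)

cloud: list = []
transport(cam, cloud)
dataset = curate(cloud)
print(len(dataset))  # 2
```

The point of the sketch is the shape, not the code: each stage has a narrow interface, so the edge side (stages 1–2) can be hardened independently of the cloud side (stages 3–4).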

Why it's a pattern, not an obvious default

Making the fleet work as a data pipeline requires deliberate engineering commitments on the edge side that most teams skip:

  • Performance isolation. The primary workload must not regress because of the collector. Capsight uses dedicated hardware video encoding and a separate communication protocol for weight + location to guarantee this.
  • Network friendliness. If the fleet lives on foreign networks (retailer stores, cellular, customer homes), upload bursts create real customer-facing problems.
  • Storage self-protection. Devices must degrade gracefully, not brick themselves, under adverse upload conditions.
  • Observability on the fleet. Fleet-wide health of the data-pipeline role has to be independently monitored (pause rates, cleanup events, trigger fire rates, per-store upload throughput).
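Two of these commitments — storage self-protection and network friendliness — reduce to small, well-known mechanisms. The sketch below (hypothetical names and thresholds, not Capsight's implementation) shows a buffer that evicts oldest captures past a high-watermark instead of filling the disk, and a token bucket that caps upload bandwidth; the eviction counter is exactly the kind of signal fleet observability should export.

```python
# Hypothetical sketch: storage self-protection + bandwidth shaping.
from collections import deque

class EdgeBuffer:
    def __init__(self, capacity_bytes: int, high_watermark: float = 0.9):
        self.limit = int(capacity_bytes * high_watermark)
        self.used = 0
        self.items = deque()
        self.evictions = 0                     # export to fleet observability

    def store(self, blob: bytes) -> None:
        self.items.append(blob)
        self.used += len(blob)
        while self.used > self.limit:          # degrade gracefully: drop oldest
            dropped = self.items.popleft()
            self.used -= len(dropped)
            self.evictions += 1

class TokenBucket:
    """Bandwidth shaping: uploads spend tokens refilled at bytes_per_sec."""
    def __init__(self, bytes_per_sec: int):
        self.rate = bytes_per_sec
        self.tokens = bytes_per_sec

    def tick(self) -> None:                    # call once per second
        self.tokens = min(self.tokens + self.rate, self.rate * 2)

    def try_send(self, blob: bytes) -> bool:
        if len(blob) <= self.tokens:
            self.tokens -= len(blob)
            return True
        return False                           # back off; retry next tick

buf = EdgeBuffer(capacity_bytes=100)
for _ in range(12):
    buf.store(b"x" * 10)                       # 120 bytes offered, limit is 90
print(buf.used, buf.evictions)                 # 90 3

tb = TokenBucket(bytes_per_sec=50)
print(tb.try_send(b"y" * 40), tb.try_send(b"y" * 40))  # True False
```

Note the design choice: the buffer drops its own oldest data rather than refusing new captures or exhausting storage — the device keeps serving its primary workload, and the loss is visible in the eviction metric.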

Design goal: decouple iteration cost from fleet size

Capsight's authors make this explicit:

"We want a rapid, automated way to learn from the vast diversity of real-world data so the full end-to-end model iteration cycle can improve on a weekly cadence instead of every month, and so the cost of iteration would not grow linearly with deployment size."


Naively, more deployed devices means more raw data, means more labelling, means more annotation staff, means slower iteration. The pattern's design target is to make annotation + training costs sublinear or constant in fleet size via:

  • Trigger-based capture (data volume is a function of events, not device-hours).
  • AI-assisted annotation (per-label cost drops with scale rather than growing).
  • Automated curation (humans only see the data worth looking at).
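The three mechanisms above can be made concrete with some illustrative arithmetic (all numbers hypothetical). In the naive model, human labelling cost scales with device-hours; in the pattern, triggers bound volume by events and a curation budget caps what humans ever see, so past a certain fleet size the annotation bill plateaus.

```python
# Illustrative cost arithmetic; every constant here is made up.
def naive_cost(devices: int, hours: float, labels_per_hour: float,
               cost_per_label: float) -> float:
    # every device-hour produces labels that humans must process
    return devices * hours * labels_per_hour * cost_per_label

def flywheel_cost(devices: int, events_per_device: float,
                  curation_budget: int, human_review_frac: float,
                  cost_per_label: float) -> float:
    total = devices * events_per_device        # triggers bound raw volume
    curated = min(total, curation_budget)      # curation caps what humans see
    return curated * human_review_frac * cost_per_label  # AI pre-labels the rest

for fleet in (100, 1_000, 10_000):
    n = naive_cost(fleet, hours=24, labels_per_hour=60, cost_per_label=0.05)
    f = flywheel_cost(fleet, events_per_device=40, curation_budget=50_000,
                      human_review_frac=0.05, cost_per_label=0.05)
    print(fleet, round(n), round(f))
```

With these (made-up) constants the naive cost grows 7,200 → 72,000 → 720,000 as the fleet grows 100 → 1,000 → 10,000, while the flywheel cost goes 10 → 100 → 125: linear until the curation budget binds, then effectively constant — which is the stated design target.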

Sibling data-pipeline idioms

  • Connected-vehicle fleets (Tesla, Waymo, Cruise) have documented versions of this pattern with richer triggers (disengagement, driver intervention, rare-object detection).
  • Industrial IoT / predictive maintenance — sensors on factory equipment feed anomaly-detection model retraining.
  • Mobile ML — on-device inference + federated learning is the privacy-preserving variant where labels stay on-device rather than uploading raw data.
