
PATTERN

Resilient edge uploader

Intent

Move captured data from an edge device to the cloud without impacting the device's primary workload or the network it lives on, and without losing data when either the device's local storage fills or the upload path fails. Sits between on-device capture and the cloud data platform.

When to use

  • Edge device has bounded local storage (cart computer, embedded device, phone).
  • Device lives on a foreign network whose bandwidth / capacity you don't own — retailer stores, customer homes, cellular links. Upload bursts will be noticed and resented.
  • Upload path is unreliable — store Wi-Fi drops, cellular variability, long intervals between backhaul opportunities.
  • Data is valuable (cost to re-collect is high) so dropping captures is a last resort, not a default.

Mechanism

Four cooperating policies on the device:

  1. Write-first, upload-later. Every captured artefact is persisted to local storage first (atomic write, fsynced). Upload is a separate stage that reads from local storage. If the device reboots mid-upload, no data is lost.
  2. Bandwidth-aware upload scheduling. The uploader "carefully manages upload timing and bandwidth to avoid any impact on retailer operations or network performance" — concrete mechanisms typically include: rate-limiting the upload socket, uploading during off-peak windows, pausing during store business hours, only uploading when a known-idle / known-dedicated link is available.
  3. Storage-threshold check that pauses collection. If local disk usage exceeds a configurable threshold, the uploader signals the collector to stop capturing new data. This is explicitly a backpressure signal (concepts/backpressure) from the upload path back to the capture path — the device does not overflow its own disk silently.
  4. Auto-cleanup of oldest files on upload failure. If uploads continue to fail and storage keeps growing despite collection being paused, the uploader drops the oldest files first. The assumption: newer data is more valuable (reflects current conditions) than stale un-uploaded data. The failure mode the device won't enter is "filled disk, unable to boot, unable to capture anything".

Discipline

  • The threshold is configurable, not hard-coded. Different devices have different disk sizes and capture rates; the threshold has to be tuned per hardware SKU.
  • Monitor the pause + cleanup signals centrally. If a subset of devices persistently pauses or cleans up, that's a fleet-health signal — likely a network issue at specific stores, expired or misconfigured upload credentials, or an upstream cloud-ingest slowdown. Treat these events as alerting metrics, not silent degradation.
  • Pause-collection is not silent data loss. Decide explicitly whether the fleet-level SLA is "never pause collection" (in which case the threshold becomes an escalation) or "pause is acceptable during outages" (in which case you just log it).
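One way the pause and cleanup signals might surface centrally is as structured events on the device's telemetry channel. A minimal sketch — the event names, `device` field, and stderr transport are illustrative assumptions; a real fleet would ship these through its metrics pipeline:

```python
import json
import sys
import time


def emit_event(kind: str, device_id: str, **fields) -> str:
    """Emit a structured event the fleet monitor can aggregate and
    alert on, e.g. kind="uploader.collection_paused" or
    kind="uploader.cleanup_dropped_file" (hypothetical names)."""
    event = {"ts": time.time(), "kind": kind, "device": device_id, **fields}
    line = json.dumps(event)
    print(line, file=sys.stderr)   # stand-in for the telemetry channel
    return line
```

Because these events are small and operationally urgent, they belong in the immediate tier of the two-tier telemetry split discussed under trade-offs, not in the opportunistic bulk-upload path.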

Trade-offs

  • Buffer depth vs. freshness of insight. Deep on-device buffers mean captures survive longer outages but cloud analysis lags the actual event. Shallower buffers mean the flywheel spins faster but is more sensitive to upload reliability.
  • Upload timing strategy vs. telemetry urgency. If the device emits safety-critical events you need to see in real time, the "off-peak only" strategy conflicts with that. Usually solved by a two-tier telemetry split — small, critical events stream immediately; bulk captures (video) ship opportunistically.
  • Retention of dropped-by-cleanup files. None, by construction — but if you expect selective retention ("keep one clip per day even under pressure"), encode that as a cleanup-priority rule rather than strict FIFO.
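The selective-retention trade-off above can be encoded as a cleanup-priority rule instead of strict FIFO. A sketch assuming a "keep the newest clip of each calendar day" policy — the `Clip` type and the rule itself are illustrative, not prescribed by the pattern:

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass
class Clip:
    name: str
    captured_at: datetime
    size_bytes: int


def cleanup_candidates(clips: list[Clip]) -> list[Clip]:
    """Oldest-first cleanup with a retention rule: never offer the
    last-kept clip of any calendar day for deletion, so at least one
    clip per day survives storage pressure."""
    by_day: dict[date, list[Clip]] = {}
    for clip in clips:
        by_day.setdefault(clip.captured_at.date(), []).append(clip)
    protected = set()
    for day_clips in by_day.values():
        # shield the newest clip of each day even under pressure
        keeper = max(day_clips, key=lambda c: c.captured_at)
        protected.add(keeper.name)
    droppable = [c for c in clips if c.name not in protected]
    return sorted(droppable, key=lambda c: c.captured_at)  # oldest first
```

The uploader then deletes from the front of `cleanup_candidates(...)` until usage falls below the cleanup threshold, rather than blindly unlinking the oldest file.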
