
CONCEPT Cited by 2 sources

On-Device ML Inference

On-device ML inference is the practice of running ML inference on end-user hardware — smartphones, laptops, browsers, embedded devices, NPUs — rather than on cloud servers. In the serving-infra framing: inference compute is paid for in watts of the user's battery and milliseconds of the user's frame budget instead of in rack-rented GPU-hours. Model design, compilation, quantisation, and deployment therefore all have to respect a different cost function from server-side inference.

Constraints on-device inference runs into

On-device is budget-constrained along multiple axes simultaneously:

  • Compute. Mobile CPUs / GPUs / NPUs are orders of magnitude slower than datacentre accelerators. Models that comfortably meet server-side p99 latency targets can be structurally impossible to run at camera frame rate on a phone.
  • Memory. Phone RAM is typically 2–12 GB total across the entire OS. Model weights, activations, and scratch space all compete with the rest of the app and the OS.
  • Model size on disk. Users download apps. An extra 500 MB of model weights can halve install conversion. Pressure to keep the on-device model small comes from distribution, not just runtime.
  • Battery / thermal. Sustained inference drains battery and trips thermal throttling. A model that passes functional benchmarks may be unshippable if it heats the device in minutes.
  • Heterogeneity. The device fleet spans flagship, mid-range, and entry-level phones on both Android and iOS, across the last 3–5 years of hardware. A single model / precision / op-set may not run on all of them; per-device-class specialisation is the norm.
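The frame-rate constraint above is just arithmetic, and it is worth making explicit. A minimal sketch with assumed numbers (`frame_budget_ms` and `fits_budget` are hypothetical helpers, and the 50% headroom figure is an illustrative assumption, not a measured rule):

```python
# Back-of-envelope budget check for an on-device vision model.
# All concrete numbers here are illustrative assumptions, not measurements.

def frame_budget_ms(fps: int) -> float:
    """Total time available per frame at a target camera frame rate."""
    return 1000.0 / fps

def fits_budget(model_latency_ms: float, fps: int, headroom: float = 0.5) -> bool:
    """Leave roughly half the frame for capture, pre/post-processing, rendering."""
    return model_latency_ms <= frame_budget_ms(fps) * headroom

print(frame_budget_ms(30))        # ~33.3 ms per frame at 30 fps
print(fits_budget(12.0, fps=30))  # a 12 ms model fits in a ~16 ms slice
print(fits_budget(40.0, fps=30))  # a 40 ms model can never sustain 30 fps
```

The same arithmetic explains why a model that is fine for a one-shot photo feature (hundreds of milliseconds) can be unusable for a live camera effect.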

Why teams choose on-device over server-side

  • Latency. Real-time use cases (camera filters, live transcription, AR overlays, AV perception) can't round-trip to a server on every frame. On-device is often the only deployment point that meets the latency budget.
  • Privacy. User data (camera frames, microphone audio, contacts) never leaves the device. Avoids regulatory exposure and user trust erosion.
  • Offline / flaky-network behaviour. Inference works whether or not the user is connected. Important for consumer apps that ship globally.
  • Cost at scale. Consumer apps at the billion-user scale (YouTube, Maps, photo apps) can't economically serve per-user ML features from cloud GPU inference; the server-side unit economics don't close. On-device shifts the compute cost to the user's device.

The engineering stack

On-device ML serving is a compilation pipeline, not a model pipeline:

  • Trained model (PyTorch, JAX, TF).
  • Compression — distillation to a smaller architecture (concepts/knowledge-distillation), quantisation (concepts/quantization), pruning.
  • Runtime compile — TensorFlow Lite / LiteRT, Core ML, ONNX Runtime Mobile, custom inference engines.
  • Hardware dispatch — CPU, mobile GPU, NPU / Neural Engine, DSP. Op-set support varies; fallback paths required.
  • Packaging — weights shipped inside the app bundle or downloaded on first launch; often multiple variants keyed by device class.
  • Observability — on-device metrics (latency histograms, thermal events, crashes) reported back to the server for model-quality monitoring.
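The compression step in the pipeline above can be illustrated with a toy symmetric int8 quantiser. This is a pure-Python sketch, not a real runtime API — runtimes like LiteRT apply the same idea per tensor or per channel with calibrated scales, and `quantize_int8` / `dequantize` are hypothetical helpers:

```python
# Minimal sketch of post-training symmetric int8 quantisation:
# float weights become one int8 value each plus a shared per-tensor scale,
# cutting weight storage roughly 4x versus float32.

def quantize_int8(weights):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight is recovered to within one quantisation step (the scale):
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```

The error bound in the final assertion is the whole trade: one quantisation step of accuracy loss per weight, in exchange for a 4× smaller download and integer arithmetic the NPU / DSP can execute natively.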

Canonical wiki instance

YouTube real-time generative AI effects — YouTube ships stylised camera effects generated by a small on-device student model (UNet + MobileNet backbone + MobileNet-block decoder) that was distilled from a large offline teacher (StyleGAN2 then Imagen). The teacher is "far too slow for real-time use"; the student runs at camera frame rate on the phone (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai). Canonical because the post names the tensions explicitly: a powerful generative model class that the serving target (a phone) cannot run, bridged by distillation to a mobile-efficient student architecture (MobileNet).
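The YouTube case distils an image-to-image teacher pixel-wise, but the generic logit-matching form of knowledge distillation is easy to sketch. A pure-Python toy (real training uses a framework, batches, and usually a weighted sum with the hard-label loss; the temperature value here is an illustrative assumption):

```python
import math

# Generic knowledge distillation: the student is trained to match the
# teacher's softened output distribution, not just the hard labels.

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

teacher = [4.0, 1.0, 0.2]
aligned = distillation_loss([4.0, 1.0, 0.2], teacher)  # student matches teacher
off = distillation_loss([0.2, 1.0, 4.0], teacher)      # student disagrees
assert aligned < off  # the loss rewards matching the teacher
```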

Architectures selected for on-device

A running list of architectures whose existence is justified by on-device constraints:

  • MobileNet — depthwise-separable convolutions; Google's mobile-first CNN family (v1 2017, v2 2018, v3 2019).
  • EfficientNet-Lite — EfficientNet variants tuned for mobile ops / quantisation.
  • MobileViT — vision-transformer / MobileNet hybrid for mobile.
  • DistilBERT / TinyBERT / MobileBERT — distilled BERT variants for on-device NLP.
  • SqueezeNet — early (2016) on-device CNN compression shape.
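Why the depthwise-separable convolutions behind MobileNet earn their place on this list can be checked with back-of-envelope arithmetic (the layer shape below is an assumed example, not from any specific model):

```python
# Parameter counts: standard convolution vs. its depthwise-separable
# replacement, the core trick of the MobileNet family.

def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """A k x k conv with c_out filters, each spanning all c_in channels."""
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv to mix channels
    return depthwise + pointwise

# An assumed 3x3 layer, 256 -> 256 channels:
std = standard_conv_params(3, 256, 256)   # 589,824 parameters
sep = separable_conv_params(3, 256, 256)  # 2,304 + 65,536 = 67,840
print(std / sep)                          # roughly an 8-9x reduction
```

The same factorisation shrinks multiply-accumulate counts by a similar factor, which is what makes the family viable on mobile CPUs and NPUs.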
