CONCEPT Cited by 2 sources
On-Device ML Inference¶
On-device ML inference is the practice of running ML inference on end-user hardware — smartphones, laptops, browsers, embedded devices, NPUs — rather than on cloud servers. The serving-infra framing: inference compute is paid for in watts of the user's battery and milliseconds of the user's frame budget instead of in rack-rented GPU-hours. Model design, compilation, quantisation, and deployment therefore have to respect a different cost function from server-side inference.
Constraints on-device inference runs into¶
On-device is budget-constrained along multiple axes simultaneously:
- Compute. Mobile CPUs / GPUs / NPUs are orders of magnitude slower than datacentre accelerators. Models that run comfortably at server p99 are structurally impossible at camera-frame rate on a phone.
- Memory. Phone RAM is typically 2–12 GB, shared with the OS and every other running app. Model weights, activations, and scratch space all compete for that shared pool.
- Model size on disk. Users download apps. An extra 500 MB of model weights can halve install conversion. Pressure to keep the on-device model small comes from distribution, not just runtime.
- Battery / thermal. Sustained inference drains battery and trips thermal throttling. A model that passes functional benchmarks may be unshippable if it heats the device in minutes.
- Heterogeneity. The device fleet spans flagship, mid-range, and entry-level phones, on both Android and iOS, across hardware generations from the last 3–5 years. A single model / precision / op-set may not run on all of them; per-device-class specialisation is the norm.
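The compute and heterogeneity constraints compose into a simple feasibility check. A hedged back-of-envelope sketch — the `DEVICE_CLASSES` throughput table and the 1 GFLOP model cost are illustrative assumptions, not measured figures:

```python
# Back-of-envelope: does a model of a given FLOP cost fit a ~30 fps
# camera frame budget on a given device class? Idealised compute-bound
# model -- ignores memory stalls, dispatch overhead, and thermal throttling.

FRAME_BUDGET_MS = 33.3  # one frame at ~30 fps

# Assumed sustained throughput per device class, in GFLOP/s.
DEVICE_CLASSES = {
    "flagship_npu": 2000.0,
    "midrange_gpu": 200.0,
    "entry_cpu": 20.0,
}

def per_frame_latency_ms(model_gflops: float, device_gflops_s: float) -> float:
    return model_gflops / device_gflops_s * 1000.0

def fits_frame_budget(model_gflops: float) -> dict:
    return {
        cls: per_frame_latency_ms(model_gflops, tput) <= FRAME_BUDGET_MS
        for cls, tput in DEVICE_CLASSES.items()
    }

# A ~1 GFLOP-per-frame model clears the budget on the flagship NPU
# (0.5 ms) but blows it on the entry-level CPU (50 ms) -- the gap
# that forces per-device-class model variants.
print(fits_frame_budget(1.0))
```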
Why teams choose on-device over server-side¶
- Latency. Real-time use cases (camera filters, live transcription, AR overlays, AV perception) can't round-trip to a server on every frame. On-device is often the only deployment point that meets the latency budget.
- Privacy. User data (camera frames, microphone audio, contacts) never leaves the device. Avoids regulatory exposure and user trust erosion.
- Offline / flaky-network behaviour. Serving works whether or not the user is connected. Important for consumer apps that ship globally.
- Cost at scale. Consumer apps at the billion-user scale (YouTube, Maps, photo apps) can't economically serve per-user ML features from cloud GPU inference; the server-side unit economics don't close. On-device shifts the compute cost to the user's device.
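The cost-at-scale point can be made concrete with a hedged back-of-envelope; every number below is an assumption chosen for illustration, not a figure from any real deployment:

```python
# Why server-side unit economics can fail to close at consumer scale.
# All inputs are illustrative assumptions.

DAU = 1_000_000_000            # daily active users
INFERENCES_PER_USER_DAY = 50   # brief daily use of one camera feature
INFERENCES_PER_GPU_SEC = 100   # server-GPU throughput for this model
GPU_HOUR_COST_USD = 2.0        # rented GPU-hour price

gpu_seconds_per_day = DAU * INFERENCES_PER_USER_DAY / INFERENCES_PER_GPU_SEC
daily_cost_usd = gpu_seconds_per_day / 3600 * GPU_HOUR_COST_USD

# ~139k GPU-hours/day -> roughly $278k/day, ~$100M/year, for one feature.
print(f"${daily_cost_usd:,.0f}/day")
```

On-device flips every term to zero marginal serving cost: the same 50 inferences/user/day run on silicon the user already owns.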
The engineering stack¶
On-device ML serving is a compilation pipeline, not a model pipeline:
- Trained model (PyTorch, JAX, TF).
- Compression — distillation to a smaller architecture (concepts/knowledge-distillation), quantisation (concepts/quantization), pruning.
- Runtime compile — TensorFlow Lite / LiteRT, Core ML, ONNX Runtime Mobile, custom inference engines.
- Hardware dispatch — CPU, mobile GPU, NPU / Neural Engine, DSP. Op-set support varies; fallback paths required.
- Packaging — weights shipped inside the app bundle or downloaded on first launch; often multiple variants keyed by device class.
- Observability — on-device metrics (latency histograms, thermal events, crashes) reported back to the server for model-quality monitoring.
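The compression stage can be illustrated with the simplest scheme in that family: symmetric per-tensor int8 post-training quantisation. A hedged pure-Python sketch — real runtimes like LiteRT apply this per-channel over tensors, not over Python lists:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantisation: one scale maps floats to int8."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]

w = [0.8, -0.32, 0.05, -1.27]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Weights now cost 1 byte each instead of 4; reconstruction error is
# bounded by half a quantisation step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= scale / 2
```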
Canonical wiki instance¶
YouTube real-time generative AI effects — YouTube ships stylised camera effects generated by a small on-device student model (UNet + MobileNet backbone + MobileNet-block decoder) that was distilled from a large offline teacher (StyleGAN2 then Imagen). The teacher is "far too slow for real-time use"; the student runs at camera frame rate on the phone (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai). Canonical because the post names the tensions explicitly: a powerful generative model class that the serving target (a phone) cannot run, bridged by distillation to a mobile-efficient student architecture (MobileNet).
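The distillation bridge can be sketched in its classic logit-matching form (Hinton et al.'s temperature-softened cross-entropy). Note the YouTube pipeline distils a generative teacher through its image outputs rather than logits, so this is the general pattern, not their exact loss; logits and temperature below are illustrative:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Cross-entropy of the student against temperature-softened teacher targets.

    The T*T factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)  # soft targets from the (offline) teacher
    q = softmax(student_logits, T)  # predictions from the small student
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * T * T

# The loss falls as the student's logits approach the teacher's.
far = distillation_loss([5.0, 1.0, -2.0], [0.0, 0.0, 0.0])
near = distillation_loss([5.0, 1.0, -2.0], [4.9, 1.1, -2.0])
assert near < far
```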
Architectures selected for on-device¶
A running list of architectures whose existence is justified by on-device constraints:
- MobileNet — depthwise-separable convolutions; Google's mobile-first CNN family (v1 2017, v2 2018, v3 2019).
- EfficientNet-Lite — EfficientNet variants tuned for mobile ops / quantisation.
- MobileViT — vision-transformer / MobileNet hybrid for mobile.
- DistilBERT / TinyBERT / MobileBERT — distilled BERT variants for on-device NLP.
- SqueezeNet — early (2016) compact-CNN architecture; AlexNet-level accuracy at a fraction of the parameter count.
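The MobileNet entry is worth a worked number: depthwise-separable convolutions earn their on-device place because the multiply-add count drops by roughly a factor of 1/C_out + 1/k² versus a standard convolution. A hedged sketch with illustrative layer shapes:

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-adds for a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k=3):
    """MobileNet factorisation: per-channel k x k filter, then a 1x1 channel mix."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative early-layer shape: 112x112 feature map, 64 -> 128 channels.
std = standard_conv_macs(112, 112, 64, 128)
sep = depthwise_separable_macs(112, 112, 64, 128)
print(f"{std / sep:.1f}x fewer multiply-adds")  # ~8.4x for k=3
```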
Seen in¶
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — YouTube's real-time generative AI effects pipeline, teacher-student distillation deployed to camera-frame-rate on-device inference via UNet-based MobileNet student.
- sources/2025-10-15-google-coral-npu-a-full-stack-platform-for-edge-ai — Coral NPU as a reference hardware substrate for the tightest end of the on-device envelope: always-on ambient sensing (~512 GOPS at a few milliwatts for hearables, AR glasses, smartwatches). Names the fragmented edge-ML ecosystem as the software dual of the on-device problem: each proprietary accelerator forces per-vendor compiler and command-buffer integration.
Related¶
- concepts/knowledge-distillation — the standard path to making a large-model capability fit the on-device substrate.
- concepts/quantization — composable compression axis.
- concepts/training-serving-boundary — on-device serving collapses the boundary to a single shipped binary.
- concepts/always-on-ambient-sensing — the tightest-power subset of the on-device envelope.
- concepts/ml-first-architecture — the chip-design stance that emerges when the on-device workload is ML-dominated.
- concepts/fragmented-hardware-software-ecosystem — the software-side consequence of heterogeneous edge accelerators.
- patterns/teacher-student-model-compression — engineering pattern that delivers an on-device student.
- systems/mobilenet — canonical on-device backbone.
- systems/youtube-real-time-generative-ai-effects — canonical wiki production instance.
- systems/coral-npu — canonical wiki reference-silicon instance for the ambient-sensing end of the envelope.