CONCEPT Cited by 3 sources
On-Device ML Inference
On-device ML inference means running ML inference on end-user hardware (smartphones, laptops, browsers, embedded devices, NPUs) rather than on cloud servers. The serving-infra framing: the inference compute is paid for in energy drawn from the user's battery and milliseconds of the user's frame budget instead of in rack-rented GPU-hours. Model design, compilation, quantisation, and deployment all have to respect a different cost function from server-side inference.
Constraints on-device inference runs into
On-device inference is budget-constrained along multiple axes simultaneously:
- Compute. Mobile CPUs / GPUs / NPUs are orders of magnitude slower than datacentre accelerators. A model that comfortably meets its p99 latency target on a server can be infeasible at camera frame rate on a phone.
- Memory. Phone RAM is typically 2–12 GB in total, shared by the OS and every running app. Model weights, activations, and scratch space all compete with the rest of the app and the OS.
- Model size on disk. Users download apps. An extra 500 MB of model weights can halve install conversion. Pressure to keep the on-device model small comes from distribution, not just runtime.
- Battery / thermal. Sustained inference drains battery and trips thermal throttling. A model that passes functional benchmarks may be unshippable if it heats the device in minutes.
- Heterogeneity. The device fleet spans flagship, mid-range, and entry-level phones on both Android and iOS, released over the last 3–5 years. A single model / precision / op-set may not run on all of them; per-device-class specialisation is the norm.
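The memory and disk-size constraints above come down to simple arithmetic on parameter count and precision. The sketch below is a rough estimator, not a measurement tool: it counts weights only (activations and scratch space are ignored), and the 5-million-parameter model and 10 MB download budget are hypothetical values chosen for illustration.

```python
# Bytes per weight for common numeric precisions (standard sizes).
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_mb(num_params: int, precision: str) -> float:
    """Approximate size of the weights alone, in MB (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e6


def fits_download_budget(num_params: int, precision: str, budget_mb: float) -> bool:
    """True if the weights alone fit a (hypothetical) per-model download budget."""
    return weight_footprint_mb(num_params, precision) <= budget_mb


if __name__ == "__main__":
    params = 5_000_000  # hypothetical model size
    for precision in BYTES_PER_WEIGHT:
        size = weight_footprint_mb(params, precision)
        ok = fits_download_budget(params, precision, budget_mb=10.0)
        print(f"{precision}: {size:.1f} MB, fits 10 MB budget: {ok}")
```

The same weights shrink fourfold going from fp32 to int8, which is why quantisation shows up in both the runtime and the distribution discussion.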
Why teams choose on-device over server-side
- Latency. Real-time use cases (camera filters, live transcription, AR overlays, AV perception) can't round-trip to a server on every frame. On-device is often the only placement that meets the latency budget.
- Privacy. User data (camera frames, microphone audio, contacts) never leaves the device. Avoids regulatory exposure and user trust erosion.
- Offline / flaky-network behaviour. Inference works whether or not the user is connected. Important for consumer apps that ship globally.
- Cost at scale. Consumer apps at the billion-user scale (YouTube, Maps, photo apps) can't economically serve per-user ML features from cloud GPU inference; the server-side unit economics don't close. On-device shifts the compute cost to the user's device.
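The latency argument can be made concrete by comparing each serving path against the per-frame time budget. This is a minimal sketch with made-up numbers: the `ServingPath` type and the inference and round-trip timings are assumptions for illustration, not measurements from any source cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingPath:
    """One way of serving a frame; all timings are hypothetical."""
    name: str
    inference_ms: float
    network_rtt_ms: float = 0.0  # zero for on-device inference

    @property
    def total_ms(self) -> float:
        return self.inference_ms + self.network_rtt_ms


def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame at the given frame rate."""
    return 1000.0 / fps


def meets_frame_budget(path: ServingPath, fps: float) -> bool:
    return path.total_ms <= frame_budget_ms(fps)


if __name__ == "__main__":
    on_device = ServingPath("on-device", inference_ms=25.0)
    server = ServingPath("server", inference_ms=5.0, network_rtt_ms=60.0)
    for path in (on_device, server):
        print(path.name, path.total_ms, meets_frame_budget(path, fps=30.0))
```

Under these assumed numbers the server model is five times faster yet still misses the 30 fps budget, because the network round trip alone exceeds it.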
The engineering stack
On-device ML serving is a compilation pipeline, not a model pipeline:
- Trained model (PyTorch, JAX, TF).
- Compression — distillation to a smaller architecture (concepts/knowledge-distillation), quantisation (concepts/quantization), pruning.
- Runtime compile — TensorFlow Lite / LiteRT, Core ML, ONNX Runtime Mobile, custom inference engines.
- Hardware dispatch — CPU, mobile GPU, NPU / Neural Engine, DSP. Op-set support varies; fallback paths required.
- Packaging — weights shipped inside the app bundle or downloaded on first launch; often multiple variants keyed by device class.
- Observability — on-device metrics (latency histograms, thermal events, crashes) reported back to the server for model-quality monitoring.
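The hardware-dispatch step, with its fallback paths, can be sketched as a preference-ordered search over backends. This is a simplification under stated assumptions: the op-support table is invented for illustration, and the whole model is placed on a single backend, whereas real runtimes typically partition the graph and fall back per subgraph.

```python
from typing import Dict, List, Set

# Hypothetical op-support table; real tables depend on the runtime and device.
BACKEND_OPS: Dict[str, Set[str]] = {
    "npu": {"conv2d", "depthwise_conv2d", "relu"},
    "gpu": {"conv2d", "depthwise_conv2d", "relu", "resize_bilinear"},
    "cpu": {"conv2d", "depthwise_conv2d", "relu", "resize_bilinear", "custom_warp"},
}


def pick_backend(model_ops: Set[str], preference: List[str]) -> str:
    """Return the first backend in preference order that supports every op."""
    for backend in preference:
        if model_ops <= BACKEND_OPS.get(backend, set()):
            return backend
    raise RuntimeError(f"no backend supports all of: {sorted(model_ops)}")


if __name__ == "__main__":
    order = ["npu", "gpu", "cpu"]  # fastest first, most compatible last
    print(pick_backend({"conv2d", "relu"}, order))
    print(pick_backend({"conv2d", "resize_bilinear"}, order))
    print(pick_backend({"conv2d", "custom_warp"}, order))
```

Keeping the CPU last in the preference list gives every supported model a guaranteed, if slow, execution path.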
Canonical wiki instance
YouTube real-time generative AI effects: YouTube ships stylised camera effects generated by a small on-device student model (a UNet with a MobileNet backbone and a MobileNet-block decoder) that was distilled from a large offline teacher (first StyleGAN2, later Imagen). The teacher is "far too slow for real-time use"; the student runs at camera frame rate on the phone (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai). It is canonical because the post names the tension explicitly: a powerful generative model class that the serving target (a phone) cannot run, bridged by distillation to a mobile-efficient student architecture (MobileNet).
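The teacher-student idea can be illustrated with a deliberately tiny example: a cheap student is fitted to outputs produced offline by a teacher. This is a toy sketch, not the YouTube pipeline. The `teacher` function is a hypothetical stand-in for an expensive model, the student is a two-parameter linear model, and the loss is plain mean squared error.

```python
from typing import List, Tuple


def teacher(x: float) -> float:
    """Hypothetical stand-in for a large, slow offline model."""
    return 3.0 * x + 1.0


def distill(inputs: List[float], steps: int = 2000, lr: float = 0.1) -> Tuple[float, float]:
    """Fit a linear student w*x + b to teacher outputs by gradient descent."""
    targets = [teacher(x) for x in inputs]  # teacher runs offline, once
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            err = (w * x + b) - y  # student output minus teacher output
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    inputs = [i / 10 for i in range(11)]
    w, b = distill(inputs)
    print(f"student: w={w:.4f}, b={b:.4f}")
```

Only the student's parameters ship to the device; the teacher is needed only while generating training targets.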
Architectures selected for on-device
A running list of architectures whose existence is justified by on-device constraints:
- MobileNet — depthwise-separable convolutions; Google's mobile-first CNN family (v1 2017, v2 2018, v3 2019).
- EfficientNet-Lite — EfficientNet variants tuned for mobile ops / quantisation.
- MobileViT — vision-transformer / MobileNet hybrid for mobile.
- DistilBERT / TinyBERT / MobileBERT — distilled BERT variants for on-device NLP.
- SqueezeNet (2016), an early compact CNN designed for small model size.
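The depthwise-separable convolution behind MobileNet is cheap for a reason that is easy to check numerically. The sketch below counts multiply-adds for a standard convolution and for its depthwise-plus-pointwise replacement, using the standard cost formulas; the layer shape in the example is an arbitrary illustration.

```python
def standard_conv_madds(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Multiply-adds for a k x k standard convolution over an h x w output map."""
    return k * k * c_in * c_out * h * w


def separable_conv_madds(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Multiply-adds for a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def cost_ratio(k: int, c_in: int, c_out: int, h: int, w: int) -> float:
    """Separable cost divided by standard cost; equals 1/c_out + 1/k**2."""
    return separable_conv_madds(k, c_in, c_out, h, w) / standard_conv_madds(k, c_in, c_out, h, w)


if __name__ == "__main__":
    # Arbitrary example layer: 3x3 kernel, 256 -> 256 channels, 14x14 output.
    print(f"cost ratio: {cost_ratio(3, 256, 256, 14, 14):.4f}")
```

For 3x3 kernels the ratio approaches 1/9 as the output channel count grows, so the separable form needs roughly eight to nine times fewer multiply-adds.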
The "how do we know it works?" problem and federated analytics
On-device inference solves the per-frame-latency / per-user-privacy / per-device-cost problem at deployment time, but creates a measurement gap: the team running the on-device model can't observe its real-world performance the way they would for a server-side model (direct logging of inputs, outputs, error rates).
The 2026-05-27 Google zero-trust-aggregation post articulates this gap explicitly:
"When models are deployed locally on-device, simply knowing that a model is 'running' isn't enough to understand its behavior, effectiveness, or failure modes. This limits the ability to answer critical questions like:
- Is the model drifting? (e.g., Does a translation model struggle with new slang emerging in a specific region?)
- Are there hidden biases? (e.g., Is an image classifier less accurate under specific lighting conditions common in certain geographic areas?)
- What is the real-world error rate?"
The wiki-canonical answer is federated analytics: aggregate effectiveness metadata across the device fleet without ever shipping per-device data off the device. The 2026-05-27 post describes the third-generation architecture for this, a cryptography-plus-TEE defense-in-depth design that composes lattice-based secure aggregation, TEE attestation, and DP noise. The first named production target is Android System SafetyCore (Android 9+), which hosts on-device safety classifiers whose effectiveness is measured by the federated-analytics composition.
This pairing is structural: on-device ML inference and federated analytics are dual disciplines — one delivers the model to where the data is; the other observes the model's behaviour without the data leaving.
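To show how a fleet-wide total can be learned without any single device's value being visible, here is a toy secure-aggregation sketch. It uses classic pairwise additive masking, which is a different and much simpler technique than the lattice-based scheme the post describes, and it omits TEE attestation and DP noise entirely. One seeded generator simulates the pairwise secrets that real clients would agree on through key exchange.

```python
import random
from typing import List

MODULUS = 2 ** 32  # all arithmetic is done modulo this value


def pairwise_masks(num_clients: int, seed: int) -> List[int]:
    """For each pair (i, j), client i adds a random r and client j subtracts it."""
    rng = random.Random(seed)  # simulation only; real clients derive r pairwise
    masks = [0] * num_clients
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            r = rng.randrange(MODULUS)
            masks[i] = (masks[i] + r) % MODULUS
            masks[j] = (masks[j] - r) % MODULUS
    return masks


def mask_reports(values: List[int], seed: int = 0) -> List[int]:
    """What each device uploads: its value plus its net mask."""
    masks = pairwise_masks(len(values), seed)
    return [(v + m) % MODULUS for v, m in zip(values, masks)]


def aggregate(masked_reports: List[int]) -> int:
    """Server-side sum; the pairwise masks cancel, leaving the true total."""
    return sum(masked_reports) % MODULUS


if __name__ == "__main__":
    error_counts = [3, 0, 5, 1]  # hypothetical per-device metrics
    uploads = mask_reports(error_counts)
    print("total:", aggregate(uploads))
```

The server recovers the total because every mask is added once and subtracted once, while each individual upload looks like a uniformly random number on its own.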
Seen in
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — YouTube's real-time generative AI effects pipeline, teacher-student distillation deployed to camera-frame-rate on-device inference via UNet-based MobileNet student.
- sources/2025-10-15-google-coral-npu-a-full-stack-platform-for-edge-ai — Coral NPU as a reference hardware substrate for the tightest end of the on-device envelope: always-on ambient sensing (~512 GOPS at a few milliwatts for hearables, AR glasses, smartwatches). Names the fragmented edge-ML ecosystem as the software dual of the on-device problem: each proprietary accelerator forces per-vendor compiler and command-buffer integration.
- sources/2026-05-27-google-private-analytics-via-zero-trust-aggregation — On-device ML inference's structural measurement gap (no central logging of per-user inputs / outputs / error rates) as the load-bearing motivator for federated analytics. Canonical wiki articulation that on-device inference and federated-analytics measurement are dual disciplines. First named production deployment of the third-generation private-analytics architecture is Android System SafetyCore's on-device safety classifiers.
Related
- concepts/knowledge-distillation — the standard path to making a large-model capability fit the on-device substrate.
- concepts/quantization — composable compression axis.
- concepts/training-serving-boundary — on-device serving collapses the boundary to a single shipped binary.
- concepts/always-on-ambient-sensing — the tightest-power subset of the on-device envelope.
- concepts/ml-first-architecture — the chip-design stance that emerges when the on-device workload is ML-dominated.
- concepts/fragmented-hardware-software-ecosystem — the software-side consequence of heterogeneous edge accelerators.
- concepts/federated-analytics — dual discipline; how on-device models are observed without data leaving the device.
- concepts/secure-aggregation — cryptographic primitive for privacy-preserving fleet measurement.
- patterns/teacher-student-model-compression — engineering pattern that delivers an on-device student.
- patterns/cryptography-plus-tee-defense-in-depth — the measurement-side composition that complements on-device deployment.
- systems/mobilenet — canonical on-device backbone.
- systems/youtube-real-time-generative-ai-effects — canonical wiki production instance.
- systems/coral-npu — canonical wiki reference-silicon instance for the ambient-sensing end of the envelope.
- systems/android-safetycore — canonical wiki on-device-classifier consumer of the federated-analytics measurement infrastructure.