
CONCEPT Cited by 2 sources

On-Device ML Inference

On-device ML inference is the practice of running ML inference on end-user hardware — smartphones, laptops, browsers, embedded devices, NPUs — rather than on cloud servers. In the serving-infra framing: inference compute is paid for in watts of the user's battery and milliseconds of the user's frame budget instead of in rack-rented GPU-hours. Model design, compilation, quantisation, and deployment therefore all have to respect a different cost function from server-side inference.

Constraints on-device inference runs into

On-device is budget-constrained along multiple axes simultaneously:

  • Compute. Mobile CPUs / GPUs / NPUs are orders of magnitude slower than datacentre accelerators. Models that comfortably meet server-side p99 latency targets can be structurally impossible to run at camera frame rate on a phone.
  • Memory. Phone RAM is typically 2–12 GB total across the entire OS. Model weights, activations, and scratch space all compete with the rest of the app and the OS.
  • Model size on disk. Users download apps. An extra 500 MB of model weights can halve install conversion. Pressure to keep the on-device model small comes from distribution, not just runtime.
  • Battery / thermal. Sustained inference drains battery and trips thermal throttling. A model that passes functional benchmarks may be unshippable if it heats the device in minutes.
  • Heterogeneity. The device fleet spans flagship, mid-range, and entry-level phones on both Android and iOS, across the last 3–5 years of hardware. A single model / precision / op-set may not run on all of them; per-device-class specialisation is the norm.
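The frame-rate constraint above is just arithmetic, and it is worth making explicit. A minimal sketch with assumed numbers (`frame_budget_ms` and `fits_budget` are hypothetical helpers, and the 50% headroom figure is an illustrative assumption, not a measured rule):

```python
# Back-of-envelope budget check for an on-device vision model.
# All concrete numbers here are illustrative assumptions, not measurements.

def frame_budget_ms(fps: int) -> float:
    """Total time available per frame at a target camera frame rate."""
    return 1000.0 / fps

def fits_budget(model_latency_ms: float, fps: int, headroom: float = 0.5) -> bool:
    """Leave roughly half the frame for capture, pre/post-processing, rendering."""
    return model_latency_ms <= frame_budget_ms(fps) * headroom

print(frame_budget_ms(30))        # ~33.3 ms per frame at 30 fps
print(fits_budget(12.0, fps=30))  # a 12 ms model fits in a ~16 ms slice
print(fits_budget(40.0, fps=30))  # a 40 ms model can never sustain 30 fps
```

The same arithmetic explains why a model that is fine for a one-shot photo feature (hundreds of milliseconds) can be unusable for a live camera effect.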

Why teams choose on-device over server-side

  • Latency. Real-time use cases (camera filters, live transcription, AR overlays, AV perception) can't round-trip to a server on every frame. On-device is often the only deployment point that meets the latency budget.
  • Privacy. User data (camera frames, microphone audio, contacts) never leaves the device. Avoids regulatory exposure and user trust erosion.
  • Offline / flaky-network behaviour. Inference works whether or not the user is connected. Important for consumer apps that ship globally.
  • Cost at scale. Consumer apps at the billion-user scale (YouTube, Maps, photo apps) can't economically serve per-user ML features from cloud GPU inference; the server-side unit economics don't close. On-device shifts the compute cost to the user's device.

The engineering stack

On-device ML serving is a compilation pipeline, not a model pipeline:

  • Trained model (PyTorch, JAX, TF).
  • Compression — distillation to a smaller architecture (concepts/knowledge-distillation), quantisation (concepts/quantization), pruning.
  • Runtime compile — TensorFlow Lite / LiteRT, Core ML, ONNX Runtime Mobile, custom inference engines.
  • Hardware dispatch — CPU, mobile GPU, NPU / Neural Engine, DSP. Op-set support varies; fallback paths required.
  • Packaging — weights shipped inside the app bundle or downloaded on first launch; often multiple variants keyed by device class.
  • Observability — on-device metrics (latency histograms, thermal events, crashes) reported back to the server for model-quality monitoring.
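The compression step in the pipeline above can be illustrated with a toy symmetric int8 quantiser. This is a pure-Python sketch, not a real runtime API — runtimes like LiteRT apply the same idea per tensor or per channel with calibrated scales, and `quantize_int8` / `dequantize` are hypothetical helpers:

```python
# Minimal sketch of post-training symmetric int8 quantisation:
# float weights become one int8 value each plus a shared per-tensor scale,
# cutting weight storage roughly 4x versus float32.

def quantize_int8(weights):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight is recovered to within one quantisation step (the scale):
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```

The error bound in the final assertion is the whole trade: one quantisation step of accuracy loss per weight, in exchange for a 4× smaller download and integer arithmetic the NPU / DSP can execute natively.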

Canonical wiki instance

YouTube real-time generative AI effects — YouTube ships stylised camera effects generated by a small on-device student model (UNet + MobileNet backbone + MobileNet-block decoder) that was distilled from a large offline teacher (StyleGAN2 then Imagen). The teacher is "far too slow for real-time use"; the student runs at camera frame rate on the phone (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai). Canonical because the post names the tensions explicitly: a powerful generative model class that the serving target (a phone) cannot run, bridged by distillation to a mobile-efficient student architecture (MobileNet).
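The YouTube case distils an image-to-image teacher pixel-wise, but the generic logit-matching form of knowledge distillation is easy to sketch. A pure-Python toy (real training uses a framework, batches, and usually a weighted sum with the hard-label loss; the temperature value here is an illustrative assumption):

```python
import math

# Generic knowledge distillation: the student is trained to match the
# teacher's softened output distribution, not just the hard labels.

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

teacher = [4.0, 1.0, 0.2]
aligned = distillation_loss([4.0, 1.0, 0.2], teacher)  # student matches teacher
off = distillation_loss([0.2, 1.0, 4.0], teacher)      # student disagrees
assert aligned < off  # the loss rewards matching the teacher
```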

Architectures selected for on-device

A running list of architectures whose existence is justified by on-device constraints:

  • MobileNet — depthwise-separable convolutions; Google's mobile-first CNN family (v1 2017, v2 2018, v3 2019).
  • EfficientNet-Lite — EfficientNet variants tuned for mobile ops / quantisation.
  • MobileViT — vision-transformer / MobileNet hybrid for mobile.
  • DistilBERT / TinyBERT / MobileBERT — distilled BERT variants for on-device NLP.
  • SqueezeNet — early (2016) on-device CNN compression shape.
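Why the depthwise-separable convolutions behind MobileNet earn their place on this list can be checked with back-of-envelope arithmetic (the layer shape below is an assumed example, not from any specific model):

```python
# Parameter counts: standard convolution vs. its depthwise-separable
# replacement, the core trick of the MobileNet family.

def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """A k x k conv with c_out filters, each spanning all c_in channels."""
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv to mix channels
    return depthwise + pointwise

# An assumed 3x3 layer, 256 -> 256 channels:
std = standard_conv_params(3, 256, 256)   # 589,824 parameters
sep = separable_conv_params(3, 256, 256)  # 2,304 + 65,536 = 67,840
print(std / sep)                          # roughly an 8-9x reduction
```

The same factorisation shrinks multiply-accumulate counts by a similar factor, which is what makes the family viable on mobile CPUs and NPUs.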
