PATTERN
Teacher-Student Model Compression¶
Teacher-student model compression is the engineering pattern of wrapping knowledge distillation in a production deployment shape: pick a model class that solves the task at the quality you want, accept that it cannot run on the serving substrate, run it offline as a teacher, train a student with a different architecture to imitate the teacher on the serving input distribution, and deploy only the student. The teacher never runs in production.
The pattern is named separately from distillation because the decision structure is architectural, not algorithmic — it dictates what you ship, how the training and serving pipelines relate, and what the product team can vary independently.
Shape¶
[teacher corpus] ──► [TEACHER]  (offline, expensive, high-quality)
                         │
                         │ produces outputs for
                         ▼
            [representative input sample]
                         │
                         │ (input, teacher-output) pairs
                         ▼
                [STUDENT training]
                         │
                         ▼
                  [STUDENT] ─► serving substrate (phone / browser / edge)
Key invariants:
- Teacher stays offline. Teacher compute is paid in training-infra hours, not per-serving-request. Upgrading the teacher is an offline retraining job.
- Student stays on-distribution. Student is trained on the input distribution the serving target will actually see. If that distribution drifts, the student retrains; teacher architecture doesn't need to change.
- Teacher / student architectures decouple. They often don't even share a model class — a diffusion teacher with a convolutional student, or a transformer teacher with an MLP student, are common shapes.
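The shape and invariants above can be sketched end-to-end. A minimal toy illustration (all names hypothetical, with a cheap function standing in for an expensive generative teacher and a linear model standing in for a MobileNet-class student): the teacher labels a sample drawn from the narrow serving input distribution offline, the student is fitted to those pairs, and only the student is "deployed".

```python
import math
import random

def teacher(x):
    """Stand-in for the expensive offline model. Never called at serving time."""
    return math.tanh(3.0 * x)

def distil(n_pairs=2000, lo=-0.3, hi=0.3, epochs=200, lr=0.5, seed=0):
    """Offline step: label a sample of the *serving* input distribution
    with the teacher, then fit a tiny linear student to the pairs."""
    rng = random.Random(seed)
    pairs = [(x, teacher(x)) for x in (rng.uniform(lo, hi) for _ in range(n_pairs))]
    a, b = 0.0, 0.0                      # student parameters: y ≈ a*x + b
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in pairs:               # mean-squared-error gradients
            err = (a * x + b) - y
            ga += 2 * err * x / len(pairs)
            gb += 2 * err / len(pairs)
        a -= lr * ga
        b -= lr * gb
    return lambda x: a * x + b           # only this artifact ships

student = distil()

# On the serving distribution the student tracks the teacher closely;
# far outside it, it has no principled behaviour (see "Failure modes").
on_err = max(abs(student(x) - teacher(x)) for x in [i / 100 for i in range(-30, 31)])
off_err = abs(student(2.0) - teacher(2.0))
```

Note how the invariants fall out of the code: the teacher is only referenced inside `distil` (offline), the architectures share nothing, and narrowing `lo`/`hi` is what makes a linear student adequate at all.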
Canonical wiki instance — YouTube real-time generative AI effects (2025-08-21)¶
Google Research describes the pattern deployed end-to-end on YouTube's real-time on-device generative AI effects (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai):
- Teacher (initial): custom-trained StyleGAN2 on a curated facial-effects dataset, optionally paired with StyleCLIP for text-driven facial-feature manipulation.
- Teacher (upgraded): Google DeepMind's Imagen diffusion model. Upgrade expanded effect fidelity and style diversity.
- Student: UNet-based image-to-image architecture with a MobileNet encoder and a MobileNet-block decoder. Runs at camera-frame rate on user phones.
- What the pattern buys YouTube: a product-scale effect library driven by powerful generative teachers, delivered through a mobile-substrate-compatible student. Teacher upgrades (StyleGAN2 → Imagen) ship new effects without disturbing the serving stack — the student retrains; the app binary ships a new weights bundle.
When the pattern fits¶
- Serving substrate structurally rejects the ideal model class. The teacher can't be quantised / pruned down to fit; it needs a different architecture entirely. YouTube's StyleGAN2 / Imagen → MobileNet-UNet is this case.
- Input distribution is narrow relative to the teacher's generality. Student doesn't need to match the teacher on all possible inputs — only the inputs the serving target will see. Narrower distributions are easier to distil.
- Product cadence needs teacher-side iteration without disturbing the serving stack. Separating teacher and student lets product evolution (new effects, new styles) happen on the teacher side while the serving stack stays stable.
- Runtime fallback to the teacher is impossible or undesirable. When the serving substrate can't reach the teacher at request time (mobile privacy / network / latency), distillation is the only path. This distinguishes the pattern from patterns/cheap-approximator-with-expensive-fallback, where a runtime fallback is part of the design.
When it doesn't¶
- Serving substrate can host the ideal model. If the teacher fits (possibly after quantisation or pruning), no distillation step is needed.
- Serving distribution is the full distribution. Covering the teacher's full generality with a small student is hard; approximator-with-fallback shapes are preferable.
- Teacher is cheap at serving time. Then distillation buys nothing; serve the teacher directly.
Failure modes¶
- Out-of-distribution inputs. Student has no principled behaviour on inputs the teacher output distribution didn't cover at training time. Critical for camera / microphone / user-content inputs where distribution shift is real.
- Teacher-upgrade drift. A new teacher changes the output distribution; student retraining is an explicit release dependency. Product regressions are the risk if the pipeline doesn't manage it.
- Quality ceiling. Student quality is bounded by teacher quality on the serving distribution plus whatever the student architecture's capacity can absorb. No free lunch.
- Debugging asymmetry. Production issues manifest in the student; root cause often sits in the teacher or the teacher-generated training set. Teacher traces aren't in the serving stack.
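The teacher-upgrade-drift failure mode is mechanically a release-gate problem. A hedged sketch (all names hypothetical): before shipping against a new teacher, replay a held-out sample of serving inputs through both the current student and the new teacher; if disagreement exceeds a budget, the student must retrain before release.

```python
def max_disagreement(student, teacher, inputs):
    """Replay serving-distribution inputs through both models and
    report the worst student-vs-teacher gap."""
    return max(abs(student(x) - teacher(x)) for x in inputs)

def release_gate(student, new_teacher, inputs, budget):
    """Teacher upgrades are an explicit release dependency: the existing
    student only ships against the new teacher if it still agrees."""
    return max_disagreement(student, new_teacher, inputs) <= budget

# Toy models: the student was distilled from teacher_v1; the upgrade
# (teacher_v2) shifts the output distribution.
teacher_v1 = lambda x: 2.0 * x
teacher_v2 = lambda x: 2.0 * x + 0.5   # upgraded teacher drifts
student = lambda x: 2.0 * x            # distilled against v1

sample = [i / 10 for i in range(-5, 6)]
ok_v1 = release_gate(student, teacher_v1, sample, budget=0.1)  # gate passes
ok_v2 = release_gate(student, teacher_v2, sample, budget=0.1)  # gate blocks: retrain first
```

In a real pipeline the gate would run on perceptual or task metrics rather than pointwise error, but the dependency structure is the same: teacher upgrade → gate → student retrain → weights-bundle release.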
Relation to patterns/cheap-approximator-with-expensive-fallback¶
Both patterns address the same economic force — an expensive reference computation, a cheap runtime budget — but differ on one load-bearing axis: whether there is a runtime path to the expensive computation.
| Axis | Teacher-student compression | Cheap-approximator-with-fallback |
|---|---|---|
| Expensive computation reachable at serving time | No | Yes |
| Cheap model must cover serving distribution alone | Yes | No — fallback covers OOD |
| Calibrated uncertainty load-bearing | Usually not required | Required (gates fallback) |
| Canonical wiki instance | YouTube real-time gen-AI effects | Google RLM / Borg bin-packer |
The choice is substrate-driven, not taste-driven: if the phone can't reach the teacher in real time, it's distillation-only.
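The load-bearing axis in the table can be made concrete as two serving loops. A hedged sketch (hypothetical names): the fallback pattern gates on an uncertainty signal and escalates to the expensive model at request time; the compression pattern has no such path, so the student answers unconditionally.

```python
def serve_with_fallback(x, approximator, expensive, uncertainty, threshold):
    """Cheap-approximator-with-expensive-fallback: the expensive model
    is reachable at serving time; calibrated uncertainty gates it."""
    return expensive(x) if uncertainty(x) > threshold else approximator(x)

def serve_distilled(x, student):
    """Teacher-student compression: the teacher is offline-only, so the
    student answers every request, in- or out-of-distribution."""
    return student(x)

# Toy instance: the student matches the expensive model on |x| <= 1 only.
expensive = lambda x: min(max(2.0 * x, -2.0), 2.0)
student = lambda x: 2.0 * x
uncertainty = lambda x: 0.0 if abs(x) <= 1 else 1.0  # stand-in OOD signal

fallback_out = serve_with_fallback(5.0, student, expensive, uncertainty, 0.5)
distilled_out = serve_distilled(5.0, student)  # no escape hatch for OOD input
```

On the OOD input, the fallback loop returns the expensive model's answer while the distilled loop returns whatever the student extrapolates — exactly the "runtime path" distinction the table pins down.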
Seen in¶
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — canonical wiki instance at image-to-image level on mobile. Teacher: systems/stylegan2 → systems/imagen-google-deepmind. Student: UNet + MobileNet.
Related¶
- concepts/knowledge-distillation — the underlying technique.
- concepts/on-device-ml-inference — the typical deployment target.
- concepts/training-serving-boundary — the architectural split the pattern formalises.
- patterns/cheap-approximator-with-expensive-fallback — sibling pattern; different answer to "is there a runtime fallback".
- patterns/draft-verify-inference — a third two-model deployment shape. Teacher-student compression has the teacher offline; cheap-approximator-with-fallback has the expensive computation reachable at request time on a fraction of requests; draft-verify-inference has both models online per request with the expensive one acting as a per-token verifier rather than a per-query fallback. All three share the drafter-expert-shaped economic substrate.
- systems/mobilenet — canonical student-backbone architecture.
- systems/unet-architecture — canonical image-to-image student topology.
- systems/youtube-real-time-generative-ai-effects — canonical wiki production instance.