PATTERN
Teacher-Student Model Compression¶
Teacher-student model compression is the engineering pattern of wrapping knowledge distillation in a production deployment shape: pick a model class that solves the task at the quality you want, accept that it cannot run on the serving substrate, run it offline as a teacher, train a student with a different architecture to imitate the teacher on the serving input distribution, and deploy only the student. The teacher never runs in production.
The pattern is named separately from distillation because the decision structure is architectural, not algorithmic — it dictates what you ship, how the training and serving pipelines relate, and what the product team can vary independently.
Shape¶
[teacher corpus] ──► [TEACHER]  (offline, expensive, high-quality)
                         │
                         │ produces outputs for
                         ▼
            [representative input sample]
                         │
                         │ (input, teacher-output) pairs
                         ▼
                [STUDENT training]
                         │
                         ▼
                  [STUDENT] ─► serving substrate (phone / browser / edge)
Key invariants:
- Teacher stays offline. Teacher compute is paid in training-infra hours, not per-serving-request. Upgrading the teacher is an offline retraining job.
- Student stays on-distribution. Student is trained on the input distribution the serving target will actually see. If that distribution drifts, the student retrains; teacher architecture doesn't need to change.
- Teacher / student architectures decouple. They often don't even share a model class — a diffusion teacher with a convolutional student, or a transformer teacher with an MLP student, are common shapes.
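The shape and invariants above can be sketched end-to-end. A minimal toy illustration (all names hypothetical, with a cheap function standing in for an expensive generative teacher and a linear model standing in for a MobileNet-class student): the teacher labels a sample drawn from the narrow serving input distribution offline, the student is fitted to those pairs, and only the student is "deployed".

```python
import math
import random

def teacher(x):
    """Stand-in for the expensive offline model. Never called at serving time."""
    return math.tanh(3.0 * x)

def distil(n_pairs=2000, lo=-0.3, hi=0.3, epochs=200, lr=0.5, seed=0):
    """Offline step: label a sample of the *serving* input distribution
    with the teacher, then fit a tiny linear student to the pairs."""
    rng = random.Random(seed)
    pairs = [(x, teacher(x)) for x in (rng.uniform(lo, hi) for _ in range(n_pairs))]
    a, b = 0.0, 0.0                      # student parameters: y ≈ a*x + b
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in pairs:               # mean-squared-error gradients
            err = (a * x + b) - y
            ga += 2 * err * x / len(pairs)
            gb += 2 * err / len(pairs)
        a -= lr * ga
        b -= lr * gb
    return lambda x: a * x + b           # only this artifact ships

student = distil()

# On the serving distribution the student tracks the teacher closely;
# far outside it, it has no principled behaviour (see "Failure modes").
on_err = max(abs(student(x) - teacher(x)) for x in [i / 100 for i in range(-30, 31)])
off_err = abs(student(2.0) - teacher(2.0))
```

Note how the invariants fall out of the code: the teacher is only referenced inside `distil` (offline), the architectures share nothing, and narrowing `lo`/`hi` is what makes a linear student adequate at all.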
Canonical wiki instance — YouTube real-time generative AI effects (2025-08-21)¶
Google Research describes the pattern deployed end-to-end on YouTube's real-time on-device generative AI effects (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai):
- Teacher (initial): custom-trained StyleGAN2 on a curated facial-effects dataset, optionally paired with StyleCLIP for text-driven facial-feature manipulation.
- Teacher (upgraded): Google DeepMind's Imagen diffusion model. Upgrade expanded effect fidelity and style diversity.
- Student: UNet-based image-to-image architecture with a MobileNet encoder and a MobileNet-block decoder. Runs at camera-frame rate on user phones.
- What the pattern buys YouTube: a product-scale effect library driven by powerful generative teachers, delivered through a mobile-substrate-compatible student. Teacher upgrades (StyleGAN2 → Imagen) ship new effects without disturbing the serving stack — the student retrains; the app binary ships a new weights bundle.
When the pattern fits¶
- Serving substrate structurally rejects the ideal model class. The teacher can't be quantised / pruned down to fit; it needs a different architecture entirely. YouTube's StyleGAN2 / Imagen → MobileNet-UNet is this case.
- Input distribution is narrow relative to the teacher's generality. Student doesn't need to match the teacher on all possible inputs — only the inputs the serving target will see. Narrower distributions are easier to distil.
- Product cadence needs teacher-side iteration without disturbing the serving stack. Separating teacher and student lets product evolution (new effects, new styles) happen on the teacher side while the serving stack stays stable.
- Runtime fallback to the teacher is impossible or undesirable. When the serving substrate can't reach the teacher at request time (mobile privacy / network / latency), distillation is the only path. This distinguishes the pattern from patterns/cheap-approximator-with-expensive-fallback, where a runtime fallback is part of the design.
When it doesn't¶
- Serving substrate can host the ideal model. If the teacher fits (possibly after quantisation or pruning), no distillation step is needed.
- Serving distribution is the full distribution. Covering the teacher's full generality with a small student is hard; approximator-with-fallback shapes are preferable.
- Teacher is cheap at serving time. Then distillation buys nothing; serve the teacher directly.
Failure modes¶
- Out-of-distribution inputs. Student has no principled behaviour on inputs the teacher output distribution didn't cover at training time. Critical for camera / microphone / user-content inputs where distribution shift is real.
- Teacher-upgrade drift. A new teacher changes the output distribution; student retraining is an explicit release dependency. Product regressions are the risk if the pipeline doesn't manage it.
- Quality ceiling. Student quality is bounded by teacher quality on the serving distribution plus whatever the student architecture's capacity can absorb. No free lunch.
- Debugging asymmetry. Production issues manifest in the student; root cause often sits in the teacher or the teacher-generated training set. Teacher traces aren't in the serving stack.
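The teacher-upgrade-drift failure mode is mechanically a release-gate problem. A hedged sketch (all names hypothetical): before shipping against a new teacher, replay a held-out sample of serving inputs through both the current student and the new teacher; if disagreement exceeds a budget, the student must retrain before release.

```python
def max_disagreement(student, teacher, inputs):
    """Replay serving-distribution inputs through both models and
    report the worst student-vs-teacher gap."""
    return max(abs(student(x) - teacher(x)) for x in inputs)

def release_gate(student, new_teacher, inputs, budget):
    """Teacher upgrades are an explicit release dependency: the existing
    student only ships against the new teacher if it still agrees."""
    return max_disagreement(student, new_teacher, inputs) <= budget

# Toy models: the student was distilled from teacher_v1; the upgrade
# (teacher_v2) shifts the output distribution.
teacher_v1 = lambda x: 2.0 * x
teacher_v2 = lambda x: 2.0 * x + 0.5   # upgraded teacher drifts
student = lambda x: 2.0 * x            # distilled against v1

sample = [i / 10 for i in range(-5, 6)]
ok_v1 = release_gate(student, teacher_v1, sample, budget=0.1)  # gate passes
ok_v2 = release_gate(student, teacher_v2, sample, budget=0.1)  # gate blocks: retrain first
```

In a real pipeline the gate would run on perceptual or task metrics rather than pointwise error, but the dependency structure is the same: teacher upgrade → gate → student retrain → weights-bundle release.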
Relation to patterns/cheap-approximator-with-expensive-fallback¶
Both patterns address the same economic force — an expensive reference computation, a cheap runtime budget — but differ on one load-bearing axis: whether there is a runtime path to the expensive computation.
| Axis | Teacher-student compression | Cheap-approximator-with-fallback |
|---|---|---|
| Expensive computation reachable at serving time | No | Yes |
| Cheap model must cover serving distribution alone | Yes | No — fallback covers OOD |
| Calibrated uncertainty load-bearing | Usually not required | Required (gates fallback) |
| Canonical wiki instance | YouTube real-time gen-AI effects | Google RLM / Borg bin-packer |
The choice is substrate-driven, not taste-driven: if the phone can't reach the teacher in real time, it's distillation-only.
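The load-bearing axis in the table can be made concrete as two serving loops. A hedged sketch (hypothetical names): the fallback pattern gates on an uncertainty signal and escalates to the expensive model at request time; the compression pattern has no such path, so the student answers unconditionally.

```python
def serve_with_fallback(x, approximator, expensive, uncertainty, threshold):
    """Cheap-approximator-with-expensive-fallback: the expensive model
    is reachable at serving time; calibrated uncertainty gates it."""
    return expensive(x) if uncertainty(x) > threshold else approximator(x)

def serve_distilled(x, student):
    """Teacher-student compression: the teacher is offline-only, so the
    student answers every request, in- or out-of-distribution."""
    return student(x)

# Toy instance: the student matches the expensive model on |x| <= 1 only.
expensive = lambda x: min(max(2.0 * x, -2.0), 2.0)
student = lambda x: 2.0 * x
uncertainty = lambda x: 0.0 if abs(x) <= 1 else 1.0  # stand-in OOD signal

fallback_out = serve_with_fallback(5.0, student, expensive, uncertainty, 0.5)
distilled_out = serve_distilled(5.0, student)  # no escape hatch for OOD input
```

On the OOD input, the fallback loop returns the expensive model's answer while the distilled loop returns whatever the student extrapolates — exactly the "runtime path" distinction the table pins down.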
Seen in¶
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — canonical wiki instance at image-to-image level on mobile. Teacher: systems/stylegan2 → systems/imagen-google-deepmind. Student: UNet + MobileNet.
Related¶
- concepts/knowledge-distillation — the underlying technique.
- concepts/on-device-ml-inference — the typical deployment target.
- concepts/training-serving-boundary — the architectural split the pattern formalises.
- patterns/cheap-approximator-with-expensive-fallback — sibling pattern; different answer to "is there a runtime fallback".
- patterns/draft-verify-inference — a third two-model deployment shape. Teacher-student compression has the teacher offline; cheap-approximator-with-fallback has the expensive computation reachable at request time on a fraction of requests; draft-verify-inference has both models online per request with the expensive one acting as a per-token verifier rather than a per-query fallback. All three share the drafter-expert-shaped economic substrate.
- systems/mobilenet — canonical student-backbone architecture.
- systems/unet-architecture — canonical image-to-image student topology.
- systems/youtube-real-time-generative-ai-effects — canonical wiki production instance.