PATTERN Cited by 1 source
Offline teacher → online student distillation¶
Intent¶
Run an expensive, high-quality offline inference pipeline (typically a frontier LLM + RAG + domain-specific context) not only to serve production head traffic via a cache, but also to curate the training dataset for a smaller real-time student model that serves the tail.
The offline pipeline's output is dual-purposed: the same LLM calls produce both the cache entries that serve ~98% of live traffic and the supervised labels for the student that serves the remaining ~2%. No separate dataset-labeling pipeline is needed.
Why this specific shape matters¶
The naive teacher-student shape ships a student, retires the teacher. In that shape, the teacher's output is training data only. If you want the teacher's quality on any live traffic, you have to run it live — expensive.
This pattern inverts the framing: the teacher is structurally kept in the serving path (via the cache) for the head, while being repurposed to produce training data for the student that covers the tail. Three consequences:
- Economy of investment. Every LLM token spent in the offline pipeline contributes to both serving quality (cache) and student quality (training data). No duplicate spend.
- Student training data is fresh. As the offline pipeline processes new head queries (for cache population), it also emits new training examples. The student's training set grows automatically.
- Quality ceiling is the teacher, not the student. On the head, serving quality is directly the teacher's output. On the tail, serving quality is approximately the teacher's output (modulo student capacity). The teacher is the quality anchor.
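The dual-use mechanics above can be sketched in a few lines. This is a minimal illustration, not the Instacart implementation: `run_teacher` is a hypothetical stub standing in for the expensive RAG + frontier-LLM pipeline, and the tag values are made up.

```python
from dataclasses import dataclass

@dataclass
class TeacherResult:
    query: str
    tags: list  # e.g. the (query, tag-set) pairs of the Instacart SRL case

def run_teacher(query: str) -> TeacherResult:
    # Hypothetical stand-in for the offline pipeline
    # (RAG + domain context + frontier LLM + post-processing guardrail).
    return TeacherResult(query=query, tags=["produce"])

cache = {}          # serves the live head via exact-match lookup
training_set = []   # supervises the student that serves the tail

def process_head_query(query: str) -> None:
    """One teacher call feeds both the cache and the training set."""
    result = run_teacher(query)
    cache[result.query] = result.tags                  # serving artifact
    training_set.append((result.query, result.tags))   # training artifact

process_head_query("bananas")
```

The point is structural: there is no branch that produces a label without also producing a cache entry, which is what makes every teacher token dual-spend.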
Shape¶
[live head traffic] ─────────────────────┐
                                         ▼
                          ┌─────────────────────────────┐
                          │      Head-query cache       │
                          └─────────────┬───────────────┘
                                        ▲ populate
                                        │
                     ┌──────────────────────────────────────┐
                     │       Offline teacher pipeline       │
                     │ (RAG + domain context + frontier LLM │
                     │      + post-processing guardrail)    │
                     └──────────────────┬───────────────────┘
                                        │ training set
                                        ▼
                          ┌─────────────────────────────┐
                          │     Supervised training     │
                          │ (LoRA fine-tune on student) │
                          └─────────────┬───────────────┘
                                        ▼
[live tail traffic] ───────▶ [real-time student]
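The serving half of the shape reduces to a two-route dispatch. A minimal sketch, assuming an exact-match cache; the names and the example tags are hypothetical:

```python
def serve(query: str, cache: dict, student) -> list:
    # Head traffic (~98%): answered straight from the teacher-populated cache.
    if query in cache:
        return cache[query]
    # Tail traffic (~2%): a cache miss falls through to the real-time student.
    return student(query)

cache = {"milk": ["dairy"]}
student = lambda q: ["student-tag"]      # stand-in for the fine-tuned model

serve("milk", cache, student)            # hit: teacher-quality output
serve("oat milk 64 oz", cache, student)  # miss: student covers the tail
```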
Canonical wiki instance: Instacart SRL¶
Instacart's Intent Engine SRL system (Source: sources/2025-11-13-instacart-building-the-intent-engine):
- Offline teacher: RAG pipeline with conversion history + catalog + brand-embedding similarity + frontier LLM + post-processing guardrail. Produces high-quality (query, tag-set) pairs.
- Dual output (post's words): "A low-latency cache containing the validated, high quality tags for our most common queries" + "A high-quality training dataset, which is used to teach a light weight real-time model."
- Student training: Llama-3-8B + LoRA on the teacher-generated curriculum dataset.
- Student serving: adapter-merged, served on an H100 at ~300 ms per query; handles only the ~2% of traffic that misses the cache.
- Outcome: student F1 within ~0.1% of frontier teacher; 50% reduction in tail-query user complaints.
The load-bearing insight per the Instacart post: "To manage costs and prove value, we began with an offline LLM pipeline on high-frequency 'head' queries. This cost-effective approach handled the bulk of traffic and generated the data needed to later train a 'student' model for the long tail." — explicit confirmation that the dual-use was the design, not a later bolt-on.
Comparison¶
- patterns/teacher-student-model-compression — parent pattern. The present pattern is a specialisation where the teacher is retained in production via a cache instead of being retired after training.
- patterns/head-cache-plus-tail-finetuned-model — the serving-architecture counterpart. Where this pattern is about the training pipeline shape, head-cache-plus-tail is about the traffic routing shape. They are two halves of the same system.
- patterns/cheap-approximator-with-expensive-fallback — inverted polarity. Cheap first, expensive on uncertainty. This pattern: offline teacher + cache first, cheap student on miss.
- Classical distillation — the academic framing (Hinton/Vinyals/Dean 2015) trains the student on the teacher's soft labels / logits. Offline-teacher-online-student as deployed at Instacart is response distillation — supervised fine-tuning on the teacher's final output. Colloquial industry usage of "distillation" and "teacher/student" fits either.
Trade-offs¶
- Teacher-pipeline cost is front-loaded. You're paying to run the frontier LLM on the head — which is expensive per call, but cheap per query if the cache amortises.
- Student has an inherent quality cap. Whatever the teacher gets wrong, the student learns as correct. If the teacher systematically mis-tags a class of queries, the student inherits the bias. Mitigation: post-processing guardrails on the teacher's output before it becomes training data.
- Training-set distribution bias. The offline pipeline runs on head queries, but the student serves tail queries. If head and tail are distributionally different (different vocab, different structure), the student may underperform on the tail. Mitigation: deliberately include some tail queries in the offline pipeline's curation pass.
- Cache + student + teacher compose into a three-source-of-truth problem. When the teacher's prompt changes, the old cache + the old student both become stale in different ways. Cache can be regenerated; the student has to be retrained.
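The front-loaded-cost trade-off is easy to make concrete. A small sketch with hypothetical numbers (the per-call price and traffic volumes are illustrative, not from the source):

```python
def amortised_cost_per_query(teacher_cost_per_call: float,
                             head_queries_curated: int,
                             live_queries_served: int) -> float:
    """Teacher spend spread over every live query its cache entries answer."""
    total_teacher_spend = teacher_cost_per_call * head_queries_curated
    return total_teacher_spend / live_queries_served

# Hypothetical: $0.05 per frontier-LLM call, 10k head queries curated once,
# cache then serves 10M live requests before the next refresh.
amortised_cost_per_query(0.05, 10_000, 10_000_000)  # $500 spend, $0.00005/query
```

The same arithmetic exposes the scaling caveat: spend grows with the number of head queries curated, not with tail volume.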
Caveats¶
- The pattern assumes teacher output is cacheable AND label-able. If the teacher emits streaming or highly context-dependent output, packaging it as both a cache entry and a supervised label takes more work.
- Offline pipeline cost scales with head size, not tail size. This is the economic advantage — the frontier LLM only runs on the head — but it also means head-size growth can blow through the offline budget even while tail-size growth is free.
- Multiple tasks per base model require careful isolation: if the student is fine-tuned on multiple teacher outputs (SRL + classification + rewrites), the training regime has to balance them deliberately.
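One deliberate balancing scheme is explicit per-task sampling weights when assembling each training batch. A sketch under assumed task names and weights (the datasets, weights, and the uniform within-task sampling are all illustrative):

```python
import random

def balanced_batch(datasets: dict, weights: dict, batch_size: int,
                   seed: int = 0) -> list:
    """Draw a batch with fixed per-task proportions so no single teacher
    output stream (e.g. SRL) drowns out the others (e.g. rewrites)."""
    rng = random.Random(seed)
    tasks = list(datasets)
    task_weights = [weights[t] for t in tasks]
    batch = []
    for _ in range(batch_size):
        task = rng.choices(tasks, weights=task_weights)[0]  # pick a task
        batch.append(rng.choice(datasets[task]))            # then an example
    return batch

# Hypothetical teacher-generated datasets for two tasks.
datasets = {"srl": [("q1", ["tag"])], "rewrite": [("q2", "q2-rewritten")]}
weights = {"srl": 0.5, "rewrite": 0.5}
batch = balanced_batch(datasets, weights, batch_size=8)
```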
Seen in¶
- sources/2025-11-13-instacart-building-the-intent-engine — canonical reference; explicit description of the dual-use offline RAG pipeline producing both the head cache and the student's training dataset.
Related¶
- patterns/teacher-student-model-compression — parent pattern
- patterns/head-cache-plus-tail-finetuned-model — serving-architecture counterpart
- patterns/cheap-approximator-with-expensive-fallback — inverted-polarity sibling
- concepts/knowledge-distillation — academic framing
- concepts/training-serving-boundary — cross-boundary concern: teacher is offline-training + online-cache-serving simultaneously
- concepts/lora-low-rank-adaptation — typical student-training mechanism
- concepts/context-engineering — the teacher's prompt-construction technique
- systems/instacart-intent-engine
- companies/instacart