PATTERN
Head cache plus tail fine-tuned model¶
Intent¶
Serve an ML inference workload whose traffic is power-law-distributed over inputs (search queries, URLs, product SKUs) by splitting the traffic into a head and a tail:
- Head (the common inputs) — precompute the inference result with an expensive, high-quality offline pipeline + cache. Live traffic hits the cache at near-zero latency + zero per-request inference cost.
- Tail (the rare, long-tail inputs) — serve with a fast real-time model, typically a smaller fine-tuned or distilled student trained on the offline pipeline's output.
The pattern captures the best of both worlds: the quality of the expensive offline model (via the cache) for most traffic; the coverage of a real-time model for the infinite-cardinality tail that cannot be pre-computed.
When to apply¶
Apply when all four conditions hold together:
- Traffic is power-law over inputs — ~80-98% of requests cluster on a small head set. Search, catalog extraction, URL rendering all fit.
- High-quality offline inference is meaningfully slower or more expensive than a production-viable student — otherwise you'd just serve the head model everywhere.
- Tail cardinality is unbounded enough that pre-computation doesn't cover it — "the 'long-tail' of new and unique searches is effectively infinite" (Instacart).
- The student model can be trained on the offline pipeline's output — the dual-purpose offline-teacher-online-student pipeline is the enabler.
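The first condition is measurable before any design work: replay a query log and check what share of traffic the top-K inputs cover. A minimal sketch (toy log and the `head_coverage` helper are illustrative, not from the source):

```python
from collections import Counter

def head_coverage(queries, head_size):
    """Fraction of total traffic covered by the `head_size` most frequent inputs."""
    counts = Counter(queries)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(head_size)) / total

# Toy log: one dominant query plus a tail of unique queries.
log = ["milk"] * 90 + [f"rare query {i}" for i in range(10)]
print(head_coverage(log, head_size=1))  # 0.9 — one cached input covers 90% of traffic
```

If the curve needs a very large `head_size` to reach ~80%+, the workload is not power-law enough for this pattern and the cache won't pay for itself.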
Shape¶
┌──────────────────────┐
│ Expensive offline │
│ teacher pipeline │
│ (RAG + frontier LLM)│
└──────────┬───────────┘
│
┌─────────────────────┴─────────────────────┐
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Cache (head) │ │ Training dataset│
│ ~98% of traffic│ │ for student │
└────────┬────────┘ └──────────┬───────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ Real-time │
│ │ student model │
│ │ (fine-tuned 8B) │
│ │ ~2% of traffic │
│ └──────────┬───────┘
│ │
└────────────────────┬──────────────────────┘
▼
┌───────────────┐
│ Live traffic │
└───────────────┘
Traffic routes on cache hit/miss. Miss → real-time student.
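The routing logic in the diagram reduces to a cache-first lookup with the student as the miss path. A minimal sketch, with hypothetical names (`cache`, `student_model`) and a dict standing in for the head-tag store:

```python
def serve(query, cache, student_model):
    """Cache-first routing: exact-match head lookup, real-time student on miss."""
    hit = cache.get(query)           # precomputed teacher output for head queries
    if hit is not None:
        return hit                   # near-zero latency, zero inference cost
    return student_model(query)      # real-time inference for the long tail

# Toy instance: dict as the head cache, a stub student model.
cache = {"milk": ["PRODUCT:milk"]}
student = lambda q: [f"PRODUCT:{q.split()[-1]}"]

print(serve("milk", cache, student))              # head: served from cache
print(serve("organic oat milk", cache, student))  # tail: student inference
```

Note the trigger is an exact-match lookup, not a confidence score — which is what distinguishes this routing from the cheap-approximator-with-expensive-fallback pattern discussed below.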
Canonical wiki instance: Instacart SRL¶
Instacart's Intent Engine SRL system is the canonical wiki reference (Source: sources/2025-11-13-instacart-building-the-intent-engine):
- Teacher pipeline: offline RAG (conversion data + catalog + brand-similarity embeddings) + frontier LLM + post-processing guardrail → tagged head queries.
- Cache: head-query tag lookup; serves ~98% of queries.
- Student: LoRA-fine-tuned Llama-3-8B trained on the teacher pipeline's output; served at ~300 ms on H100 via adapter merge.
- Cache-miss share: ~2% of queries.
- Production impact: 6% scroll-depth reduction, 50% reduction in user complaints, millions of cold-start queries served weekly.
The economic crux is the 2% number: "smart caching meant only 2% of queries needed real-time inference." The fine-tuned 8B only pays its serving cost on that 2% — a ~50× cost reduction vs. routing all queries through it.
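The ~50× figure falls straight out of the miss rate. A back-of-envelope check, normalising the student's per-query serving cost to 1 and treating cache hits as free at the margin:

```python
miss_rate = 0.02                 # share of queries that miss the head cache
cache_hit_cost = 0.0             # cache hits: no per-request inference cost

all_student = 1.0                                             # every query through the 8B
hybrid = miss_rate * 1.0 + (1 - miss_rate) * cache_hit_cost   # only misses pay

print(round(all_student / hybrid))  # 50 — the ~50x serving-cost reduction
```

The same arithmetic shows why the split ratio is the economic lever: at an 80/20 split the reduction is only 5×, which may no longer justify the offline pipeline.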
Comparison to adjacent patterns¶
- patterns/traffic-aware-prerendering — the web/CDN analogue. Pre-render the head of URLs with an expensive build; serve the long tail dynamically. Same power-law traffic shape, different artefact.
- patterns/cheap-approximator-with-expensive-fallback — inverted polarity. Cheap approximator runs first, expensive model on high uncertainty. Head-cache-plus-tail uses the cache first, cheap student on miss — a different trigger (exact-match lookup) and a different cost shape (the expensive one ran offline, not in the request path).
- patterns/teacher-student-model-compression — the training half of this pattern. Head-cache-plus-tail includes teacher-student compression but adds the dual-purpose cache + the routing split + the head/tail traffic economics.
- patterns/llm-extraction-cache-by-similarity — adjacent: cache keyed by approximate similarity instead of exact match. Useful when head cardinality is too high for exact caching; orthogonal to the head/tail split discussed here.
Trade-offs¶
- Cache freshness is a product decision. Head queries' tags may drift when the catalog changes or seasonality shifts. TTL, refresh cadence, and invalidation on catalog events are all tuning knobs.
- Head/tail split is a business-facing assumption. The 98/2 split is Instacart's; a different workload may sit at 80/20, 70/30, or even 50/50. Measure before designing.
- Tail quality is bounded by the student. Investing in a better offline teacher improves head cache and student training data — but the tail only sees the student's approximation. If the teacher-student quality gap is too large, tail quality lags unacceptably.
- Offline pipeline → cache invalidation is a data-eng problem. The pipeline needs a complete-and-deploy-atomically discipline to avoid serving half-updated caches.
- Monitoring discipline: head and tail have different SLOs. Head is cache-hit latency (should be ~0 ms); tail is student-inference latency (measured, e.g., p50 / p95). Alerting on "overall latency" blurs the two.
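The complete-and-deploy-atomically discipline can be sketched as a snapshot swap: build the full new cache off to the side, validate it, then flip a single reference so readers only ever see a complete version. An illustrative in-process sketch (the class and validation rule are assumptions; production systems do the same dance with e.g. a key-prefix or manifest-pointer flip):

```python
class VersionedCache:
    """Head cache deployed as whole snapshots, never updated in place.

    Readers always observe either the old complete snapshot or the new
    one — never a half-updated mix.
    """
    def __init__(self):
        self._live = {}

    def get(self, key):
        return self._live.get(key)

    def deploy(self, new_snapshot):
        # Validate completeness BEFORE the swap, not after.
        if not new_snapshot:
            raise ValueError("refusing to deploy an empty snapshot")
        self._live = dict(new_snapshot)  # single reference assignment = atomic flip

cache = VersionedCache()
cache.deploy({"milk": ["PRODUCT:milk"]})
print(cache.get("milk"))
```

The same structure also makes rollback trivial: keep the previous snapshot and flip the reference back.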
Caveats¶
- Dual-use teacher output is not free. Designing the offline pipeline so its output is both production-cacheable AND training-data-grade requires discipline — the artefacts need schema validity, versioning, and dedup guarantees appropriate to both uses.
- Precision-over-recall posture on the student. Instacart explicitly ships higher precision / lower recall for SRL because a confident-wrong tag is worse than a missing tag in retrieval context. Per-workload polarity needs reconfirming.
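The precision-over-recall posture typically lands as a confidence threshold on the student's output: emit only tags the model is sure about, accepting missed tags in exchange for fewer confident-wrong ones. A minimal sketch — the threshold value, tag format, and scoring are illustrative assumptions, not Instacart's actual mechanism:

```python
def emit_tags(scored_tags, threshold=0.9):
    """Keep only tags at or above a confidence threshold (precision over recall).

    A confident-wrong tag degrades retrieval more than a missing tag, so
    low-confidence predictions are dropped rather than served.
    """
    return [tag for tag, score in scored_tags if score >= threshold]

print(emit_tags([("PRODUCT:milk", 0.97), ("BRAND:oatly", 0.55)]))  # ['PRODUCT:milk']
```

Raising the threshold trades recall for precision; the right setting depends on how the downstream consumer punishes wrong versus missing tags, which is why the polarity needs per-workload reconfirmation.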
Seen in¶
- sources/2025-11-13-instacart-building-the-intent-engine — canonical reference; hybrid cache + Llama-3-8B LoRA student at Instacart for production SRL with 98/2 head/tail split and explicit economic numbers.
Related¶
- patterns/offline-teacher-online-student-distillation — training-half companion pattern
- patterns/teacher-student-model-compression — more general model-compression shape
- patterns/cheap-approximator-with-expensive-fallback — inverted polarity sibling
- patterns/traffic-aware-prerendering — web-rendering analogue
- patterns/llm-extraction-cache-by-similarity — approximate-match cache variant
- concepts/query-understanding / concepts/semantic-role-labeling / concepts/long-tail-query
- concepts/power-law-url-traffic — web-side analogue
- concepts/lora-low-rank-adaptation / concepts/adapter-merging — student-serving levers
- systems/instacart-intent-engine
- companies/instacart