PATTERN
Head cache plus tail fine-tuned model¶
Intent¶
Serve an ML inference workload whose traffic is power-law-distributed over inputs (search queries, URLs, product SKUs) by splitting the traffic into a head and a tail:
- Head (the common inputs) — precompute the inference result with an expensive, high-quality offline pipeline + cache. Live traffic hits the cache at near-zero latency + zero per-request inference cost.
- Tail (the rare, long-tail inputs) — serve with a fast real-time model, typically a smaller fine-tuned or distilled student trained on the offline pipeline's output.
The pattern captures the best of both worlds: the quality of the expensive offline model (via the cache) for most traffic; the coverage of a real-time model for the infinite-cardinality tail that cannot be pre-computed.
When to apply¶
Apply when all four conditions hold together:
- Traffic is power-law over inputs — ~80-98% of requests cluster on a small head set. Search, catalog extraction, URL rendering all fit.
- High-quality offline inference is meaningfully slower or more expensive than a production-viable student — otherwise you'd just serve the head model everywhere.
- Tail cardinality is unbounded enough that pre-computation doesn't cover it — "the 'long-tail' of new and unique searches is effectively infinite" (Instacart).
- The student model can be trained on the offline pipeline's output — the dual-purpose offline-teacher-online-student pipeline is the enabler.
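The first condition is measurable before any design work: replay a query log and check what share of traffic the top-K inputs cover. A minimal sketch (toy log and the `head_coverage` helper are illustrative, not from the source):

```python
from collections import Counter

def head_coverage(queries, head_size):
    """Fraction of total traffic covered by the `head_size` most frequent inputs."""
    counts = Counter(queries)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(head_size)) / total

# Toy log: one dominant query plus a tail of unique queries.
log = ["milk"] * 90 + [f"rare query {i}" for i in range(10)]
print(head_coverage(log, head_size=1))  # 0.9 — one cached input covers 90% of traffic
```

If the curve needs a very large `head_size` to reach ~80%+, the workload is not power-law enough for this pattern and the cache won't pay for itself.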
Shape¶
┌──────────────────────┐
│ Expensive offline │
│ teacher pipeline │
│ (RAG + frontier LLM)│
└──────────┬───────────┘
│
┌─────────────────────┴─────────────────────┐
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Cache (head) │ │ Training dataset│
│ ~98% of traffic│ │ for student │
└────────┬────────┘ └──────────┬───────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ Real-time │
│ │ student model │
│ │ (fine-tuned 8B) │
│ │ ~2% of traffic │
│ └──────────┬───────┘
│ │
└────────────────────┬──────────────────────┘
▼
┌───────────────┐
│ Live traffic │
└───────────────┘
Traffic routes on cache hit/miss. Miss → real-time student.
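The routing logic in the diagram reduces to a cache-first lookup with the student as the miss path. A minimal sketch, with hypothetical names (`cache`, `student_model`) and a dict standing in for the head-tag store:

```python
def serve(query, cache, student_model):
    """Cache-first routing: exact-match head lookup, real-time student on miss."""
    hit = cache.get(query)           # precomputed teacher output for head queries
    if hit is not None:
        return hit                   # near-zero latency, zero inference cost
    return student_model(query)      # real-time inference for the long tail

# Toy instance: dict as the head cache, a stub student model.
cache = {"milk": ["PRODUCT:milk"]}
student = lambda q: [f"PRODUCT:{q.split()[-1]}"]

print(serve("milk", cache, student))              # head: served from cache
print(serve("organic oat milk", cache, student))  # tail: student inference
```

Note the trigger is an exact-match lookup, not a confidence score — which is what distinguishes this routing from the cheap-approximator-with-expensive-fallback pattern discussed below.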
Canonical wiki instance: Instacart SRL¶
Instacart's Intent Engine SRL system is the canonical wiki reference (Source: sources/2025-11-13-instacart-building-the-intent-engine):
- Teacher pipeline: offline RAG (conversion data + catalog + brand-similarity embeddings) + frontier LLM + post-processing guardrail → tagged head queries.
- Cache: head-query tag lookup; serves ~98% of queries.
- Student: LoRA-fine-tuned Llama-3-8B trained on the teacher pipeline's output; served at ~300 ms on H100 via adapter merge.
- Cache-miss share: ~2% of queries.
- Production impact: 6% scroll-depth reduction, 50% reduction in user complaints, millions of cold-start queries served weekly.
The economic crux is the 2% number: "smart caching meant only 2% of queries needed real-time inference." The fine-tuned 8B only pays its serving cost on that 2% — a ~50× cost reduction vs. routing all queries through it.
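The ~50× figure falls straight out of the miss rate. A back-of-envelope check, normalising the student's per-query serving cost to 1 and treating cache hits as free at the margin:

```python
miss_rate = 0.02                 # share of queries that miss the head cache
cache_hit_cost = 0.0             # cache hits: no per-request inference cost

all_student = 1.0                                             # every query through the 8B
hybrid = miss_rate * 1.0 + (1 - miss_rate) * cache_hit_cost   # only misses pay

print(round(all_student / hybrid))  # 50 — the ~50x serving-cost reduction
```

The same arithmetic shows why the split ratio is the economic lever: at an 80/20 split the reduction is only 5×, which may no longer justify the offline pipeline.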
Comparison to adjacent patterns¶
- patterns/traffic-aware-prerendering — the web/CDN analogue. Pre-render the head of URLs with an expensive build; serve the long tail dynamically. Same power-law traffic shape, different artefact.
- patterns/cheap-approximator-with-expensive-fallback — inverted polarity. Cheap approximator runs first, expensive model on high uncertainty. Head-cache-plus-tail uses the cache first, cheap student on miss — a different trigger (exact-match lookup) and a different cost shape (the expensive one ran offline, not in the request path).
- patterns/teacher-student-model-compression — the training half of this pattern. Head-cache-plus-tail includes teacher-student compression but adds the dual-purpose cache + the routing split + the head/tail traffic economics.
- patterns/llm-extraction-cache-by-similarity — adjacent: cache keyed by approximate similarity instead of exact match. Useful when head cardinality is too high for exact caching; orthogonal to the head/tail split discussed here.
Trade-offs¶
- Cache freshness is a product decision. Head queries' tags may drift when the catalog changes or seasonality shifts. TTL, refresh cadence, and invalidation on catalog events are all tuning knobs.
- Head/tail split is a business-facing assumption. The 98/2 split is Instacart's; a different workload may sit at 80/20, 70/30, or even 50/50. Measure before designing.
- Tail quality is bounded by the student. Investing in a better offline teacher improves head cache and student training data — but the tail only sees the student's approximation. If the teacher-student quality gap is too large, tail quality lags unacceptably.
- Offline pipeline → cache invalidation is a data-eng problem. The pipeline needs a complete-and-deploy-atomically discipline to avoid serving half-updated caches.
- Monitoring discipline: head and tail have different SLOs. Head is cache-hit latency (should be ~0 ms); tail is student-inference latency (measured, e.g., p50 / p95). Alerting on "overall latency" blurs the two.
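The complete-and-deploy-atomically discipline can be sketched as a snapshot swap: build the full new cache off to the side, validate it, then flip a single reference so readers only ever see a complete version. An illustrative in-process sketch (the class and validation rule are assumptions; production systems do the same dance with e.g. a key-prefix or manifest-pointer flip):

```python
class VersionedCache:
    """Head cache deployed as whole snapshots, never updated in place.

    Readers always observe either the old complete snapshot or the new
    one — never a half-updated mix.
    """
    def __init__(self):
        self._live = {}

    def get(self, key):
        return self._live.get(key)

    def deploy(self, new_snapshot):
        # Validate completeness BEFORE the swap, not after.
        if not new_snapshot:
            raise ValueError("refusing to deploy an empty snapshot")
        self._live = dict(new_snapshot)  # single reference assignment = atomic flip

cache = VersionedCache()
cache.deploy({"milk": ["PRODUCT:milk"]})
print(cache.get("milk"))
```

The same structure also makes rollback trivial: keep the previous snapshot and flip the reference back.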
Caveats¶
- Dual-use teacher output is not free. Designing the offline pipeline so its output is both production-cacheable AND training-data-grade requires discipline — the artefacts need schema validity, versioning, and dedup guarantees appropriate to both uses.
- Precision-over-recall posture on the student. Instacart explicitly ships higher precision / lower recall for SRL because a confident-wrong tag is worse than a missing tag in retrieval context. Per-workload polarity needs reconfirming.
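The precision-over-recall posture typically lands as a confidence threshold on the student's output: emit only tags the model is sure about, accepting missed tags in exchange for fewer confident-wrong ones. A minimal sketch — the threshold value, tag format, and scoring are illustrative assumptions, not Instacart's actual mechanism:

```python
def emit_tags(scored_tags, threshold=0.9):
    """Keep only tags at or above a confidence threshold (precision over recall).

    A confident-wrong tag degrades retrieval more than a missing tag, so
    low-confidence predictions are dropped rather than served.
    """
    return [tag for tag, score in scored_tags if score >= threshold]

print(emit_tags([("PRODUCT:milk", 0.97), ("BRAND:oatly", 0.55)]))  # ['PRODUCT:milk']
```

Raising the threshold trades recall for precision; the right setting depends on how the downstream consumer punishes wrong versus missing tags, which is why the polarity needs per-workload reconfirmation.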
Seen in¶
- sources/2025-11-13-instacart-building-the-intent-engine — canonical reference; hybrid cache + Llama-3-8B LoRA student at Instacart for production SRL with 98/2 head/tail split and explicit economic numbers.
Related¶
- patterns/offline-teacher-online-student-distillation — training-half companion pattern
- patterns/teacher-student-model-compression — more general model-compression shape
- patterns/cheap-approximator-with-expensive-fallback — inverted polarity sibling
- patterns/traffic-aware-prerendering — web-rendering analogue
- patterns/llm-extraction-cache-by-similarity — approximate-match cache variant
- concepts/query-understanding / concepts/semantic-role-labeling / concepts/long-tail-query
- concepts/power-law-url-traffic — web-side analogue
- concepts/lora-low-rank-adaptation / concepts/adapter-merging — student-serving levers
- systems/instacart-intent-engine
- companies/instacart