PATTERN
Three-phase LLM productionization¶
Intent¶
Go from "we want to try an LLM for task X" to "LLM-powered task X serves all production traffic cost-effectively" via a disciplined three-phase playbook:
- Formulation — prototype with the strongest available LLM, iterate the prompt / task definition / output schema.
- Proof of Concept — pre-compute expensive-LLM output for the traffic head via a cache; run offline + online (A/B) evaluations at that coverage.
- Scaling Up — curate a golden dataset from the expensive-LLM outputs; fine-tune a smaller model for offline-batch serving; optionally fine-tune an even smaller model for the realtime tail.
Each phase has a defined exit criterion, and each phase's artefact is the input to the next.
Canonical wiki instance: Yelp Query Understanding¶
Canonical reference: Yelp's 2025-02-04 post (sources/2025-02-04-yelp-search-query-understanding-with-llms). The post organises the entire retrospective around this exact three-phase process, applied to two running examples (query segmentation + review-highlight phrase expansion).
Phase 1 — Formulation¶
Goals per Yelp:
"(1) determine if an LLM is the appropriate tool for the problem, (2) define the ideal scope and output format for the task, and (3) assess the feasibility of combining multiple tasks into a single prompt."
What happens:
- Prototype with the strongest available LLM (GPT-4 at Yelp's ingest; o1-preview / o1-mini for logical-reasoning-heavy tasks).
- Iterate the prompt against real example inputs.
- Decide the output schema — match it to downstream consumers, not internal training-data taxonomies (Yelp collapsed legacy topic sub-classes into a single topic tag because the downstream applications don't need the distinctions).
- Decide whether to fuse tasks into one prompt. Yelp fused spell-correction + segmentation; left review-highlight expansion separate.
- Decide on RAG side-inputs that ground the task — Yelp added "business names viewed for query" to segmentation; added "top business categories" to review-highlight.
Exit criterion: prompt + schema ready for cache-backed POC.
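The Phase-1 loop — prompt, strong model, strict output schema — can be sketched as a validate-before-trust harness. A minimal sketch; the prompt wording, tag set, and schema below are illustrative, not Yelp's actual ones, and a canned string stands in for the expensive-model API call:

```python
import json

# Hypothetical tag set for a query-segmentation task (illustrative only).
ALLOWED_TAGS = {"topic", "name", "location", "time", "question"}

def validate_segmentation(raw: str, query: str) -> list[dict]:
    """Check the LLM's JSON output against the schema before trusting it."""
    segments = json.loads(raw)["segments"]
    for seg in segments:
        if seg["tag"] not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag: {seg['tag']}")
    # The tagged spans must reconstruct the original query exactly.
    if " ".join(s["text"] for s in segments) != query:
        raise ValueError("segments do not cover the query")
    return segments

# In Phase 1, `raw` would come from the strongest available model;
# a canned response stands in for the API call here.
raw = ('{"segments": [{"text": "dentist", "tag": "topic"},'
       ' {"text": "near me", "tag": "location"}]}')
segments = validate_segmentation(raw, "dentist near me")
```

Iterating the prompt then means feeding real example inputs through this harness and tightening the schema until validation failures disappear.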
Phase 2 — Proof of Concept¶
What happens:
- Exploit the query-frequency power-law to cover most of live traffic by pre-computing expensive-LLM output for head queries above a frequency threshold.
- Integrate the cache into the existing production pipeline.
- Run offline evaluations against human-labeled datasets (name-match, location-intent, etc.).
- Run online A/B tests with the cached path enabled on a traffic slice.
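The head-caching mechanism in the steps above can be sketched directly: pick the smallest set of distinct queries covering a target fraction of traffic, pre-compute the expensive-LLM output for them, and let everything else fall through to the status quo. A minimal sketch with hypothetical function names:

```python
from collections import Counter

def head_queries(query_log: list[str], coverage: float = 0.9) -> set[str]:
    """Smallest set of distinct queries whose cumulative frequency
    reaches `coverage` of total traffic (the power-law head)."""
    counts = Counter(query_log)
    total = sum(counts.values())
    head, covered = set(), 0
    for query, count in counts.most_common():
        if covered / total >= coverage:
            break
        head.add(query)
        covered += count
    return head

def build_cache(head: set[str], expensive_llm) -> dict[str, str]:
    """Pre-compute: run the expensive LLM once per head query."""
    return {q: expensive_llm(q) for q in head}

def serve(query: str, cache: dict[str, str], status_quo) -> str:
    """Cached path for head queries; tail falls through to status quo."""
    return cache.get(query) or status_quo(query)
```

The A/B test then compares `serve` with the cache enabled against `status_quo` alone on a traffic slice.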
Yelp's specific POCs:
- Segmentation: offline wins on name-match and location-intent labeled datasets; online wins via implicit location rewrite using the {location} tag + token probabilities.
- Review highlights: online A/B showed Session / Search CTR lift, "higher for less common queries in the tail."
Exit criterion: online A/B wins justify the scaling-up investment.
Phase 3 — Scaling Up¶
Yelp's explicit five-step sub-process:
- Iterate on the prompt using the "expensive" model (GPT-4/o1) ... tracking query-level metrics to find queries that have nontrivial traffic and whose metric is obviously worse than the status quo.
- Create a golden dataset for fine-tuning smaller models. We ran the GPT-4 prompt on a representative sample of input queries. The sample size should be large (but not unmanageably so, since quality > quantity) and it should cover a diverse distribution of inputs.
- Improve the quality of the dataset, if possible, prior to using it for fine-tuning. With hard work here, it can be possible (for many tasks) to improve upon GPT-4's raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal.
- Fine-tune a smaller model (GPT-4o-mini) that we can run offline at the scale of tens of millions, and utilize this as a pre-computed cache ... up to a 100x savings in cost.
- Optionally, fine-tune an even smaller model that is less expensive and fast (to run in real time, only for long-tail queries). Specifically, at Yelp, we have used BERT and T5 to serve as our real-time LLM model.
Exit criterion: 100% traffic coverage at controlled cost.
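Steps 2–3 of the sub-process — teacher sampling plus curation — can be sketched as a filter over teacher outputs before they become fine-tuning data. A minimal sketch; the `confidence` heuristic and the JSONL prompt/completion shape are assumptions (the exact training-file format varies by fine-tuning provider):

```python
import json

def build_golden_dataset(sample_queries, teacher, confidence):
    """Run the teacher (expensive model) over a representative sample,
    then flag likely mislabels for human re-labeling or removal."""
    records = [
        {"query": q, "label": teacher(q), "suspect": confidence(q, teacher(q)) < 0.5}
        for q in sample_queries
    ]
    keep = [r for r in records if not r["suspect"]]
    review = [r for r in records if r["suspect"]]  # route to humans
    return keep, review

def to_finetune_jsonl(records) -> str:
    """Emit prompt/completion pairs in a JSONL shape fine-tuning
    APIs typically expect (details vary by provider)."""
    return "\n".join(
        json.dumps({"prompt": r["query"], "completion": r["label"]})
        for r in records
    )
```

The curation step is where the student's quality ceiling can rise above the raw teacher: suspect records are re-labeled or dropped rather than trained on.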
Shape¶
┌──────────────────────────────────┐
│ Phase 1 — Formulation │
│ ───────────────────────────── │
│ prototype with strongest LLM │
│ decide schema, task fusion, RAG │
└──────────────┬───────────────────┘
│
prompt + schema
│
▼
┌──────────────────────────────────┐
│ Phase 2 — Proof of Concept │
│ ───────────────────────────── │
│ cache head with expensive LLM │
│ offline eval + A/B test │
└──────────────┬───────────────────┘
│
offline + online A/B wins
│
▼
┌──────────────────────────────────┐
│ Phase 3 — Scaling Up │
│ ───────────────────────────── │
│ golden dataset + curation │
│ fine-tune smaller model │
│ batch at scale, realtime tail │
└──────────────┬───────────────────┘
│
production serving, 100% traffic
Why the three-phase shape matters¶
Each phase de-risks the next:
- Phase 1 de-risks "will an LLM work at all for this task?" at minimum cost — no caching, no pipeline integration.
- Phase 2 de-risks "does the LLM-enabled pipeline move real production metrics?" — via the cheap coverage mechanism of head-caching, not full deployment.
- Phase 3 de-risks "can the cost be bounded at full production scale?" — via teacher-to-student distillation.
Skipping Phase 2 (going straight from prompt to full-scale fine-tuning) risks investing heavily in a model that A/B tests might have rejected. Skipping Phase 3 (keeping the expensive LLM in the live path at full traffic) risks cost blowout.
Relationship to other patterns¶
- patterns/head-cache-plus-tail-finetuned-model — the serving-architecture shape this pattern converges to at the end of Phase 3. Three-phase productionization produces a head-cache-plus-tail system.
- patterns/offline-teacher-online-student-distillation — the training-pipeline shape for Phase 3. Three-phase productionization uses teacher-student distillation as its Phase-3 mechanism.
- patterns/complexity-tiered-model-selection — adjacent routing pattern. Three-phase productionization targets a single task and produces a cascade; complexity-tiered selection targets the routing axis within a serving system.
Tradeoffs / gotchas¶
- Phase 2 A/B wins can be deceptive. Head-cache coverage that's heavily skewed toward the easiest queries can inflate the A/B delta relative to what full coverage would deliver. Check A/B wins on tail queries specifically before trusting them.
- Golden-dataset curation is load-bearing in Phase 3. Yelp: "With hard work here, it can be possible (for many tasks) to improve upon GPT-4's raw output." Without curation, the student's quality ceiling is the raw teacher — which has known failure modes the curation step removes.
- Prompt drift across phases. The prompt that works in Phase 1 may need reshaping for Phase 3 distillation; the student model's capacity constraints can break input/output assumptions the teacher handled implicitly.
- The sub-process assumes OpenAI-style model progression. The Yelp post names GPT-4 → GPT-4o-mini + BERT/T5 — the specific models are replaceable (e.g. Instacart uses Llama-3-8B as its student); the structure is model-agnostic.
- Realtime tail model is optional. If the head-cache covers enough traffic (Yelp review-highlights: 95%), a simple heuristic fallback may suffice. Yelp ships both BERT/T5 realtime and fallback heuristics depending on the task.
- Phase 2 and Phase 3 can overlap. At Yelp, the GPT-4 prompt continues to be iterated throughout Phase 3 (sub-step 1) — Phase 3 doesn't stop needing the teacher.
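The first gotcha above — verifying A/B wins on tail queries, not just overall — amounts to bucketing the metric delta by query frequency. A minimal sketch, assuming a hypothetical event shape of (query, variant, clicked) tuples and a frequency cutoff separating head from tail:

```python
from collections import Counter
from statistics import mean

def ab_delta_by_bucket(events, query_log, head_cutoff=100):
    """Split the A/B click-rate delta into head vs tail buckets so a
    win driven purely by easy head queries is visible."""
    freq = Counter(query_log)
    rates = {}
    for bucket in ("head", "tail"):
        for variant in ("control", "treatment"):
            clicks = [
                clicked for q, v, clicked in events
                if v == variant
                and (freq[q] >= head_cutoff) == (bucket == "head")
            ]
            rates[(bucket, variant)] = mean(clicks) if clicks else 0.0
    return {
        b: rates[(b, "treatment")] - rates[(b, "control")]
        for b in ("head", "tail")
    }
```

A large head delta next to a flat or negative tail delta is the warning sign: the cached path is winning where winning was easy.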
Seen in¶
- sources/2025-02-04-yelp-search-query-understanding-with-llms — canonical wiki reference; first-party three-phase productionisation playbook applied to query segmentation + spell correction and review-highlight phrase expansion.
Related¶
- patterns/head-cache-plus-tail-finetuned-model — serving-architecture counterpart
- patterns/offline-teacher-online-student-distillation — Phase-3 training-pipeline shape
- patterns/complexity-tiered-model-selection — adjacent routing pattern
- concepts/llm-cascade — adjacent cost-routing primitive
- concepts/query-frequency-power-law-caching — Phase 2's caching substrate
- systems/yelp-query-understanding — canonical production instance
- companies/yelp