PATTERN

Three-phase LLM productionization

Intent

Go from "we want to try an LLM for task X" to "LLM-powered task X serves all production traffic cost-effectively" via a disciplined three-phase playbook:

  1. Formulation — prototype with the strongest available LLM, iterate the prompt / task definition / output schema.
  2. Proof of Concept — pre-compute expensive-LLM output for the traffic head via a cache; run offline + online (A/B) evaluations at that coverage.
  3. Scaling Up — curate a golden dataset from the expensive-LLM outputs; fine-tune a smaller model for offline-batch serving; optionally fine-tune an even smaller model for the realtime tail.

Each phase has a defined exit criterion, and each phase's artefact is the input to the next.

Canonical wiki instance: Yelp Query Understanding

Canonical reference: Yelp's 2025-02-04 post (sources/2025-02-04-yelp-search-query-understanding-with-llms). The post organises the entire retrospective around this exact three-phase process, applied to two running examples (query segmentation + review-highlight phrase expansion).

Phase 1 — Formulation

Goals per Yelp:

"(1) determine if an LLM is the appropriate tool for the problem, (2) define the ideal scope and output format for the task, and (3) assess the feasibility of combining multiple tasks into a single prompt."

What happens:

  • Prototype with the strongest available LLM (GPT-4 at Yelp's ingest; o1-preview / o1-mini for logical-reasoning-heavy tasks).
  • Iterate the prompt against real example inputs.
  • Decide the output schema — match it to downstream consumers, not internal training-data taxonomies (Yelp collapsed legacy topic sub-classes into a single topic tag because the downstream applications don't need the distinctions).
  • Decide whether to fuse tasks into one prompt. Yelp fused spell-correction + segmentation; left review-highlight expansion separate.
  • Decide on RAG side-inputs that ground the task — Yelp added "business names viewed for query" to segmentation; added "top business categories" to review-highlight.

Exit criterion: prompt + schema ready for cache-backed POC.
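Phase 1's deliverable — a prompt plus an agreed output schema — can be made concrete with a small harness. A hypothetical sketch (the prompt wording, key names, and tag set are illustrative assumptions, not Yelp's actual schema): the prompt carries the RAG side-input, and a validator enforces the schema before anything moves to the POC.

```python
import json

# Hypothetical Phase 1 prototype harness. The model call itself is out of
# scope (any strong LLM would do); what matters is that the prompt embeds
# the RAG side-input and the output is validated against the agreed schema.

PROMPT_TEMPLATE = """Spell-correct and segment the search query.
Query: {query}
Business names viewed for this query: {viewed_businesses}
Return JSON with keys: corrected, segments (a list of {{text, tag}} objects),
where tag is one of: topic, name, location, question, none."""

REQUIRED_KEYS = {"corrected", "segments"}
VALID_TAGS = {"topic", "name", "location", "question", "none"}  # illustrative

def validate_output(raw: str) -> dict:
    """Reject model responses that drift from the schema downstream expects."""
    out = json.loads(raw)
    assert REQUIRED_KEYS <= out.keys(), f"missing keys: {REQUIRED_KEYS - out.keys()}"
    for seg in out["segments"]:
        assert seg["tag"] in VALID_TAGS, f"unknown tag: {seg['tag']}"
    return out

# A response that passes validation:
raw = ('{"corrected": "thai food near me", "segments": '
       '[{"text": "thai food", "tag": "topic"}, '
       '{"text": "near me", "tag": "location"}]}')
parsed = validate_output(raw)
```

Iterating the prompt against real inputs while keeping this validator fixed is what makes the Phase 1 exit criterion checkable.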

Phase 2 — Proof of Concept

What happens:

  • Exploit the query-frequency power-law to cover most of live traffic by pre-computing expensive-LLM output for head queries above a frequency threshold.
  • Integrate the cache into the existing production pipeline.
  • Run offline evaluations against human-labeled datasets (name-match, location-intent, etc.).
  • Run online A/B tests with the cached path enabled on a traffic slice.
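The head-caching step above reduces to a coverage computation over the query-frequency distribution. A toy sketch (`head_queries` and the coverage target are illustrative assumptions, not Yelp's implementation):

```python
from collections import Counter

def head_queries(query_log, target_coverage=0.95):
    """Pick the smallest set of head queries whose combined frequency
    covers `target_coverage` of traffic; these get pre-computed
    expensive-LLM output, everything else falls through."""
    counts = Counter(query_log)
    total = sum(counts.values())
    head, covered = [], 0
    for query, n in counts.most_common():
        if covered / total >= target_coverage:
            break
        head.append(query)
        covered += n
    return head

# Skewed toy log: one query dominates, tail queries appear once.
log = ["pizza"] * 90 + ["sushi"] * 8 + ["vegan ramen", "dog groomer"]
print(head_queries(log, 0.95))  # ['pizza', 'sushi'] — 98% of this toy traffic
```

The power-law skew is what makes this cheap: a short list of distinct queries buys most of the live-traffic coverage the A/B test needs.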

Yelp's specific POCs:

  • Segmentation: offline wins on name-match and location-intent labeled datasets; online wins via implicit location rewrite using the {location} tag + token probabilities.
  • Review highlights: online A/B showed Session / Search CTR lift, "higher for less common queries in the tail."

Exit criterion: online A/B wins justify the scaling-up investment.

Phase 3 — Scaling Up

Yelp's explicit five-step sub-process:

  1. Iterate on the prompt using the 'expensive' model (GPT-4/o1) ... tracking query-level metrics to find queries that have nontrivial traffic and whose metric is clearly worse than the status quo.
  2. Create a golden dataset for fine-tuning smaller models. Yelp ran the GPT-4 prompt on a representative sample of input queries. The sample should be large (but not unmanageably so, since quality > quantity) and should cover a diverse distribution of inputs.
  3. Improve the quality of the dataset, if possible, before using it for fine-tuning. With hard work here, it can be possible (for many tasks) to improve upon GPT-4's raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal.
  4. Fine-tune a smaller model (GPT-4o-mini) that can run offline at the scale of tens of millions of queries, and use it as a pre-computed cache ... up to a 100x cost saving.
  5. Optionally, fine-tune an even smaller model that is cheaper and fast enough to run in real time, for long-tail queries only. Specifically, Yelp has used BERT and T5 as its realtime LLM models.
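Steps 4-5 imply a simple serving rule: pre-computed cache hit for the head, realtime student (or heuristic fallback) for the tail. A minimal sketch with stand-in names — the cache contents and fallback here are placeholders, not Yelp's code:

```python
def resolve(query, precomputed_cache, realtime_student):
    """Serving sketch for steps 4-5: offline-batch cache for head
    queries, cheap fine-tuned realtime model for the long tail."""
    hit = precomputed_cache.get(query)
    if hit is not None:
        return hit                   # step 4: batch-precomputed, ~100x cheaper
    return realtime_student(query)   # step 5: optional realtime tail model

cache = {"thai food": {"segments": [("thai food", "topic")]}}
fallback = lambda q: {"segments": [(q, "none")]}  # stand-in for a BERT/T5 student
print(resolve("thai food", cache, fallback))
print(resolve("yak cheese shop", cache, fallback))
```

Whether the tail path is a fine-tuned model or a heuristic is a per-task choice, as noted in the tradeoffs below.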

Exit criterion: 100% traffic coverage at controlled cost.

Shape

  ┌──────────────────────────────────┐
  │  Phase 1 — Formulation           │
  │  ─────────────────────────────   │
  │  prototype with strongest LLM    │
  │  decide schema, task fusion, RAG │
  └──────────────┬───────────────────┘
            prompt + schema
  ┌──────────────────────────────────┐
  │  Phase 2 — Proof of Concept      │
  │  ─────────────────────────────   │
  │  cache head with expensive LLM   │
  │  offline eval + A/B test         │
  └──────────────┬───────────────────┘
       offline + online A/B wins
  ┌──────────────────────────────────┐
  │  Phase 3 — Scaling Up            │
  │  ─────────────────────────────   │
  │  golden dataset + curation       │
  │  fine-tune smaller model         │
  │  batch at scale, realtime tail   │
  └──────────────┬───────────────────┘
   production serving, 100% traffic

Why the three-phase shape matters

Each phase de-risks the next:

  • Phase 1 de-risks "will an LLM work at all for this task?" at minimum cost — no caching, no pipeline integration.
  • Phase 2 de-risks "does the LLM-enabled pipeline move real production metrics?" — via the cheap coverage mechanism of head-caching, not full deployment.
  • Phase 3 de-risks "can the cost be bounded at full production scale?" — via teacher-to-student distillation.

Skipping Phase 2 (going straight from prompt to full-scale fine-tuning) risks investing heavily in a model that A/B tests might have rejected. Skipping Phase 3 (keeping the expensive LLM in the live path at full traffic) risks cost blowout.

Tradeoffs / gotchas

  • Phase 2 A/B wins can be deceiving. Head-cache coverage that's heavily skewed toward the easiest queries can inflate the A/B delta relative to what full coverage would deliver. Check A/B wins on tail queries specifically before trusting them.
  • Golden-dataset curation is load-bearing in Phase 3. Yelp: "With hard work here, it can be possible (for many tasks) to improve upon GPT-4's raw output." Without curation, the student's quality ceiling is the raw teacher — which has known failure modes the curation step removes.
  • Prompt drift across phases. The prompt that works in Phase 1 may need reshaping for Phase 3 distillation; the student model's capacity constraints can break input/output assumptions the teacher handled implicitly.
  • The sub-process assumes OpenAI-style model progression. The Yelp post names GPT-4 → GPT-4o-mini + BERT/T5 — the specific models are replaceable (e.g. Instacart uses Llama-3-8B as its student); the structure is model-agnostic.
  • Realtime tail model is optional. If the head-cache covers enough traffic (Yelp review-highlights: 95%), a simple heuristic fallback may suffice. Yelp ships both BERT/T5 realtime and fallback heuristics depending on the task.
  • Phase 2 and Phase 3 can overlap. At Yelp, the GPT-4 prompt continues to be iterated throughout Phase 3 (sub-step 1) — Phase 3 doesn't stop needing the teacher.
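The first gotcha above (head-skewed A/B deltas) suggests slicing the A/B readout by head/tail bucket rather than trusting the aggregate. A toy sketch, with made-up `(query, arm, clicked)` session tuples:

```python
def ab_delta_by_bucket(sessions, head_set):
    """Report the B-minus-A CTR delta separately for head and tail
    queries, so a head-skewed cache can't hide a flat tail effect."""
    buckets = {"head": {"A": [], "B": []}, "tail": {"A": [], "B": []}}
    for query, arm, clicked in sessions:
        bucket = "head" if query in head_set else "tail"
        buckets[bucket][arm].append(clicked)

    def ctr(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {b: ctr(arms["B"]) - ctr(arms["A"]) for b, arms in buckets.items()}

sessions = [
    ("pizza", "A", 1), ("pizza", "B", 1),                      # head: no lift
    ("yak cheese shop", "A", 0), ("yak cheese shop", "B", 1),  # tail: lift
]
print(ab_delta_by_bucket(sessions, head_set={"pizza"}))
```

This mirrors Yelp's own observation that the review-highlights CTR lift was "higher for less common queries in the tail" — a pattern only visible with a bucketed readout.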
