GPT-4o-mini¶
Definition¶
GPT-4o-mini is OpenAI's compact, cost-optimised variant of GPT-4o, released 2024-07-18 with fine-tuning support. It is the smaller, cheaper sibling to GPT-4o, positioned for high-volume production workloads.
Wiki anchor¶
The wiki's canonical anchor for GPT-4o-mini is its role as the offline-batch serving student in production LLM pipelines, canonicalised by the 2025-02-04 Yelp post (sources/2025-02-04-yelp-search-query-understanding-with-llms).
Yelp's canonical disclosure: "Fine tune a smaller model (GPT4o-mini) that we can run offline at the scale of tens of millions, and utilize this as a pre-computed cache to support that vast bulk of all traffic. Because fine-tuned query understanding models only require very short inputs and outputs, we have seen up to a 100x savings in cost, compared to using a complex GPT-4 prompt directly."
The operational datum — up to ~100× cost reduction vs. a direct GPT-4 prompt at equivalent quality on query-understanding tasks — is the wiki's load-bearing number for GPT-4o-mini fine-tuning at production scale.
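The arithmetic behind that number can be sketched with back-of-the-envelope token accounting. The per-million-token prices and token counts below are illustrative placeholders chosen to reproduce the ~100× ratio, not current OpenAI pricing:

```python
# Illustrative cost comparison: a complex GPT-4 prompt (long few-shot
# instructions, verbose output) vs. a fine-tuned GPT-4o-mini whose
# "instructions" live in the weights, so input is just the short query.
# All prices are hypothetical $/1M tokens, for illustration only.
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one request at the given per-1M-token prices."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Complex GPT-4 prompt: ~2000 input tokens of instructions + examples.
gpt4_cost = request_cost(in_tokens=2000, out_tokens=200,
                         in_price=30.0, out_price=60.0)

# Fine-tuned GPT-4o-mini: short raw query in, compact label out.
mini_cost = request_cost(in_tokens=120, out_tokens=60,
                         in_price=3.0, out_price=6.0)

ratio = gpt4_cost / mini_cost  # ≈ 100x with these illustrative numbers
```

The dominant terms are the input-token counts: once the instruction preamble is distilled into the fine-tuned weights, the paid-for input shrinks by an order of magnitude, and the cheaper per-token price supplies the rest of the gap.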
Production patterns¶
- Teacher-student distillation: trained on the GPT-4-generated + human-curated golden dataset. See patterns/offline-teacher-online-student-distillation.
- Offline batch serving via OpenAI batch API: Yelp pre-computes query-understanding responses at tens-of-millions scale via batch; live traffic serves from the resulting cache. See patterns/head-cache-plus-tail-finetuned-model.
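The serving pattern above can be sketched as follows. The batch request shape follows the OpenAI Batch API's `/v1/chat/completions` JSONL format; the helper names, prompt, and cache-key normalization are illustrative, not Yelp's implementation:

```python
import json

def build_batch_file(queries, path="batch_input.jsonl"):
    """Write a Batch-API-style JSONL file for offline pre-computation.

    In practice the model field would be a fine-tuned model ID
    (e.g. ft:gpt-4o-mini-...); "gpt-4o-mini" here is a placeholder.
    """
    with open(path, "w") as f:
        for i, q in enumerate(queries):
            req = {
                "custom_id": f"query-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": q}],
                    "max_tokens": 32,  # short outputs keep batch cost low
                },
            }
            f.write(json.dumps(req) + "\n")
    return path

def serve(query, cache, live_model_call):
    """Serve head traffic from the pre-computed cache.

    Only unseen (tail) queries fall through to a live model call.
    """
    key = query.strip().lower()  # illustrative normalization
    if key in cache:
        return cache[key]
    return live_model_call(query)
```

The design point is that the cache absorbs the vast bulk of traffic, so the live fallback path only needs to handle the long tail of novel queries.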
Tradeoffs¶
- Fine-tuning requires a high-quality golden dataset; quality curation (isolating + re-labeling mislabeled inputs) is load-bearing.
- Short input + short output is a pre-condition for the 100× cost-reduction figure — longer contexts narrow the gap vs. the full GPT-4 model.
Seen in¶
- sources/2025-02-04-yelp-search-query-understanding-with-llms — canonical wiki instance; offline-batch fine-tuned student for query understanding.
Related¶
- systems/gpt-4 — teacher
- concepts/llm-cascade — cost-routing pattern
- patterns/offline-teacher-online-student-distillation — the training-pipeline shape
- patterns/head-cache-plus-tail-finetuned-model — the serving-architecture shape