Continued pretraining
Definition
Continued pretraining (also called "continual pretraining" or "domain-adaptive pretraining") is the technique of taking an already-pretrained foundation model and running additional autoregressive language-modelling training on it with new data — typically a mix of domain-specific and general-domain corpora — to infuse new knowledge without rebuilding from scratch.
It sits between two adjacent techniques:
- From-scratch pretraining — train a new model on a custom corpus end-to-end. Maximum control; maximum time + resource cost.
- Fine-tuning / instruction-tuning / RLHF — supervised / preference-labeled training on relatively small datasets after pretraining; adjusts behaviour more than it infuses new knowledge.
Continued pretraining's target is knowledge infusion at the base-model level: the model after continued pretraining is still a base model (autoregressive next-token prediction), not yet a chatbot. Instruction tuning and RLHF typically follow.
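A minimal sketch of what that means mechanically, assuming PyTorch and Hugging Face transformers: the objective is the same next-token cross-entropy as base pretraining, just resumed on new data. The model name, learning rate, and example strings below are illustrative stand-ins, not the eBay setup.

```python
# One continued-pretraining step: standard causal-LM loss on a mixed batch.
# "gpt2" is a small stand-in for a Llama-class base model (assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # illustrative LR

# One mixed batch: a domain document plus a general-domain replay document.
batch = tokenizer(
    ["Vintage Omega Seamaster, automatic movement, box and papers included.",
     "The mitochondrion is the organelle that produces most cellular ATP."],
    return_tensors="pt", padding=True, truncation=True, max_length=256,
)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # exclude padding from the loss

loss = model(**batch, labels=labels).loss  # next-token cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```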
Canonical wiki reference
eBay's e-Llama (2025-01-17, Tier 3) is the canonical example in this wiki: Llama 3.1 8B + 70B → 1 trillion additional tokens on a 1:1 mix of eBay e-commerce data and general-domain replay data, max LR = 10% of Llama 3.1's max LR, cosine schedule, ~85k update steps, batch size ~11.8M tokens, on 480 H100 GPUs via Megatron-LM 3D parallelism. Outcome: ~25% English / ~30% non-English domain-benchmark gain, ~1% general-domain regression on the 70B. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
The two load-bearing knobs
eBay's small-scale sweeps for e-Llama surfaced two knobs that dominated continued-pretraining quality:
- Max learning rate relative to base pretraining. eBay's optimum: 10% of the original Llama 3.1 max LR. Rationale: a base model's weights already encode a usable distribution; too high an LR during continued pretraining moves those weights too aggressively, and the model forgets what it knew. 10% is a useful default anchor for other teams doing Llama-class continued pretraining, though any particular recipe should re-sweep (see the schedule sketch below).
- General-to-domain data sampling ratio. eBay's optimum: 1:1. Half the step budget goes to replay of general-domain data (curated / publicly available / open-source + smaller high-quality datasets, plus 10% non-English), half to the new e-commerce data. This ratio is the explicit control for the catastrophic-forgetting trade-off.
The combination of these two knobs is what lets eBay report large domain gains (~25–30%) with small general-domain regression (~1% on 70B). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
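A minimal sketch of knob one under stated assumptions: the source gives the 10% ratio, the cosine shape, and the ~85k steps; the base max LR value, the warmup length, and the decay floor below are illustrative guesses.

```python
import math

BASE_MAX_LR = 3e-4               # stand-in for Llama 3.1's max LR (assumption)
CPT_MAX_LR = 0.10 * BASE_MAX_LR  # eBay's reported optimum: 10% of base max LR
MIN_LR = 0.10 * CPT_MAX_LR       # decay floor: assumption, not in the source
TOTAL_STEPS = 85_000             # ~85k update steps, from the source
WARMUP_STEPS = 1_000             # warmup length: assumption, not in the source

def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay from CPT_MAX_LR down to MIN_LR."""
    if step < WARMUP_STEPS:
        return CPT_MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (CPT_MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```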
Why continued pretraining (vs the alternatives)
- vs from-scratch: orders of magnitude less time + compute. eBay's 1T tokens on 480 H100s over ~1 month is already a large investment, but still far smaller than Llama 3.1's full pretraining budget. For a domain with significant overlap with the base model's training distribution, continued pretraining reuses that overlap for free.
- vs fine-tuning only: fine-tuning typically doesn't shift base knowledge; it shifts behaviour, using relatively small labeled datasets. If the domain contains genuinely new knowledge (new taxonomies, new vocabulary, new distributional properties), fine-tuning alone will not reliably teach it.
- vs RAG at inference time: RAG is orthogonal — retrieval-augmented generation scales the knowledge pool at serving time but does not change what the base model has internalised. Continued pretraining changes the model's defaults; RAG changes the context. Most production stacks combine both.
Data pipeline shape
The eBay recipe (a sampling sketch follows the list):
- Domain data — filtered + serialized eBay listings + product reviews, formatted for autoregressive LM.
- Classifier-extracted domain subset — a classifier trained to identify "e-commerce-specific" text, used to mine additional e-commerce examples from a larger open-source corpus (expands the domain-side budget without over-reliance on pure eBay data, which may be narrow in style/coverage).
- Replay general data — curated / publicly available / open-source datasets plus smaller high-quality ones, drawn to be close to the Llama 3.1 base distribution (so the replay signal actually reinforces existing knowledge rather than teaching slightly different knowledge).
- Non-English slice — 10% of general data is non-English, to preserve / enhance multilingual capability as a side-objective.
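A sketch of the sampling side of this pipeline, assuming three pre-tokenized corpus slices; the function and argument names are hypothetical, and per-example (rather than per-token) sampling is a simplification, but the 1:1 and 10% ratios are eBay's reported numbers.

```python
import random

def sample_mixed_batch(domain_docs, general_en_docs, general_non_en_docs,
                       batch_size, rng=None):
    """Draw one batch at the reported ratios: half domain data, half
    general-domain replay, with 10% of the general half non-English."""
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        if rng.random() < 0.5:                    # domain half of the budget
            batch.append(rng.choice(domain_docs))
        elif rng.random() < 0.10:                 # 10% of general: non-English
            batch.append(rng.choice(general_non_en_docs))
        else:                                     # remaining general: English replay
            batch.append(rng.choice(general_en_docs))
    return batch
```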
Caveats
- Recipe is corpus-dependent. The 10% / 1:1 optimum is eBay's; a different domain (biomedical, legal, code, etc.) may have different optima. Always re-sweep at small scale.
- Base model choice is load-bearing. Continued pretraining inherits the base model's strengths and weaknesses; a weak base will not beat a good from-scratch recipe. Llama 3.1 is a strong base.
- Pair with monitoring for regression. eBay explicitly reports general-domain NLU regression as a tracked metric. Teams doing continued pretraining without general-domain eval are running blind to the forgetting cost.
- Instruction tuning + RLHF follow. The continued-pretrained model is still a base model; it produces next-token completions, not aligned answers.
- Small-scale sweep before the big run. eBay's posture: "We determine the optimal training setup through a series of experiments at a smaller scale." Continued pretraining at 480-GPU scale is expensive; small-scale sweeps find the LR + mix ratio before the compute commitment (a sweep sketch follows).
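What such a sweep could look like, as a sketch: the grid values, the proxy-run callback, and the naive scoring rule are all assumptions; the source says only that the optimal setup came from smaller-scale experiments.

```python
from itertools import product

LR_FRACTIONS = (0.05, 0.10, 0.25, 1.00)  # fraction of base-pretraining max LR
GENERAL_SHARES = (0.25, 0.50, 0.75)      # share of steps on general replay data

def run_sweep(train_and_eval):
    """train_and_eval(lr_frac, general_share) stands in for a short proxy run
    that returns (domain_gain, general_delta), both measured against the base
    model. Returns the best-scoring (lr_frac, general_share) pair."""
    scores = {}
    for lr_frac, general_share in product(LR_FRACTIONS, GENERAL_SHARES):
        domain_gain, general_delta = train_and_eval(lr_frac, general_share)
        # Naive scalarization: reward domain gain, charge general regression.
        scores[(lr_frac, general_share)] = domain_gain + general_delta
    return max(scores, key=scores.get)
```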
Seen in
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — canonical reference; full recipe for Llama 3.1 → e-Llama 8B + 70B: 1T tokens on 480 H100s.
Related
- concepts/catastrophic-forgetting — the failure mode continued pretraining has to manage.
- concepts/replay-training — the countermeasure.
- concepts/training-serving-boundary — continued pretraining lives firmly on the training side.
- concepts/3d-parallelism — the distributed-training mechanism that makes continued pretraining at 1T-token / 70B scale feasible.
- concepts/knowledge-distillation — adjacent model-adaptation technique; different mechanism (teacher-student loss vs new-data LM loss).
- systems/e-llama / systems/llama-3-1 / systems/megatron-lm
- patterns/continued-pretraining-for-domain-adaptation — the end-to-end pattern.
- companies/ebay