
Continued pretraining

Definition

Continued pretraining (also called "continual pretraining" or "domain-adaptive pretraining") is the technique of taking an already-pretrained foundation model and running additional autoregressive language-modelling training on it with new data — typically a mix of domain-specific and general-domain corpora — to infuse new knowledge without rebuilding from scratch.

It sits between two adjacent techniques:

  • From-scratch pretraining — train a new model on a custom corpus end-to-end. Maximum control; maximum time + resource cost.
  • Fine-tuning / instruction-tuning / RLHF — supervised / preference-labeled training on relatively small datasets after pretraining; adjusts behaviour more than it infuses new knowledge.

Continued pretraining's target is knowledge infusion at the base-model level: the model after continued pretraining is still a base model (autoregressive next-token prediction), not yet a chatbot. Instruction tuning and RLHF typically follow.

Canonical wiki reference

eBay's e-Llama (2025-01-17, Tier 3) is the canonical example in this wiki: Llama 3.1 8B + 70B → 1 trillion additional tokens on a 1:1 mix of eBay e-commerce data and general-domain replay data, max LR = 10% of Llama 3.1's max LR, cosine schedule, ~85k update steps, batch size ~11.8M tokens, on 480 H100 GPUs via Megatron-LM 3D parallelism. Outcome: ~25% English / ~30% non-English domain-benchmark gain, ~1% general-domain regression on the 70B. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
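The reported budget figures are mutually consistent, as a quick arithmetic check shows:

```python
# Sanity-check the e-Llama token budget reported above:
# total tokens ~= update steps x tokens per batch.
update_steps = 85_000          # ~85k update steps
batch_tokens = 11.8e6          # ~11.8M tokens per batch
total_tokens = update_steps * batch_tokens
print(f"{total_tokens:.3e}")   # ~1.003e12, i.e. ~1 trillion tokens
```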

The two load-bearing knobs

In eBay's small-scale sweeps for e-Llama, two knobs determined continued-pretraining quality:

  1. Max learning rate relative to base pretraining. eBay's optimum: 10% of the original Llama 3.1 max LR. Rationale: a base model's weights encode a usable distribution; a too-high LR during continued pretraining moves those weights too aggressively and the model forgets what it knew. 10% is a useful default anchor for other teams doing Llama-class continued pretraining, though any particular recipe should re-sweep.
  2. General-to-domain data sampling ratio. eBay's optimum: 1:1. Half the step budget goes to replay of general-domain data (curated / publicly available / open-source + smaller high-quality datasets, plus 10% non-English), half to the new e-commerce data. This ratio is the explicit control for the catastrophic-forgetting trade-off.

The combination of these two knobs is what lets eBay report large domain gains (~25–30%) with small general-domain regression (~1% on 70B). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
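Knob 1 can be sketched as a schedule function. A minimal illustration assuming a standard linear-warmup + cosine-decay schedule; the base max LR, warmup length, and decay floor below are illustrative assumptions, not values from the eBay report:

```python
import math

def cosine_lr(step, total_steps, max_lr, warmup_steps=2_000, min_lr_frac=0.1):
    """Linear warmup, then cosine decay to min_lr_frac * max_lr.
    Warmup length and floor are illustrative, not eBay's values."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return max_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)

base_max_lr = 3e-4               # hypothetical Llama-class pretraining max LR
cpt_max_lr = 0.10 * base_max_lr  # knob 1: 10% of the base model's max LR
```

The schedule itself is ordinary; the load-bearing choice is `cpt_max_lr` being a small fraction of the base model's peak LR.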

Why continued pretraining (vs the alternatives)

  • vs from-scratch: orders of magnitude less time + compute. eBay's 1T tokens on 480 H100s over ~1 month is already a large investment, but still a small fraction of Llama 3.1's full pretraining budget. For a domain that overlaps substantially with the base model's training distribution, continued pretraining reuses that overlap for free.
  • vs fine-tuning only: fine-tuning typically doesn't shift the base knowledge; it shifts behaviour on small labeled data. If the domain contains genuinely new knowledge (new taxonomies, new vocabulary, new distributional properties), fine-tuning alone will not reliably teach it.
  • vs RAG at inference time: RAG is orthogonal — retrieval-augmented generation scales the knowledge pool at serving time but does not change what the base model has internalised. Continued pretraining changes the model's defaults; RAG changes the context. Most production stacks combine both.

Data pipeline shape

The eBay recipe:

  • Domain data — filtered + serialized eBay listings + product reviews, formatted for autoregressive LM.
  • Classifier-extracted domain subset — a classifier trained to identify "e-commerce-specific" text, used to mine additional e-commerce examples from a larger open-source corpus (expands the domain-side budget without over-reliance on pure eBay data, which may be narrow in style/coverage).
  • Replay general data — curated / publicly available / open-source datasets plus smaller high-quality ones, drawn to be close to the Llama 3.1 base distribution (so the replay signal actually reinforces existing knowledge rather than teaching slightly different knowledge).
  • Non-English slice — 10% of general data is non-English, to preserve / enhance multilingual capability as a side-objective.
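The mix above can be sketched as a weighted sampler over token streams. A minimal illustration, assuming three stream iterators whose names and contents are hypothetical; a production pipeline would do this at the shard or batch level rather than per example:

```python
import random
from collections import Counter
from itertools import islice, repeat

def sample_stream(domain, general_en, general_non_en, rng=random.Random(0)):
    """Interleave domain and replay data at the 1:1 ratio described above,
    with 10% of the general half drawn from the non-English slice.
    Stream contents and the iterator interface are illustrative assumptions."""
    while True:
        if rng.random() < 0.5:        # knob 2: half the step budget is domain data
            yield next(domain)
        elif rng.random() < 0.10:     # 10% of general data is non-English
            yield next(general_non_en)
        else:
            yield next(general_en)

# Quick check of the realised mix over 100k draws:
mix = Counter(islice(sample_stream(repeat("domain"), repeat("general-en"),
                                   repeat("general-non-en")), 100_000))
```

Expected fractions: ~50% domain, ~45% English general, ~5% non-English general (10% of the general half).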

Caveats

  • Recipe is corpus-dependent. The 10% / 1:1 optimum is eBay's; a different domain (biomedical, legal, code, etc.) may have different optima. Always re-sweep at small scale.
  • Base model choice is load-bearing. Continued pretraining on a weak base will not beat from-scratch with a good recipe. Llama 3.1 is a strong base.
  • Pair with monitoring for regression. eBay explicitly reports general-domain NLU regression as a tracked metric. Teams doing continued pretraining without general-domain eval are running blind to the forgetting cost.
  • Instruction tuning + RLHF follow. The continued-pretrained model is still base; it produces next-token completions, not aligned answers.
  • Small-scale sweep before the big run. eBay's posture: "We determine the optimal training setup through a series of experiments at a smaller scale." Continued pretraining at 480-GPU scale is expensive; small-scale sweeps find the LR + mix ratio before the compute commitment.
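The sweep-then-commit posture amounts to a grid over the two load-bearing knobs. A sketch, where `train_and_eval`, the candidate values, and the returned scores are hypothetical stand-ins for short proxy runs; the grids merely bracket eBay's reported optima (10% LR, 1:1 mix):

```python
from itertools import product

# Hypothetical small-scale sweep over the two load-bearing knobs.
# train_and_eval stands in for a short proxy run that returns
# (domain_score, general_score), so forgetting is tracked alongside gains.
lr_fractions = [0.05, 0.10, 0.20, 0.50]  # fraction of the base model's max LR
mix_ratios = [(1, 3), (1, 1), (3, 1)]    # general : domain sampling ratio

def run_sweep(train_and_eval):
    return {(lr_frac, mix): train_and_eval(lr_frac, mix)
            for lr_frac, mix in product(lr_fractions, mix_ratios)}
```

Returning both a domain score and a general-domain score per config is the point: picking the config by domain gain alone would hide the forgetting cost the caveats above warn about.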
