
Continued pretraining

Definition

Continued pretraining (also called "continual pretraining" or "domain-adaptive pretraining") is the technique of taking an already-pretrained foundation model and running additional autoregressive language-modelling training on it with new data — typically a mix of domain-specific and general-domain corpora — to infuse new knowledge without rebuilding from scratch.

It sits between two adjacent techniques:

  • From-scratch pretraining — train a new model on a custom corpus end-to-end. Maximum control; maximum time + resource cost.
  • Fine-tuning / instruction-tuning / RLHF — supervised / preference-labeled training on relatively small datasets after pretraining; adjusts behaviour more than it infuses new knowledge.

Continued pretraining's target is knowledge infusion at the base-model level: the model after continued pretraining is still a base model (autoregressive next-token prediction), not yet a chatbot. Instruction tuning and RLHF typically follow.

Canonical wiki reference

eBay's e-Llama (2025-01-17, Tier 3) is the canonical example in this wiki: Llama 3.1 8B + 70B → 1 trillion additional tokens on a 1:1 mix of eBay e-commerce data and general-domain replay data, max LR = 10% of Llama 3.1's max LR, cosine schedule, ~85k update steps, batch size ~11.8M tokens, on 480 H100 GPUs via Megatron-LM 3D parallelism. Outcome: ~25% English / ~30% non-English domain-benchmark gain, ~1% general-domain regression on the 70B. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
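The reported budget figures are mutually consistent, as a quick arithmetic check shows:

```python
# Sanity-check the e-Llama token budget reported above:
# total tokens ~= update steps x tokens per batch.
update_steps = 85_000          # ~85k update steps
batch_tokens = 11.8e6          # ~11.8M tokens per batch
total_tokens = update_steps * batch_tokens
print(f"{total_tokens:.3e}")   # ~1.003e12, i.e. ~1 trillion tokens
```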

The two load-bearing knobs

In eBay's small-scale sweeps for e-Llama, two knobs determined continued-pretraining quality:

  1. Max learning rate relative to base pretraining. eBay's optimum: 10% of the original Llama 3.1 max LR. Rationale: a base model's weights encode a usable distribution; a too-high LR during continued pretraining moves those weights too aggressively and the model forgets what it knew. 10% is a useful default anchor for other teams doing Llama-class continued pretraining, though any particular recipe should re-sweep.
  2. General-to-domain data sampling ratio. eBay's optimum: 1:1. Half the step budget goes to replay of general-domain data (curated / publicly available / open-source + smaller high-quality datasets, plus 10% non-English), half to the new e-commerce data. This ratio is the explicit control for the catastrophic-forgetting trade-off.

The combination of these two knobs is what lets eBay report large domain gains (~25–30%) with small general-domain regression (~1% on 70B). (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
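Knob 1 can be sketched as a schedule function. A minimal illustration assuming a standard linear-warmup + cosine-decay schedule; the base max LR, warmup length, and decay floor below are illustrative assumptions, not values from the eBay report:

```python
import math

def cosine_lr(step, total_steps, max_lr, warmup_steps=2_000, min_lr_frac=0.1):
    """Linear warmup, then cosine decay to min_lr_frac * max_lr.
    Warmup length and floor are illustrative, not eBay's values."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return max_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)

base_max_lr = 3e-4               # hypothetical Llama-class pretraining max LR
cpt_max_lr = 0.10 * base_max_lr  # knob 1: 10% of the base model's max LR
```

The schedule itself is ordinary; the load-bearing choice is `cpt_max_lr` being a small fraction of the base model's peak LR.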

Why continued pretraining (vs the alternatives)

  • vs from-scratch: orders of magnitude less time + compute. eBay's 1T tokens on 480 H100s over ~1 month is already a large investment, but still a small fraction of Llama 3.1's full pretraining budget. For a domain that overlaps substantially with the base model's training distribution, continued pretraining reuses that overlap for free.
  • vs fine-tuning only: fine-tuning typically doesn't shift the base knowledge; it shifts behaviour on small labeled data. If the domain contains genuinely new knowledge (new taxonomies, new vocabulary, new distributional properties), fine-tuning alone will not reliably teach it.
  • vs RAG at inference time: RAG is orthogonal — retrieval-augmented generation scales the knowledge pool at serving time but does not change what the base model has internalised. Continued pretraining changes the model's defaults; RAG changes the context. Most production stacks combine both.

Data pipeline shape

The eBay recipe:

  • Domain data — filtered + serialized eBay listings + product reviews, formatted for autoregressive LM.
  • Classifier-extracted domain subset — a classifier trained to identify "e-commerce-specific" text, used to mine additional e-commerce examples from a larger open-source corpus (expands the domain-side budget without over-reliance on pure eBay data, which may be narrow in style/coverage).
  • Replay general data — curated / publicly available / open-source datasets plus smaller high-quality ones, drawn to be close to the Llama 3.1 base distribution (so the replay signal actually reinforces existing knowledge rather than teaching slightly different knowledge).
  • Non-English slice — 10% of general data is non-English, to preserve / enhance multilingual capability as a side-objective.
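The mix above can be sketched as a weighted sampler over token streams. A minimal illustration, assuming three stream iterators whose names and contents are hypothetical; a production pipeline would do this at the shard or batch level rather than per example:

```python
import random
from collections import Counter
from itertools import islice, repeat

def sample_stream(domain, general_en, general_non_en, rng=random.Random(0)):
    """Interleave domain and replay data at the 1:1 ratio described above,
    with 10% of the general half drawn from the non-English slice.
    Stream contents and the iterator interface are illustrative assumptions."""
    while True:
        if rng.random() < 0.5:        # knob 2: half the step budget is domain data
            yield next(domain)
        elif rng.random() < 0.10:     # 10% of general data is non-English
            yield next(general_non_en)
        else:
            yield next(general_en)

# Quick check of the realised mix over 100k draws:
mix = Counter(islice(sample_stream(repeat("domain"), repeat("general-en"),
                                   repeat("general-non-en")), 100_000))
```

Expected fractions: ~50% domain, ~45% English general, ~5% non-English general (10% of the general half).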

Caveats

  • Recipe is corpus-dependent. The 10% / 1:1 optimum is eBay's; a different domain (biomedical, legal, code, etc.) may have different optima. Always re-sweep at small scale.
  • Base model choice is load-bearing. Continued pretraining on a weak base will not beat from-scratch with a good recipe. Llama 3.1 is a strong base.
  • Pair with monitoring for regression. eBay explicitly reports general-domain NLU regression as a tracked metric. Teams doing continued pretraining without general-domain eval are running blind to the forgetting cost.
  • Instruction tuning + RLHF follow. The continued-pretrained model is still base; it produces next-token completions, not aligned answers.
  • Small-scale sweep before the big run. eBay's posture: "We determine the optimal training setup through a series of experiments at a smaller scale." Continued pretraining at 480-GPU scale is expensive; small-scale sweeps find the LR + mix ratio before the compute commitment.
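The sweep-then-commit posture amounts to a grid over the two load-bearing knobs. A sketch, where `train_and_eval`, the candidate values, and the returned scores are hypothetical stand-ins for short proxy runs; the grids merely bracket eBay's reported optima (10% LR, 1:1 mix):

```python
from itertools import product

# Hypothetical small-scale sweep over the two load-bearing knobs.
# train_and_eval stands in for a short proxy run that returns
# (domain_score, general_score), so forgetting is tracked alongside gains.
lr_fractions = [0.05, 0.10, 0.20, 0.50]  # fraction of the base model's max LR
mix_ratios = [(1, 3), (1, 1), (3, 1)]    # general : domain sampling ratio

def run_sweep(train_and_eval):
    return {(lr_frac, mix): train_and_eval(lr_frac, mix)
            for lr_frac, mix in product(lr_fractions, mix_ratios)}
```

Returning both a domain score and a general-domain score per config is the point: picking the config by domain gain alone would hide the forgetting cost the caveats above warn about.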
