PATTERN
Continued pretraining for domain adaptation¶
Pattern¶
Take a capable open-weights foundation model and continue pretraining it on a balanced mix of domain-specific data and general-domain replay data, with a carefully tuned low learning rate and hyperparameters chosen via small-scale sweeps, orchestrated with 3D parallelism on a multi-node GPU cluster. Then follow with instruction tuning and RLHF alignment.
This pattern is the practical recipe for enterprises that (a) have significant proprietary domain data, (b) need a model that "knows" the domain more deeply than RAG or fine-tuning alone can provide, and (c) cannot justify the cost and time of a from-scratch pretraining run.
Canonical wiki instance¶
eBay's e-Llama (2025-01-17, Tier 3) is the canonical example:
- Base: Meta Llama 3.1 (8B + 70B).
- Domain: e-commerce (listings, reviews, classifier-extracted e-commerce subset of open-source data).
- Replay mix: 1:1 general-to-domain; general side = curated, publicly available, open-source corpora plus smaller high-quality sets, with 10% non-English.
- Hyperparameters: max LR = 10% of Llama 3.1's max LR, cosine schedule with warmup, batch size ~11.8M tokens, ~85k update steps, for ~1 trillion tokens total.
- Training topology: 480 H100 80GB GPUs (60 nodes × 8), NVLink intra-node + InfiniBand inter-node, Megatron-LM with 3D parallelism (concepts/3d-parallelism), distributed optimizer, and flash-attention-2.
- 70B training run: ~1 month wall-clock, ~340,000 GPU-hours.
- Benchmark outcome: ~25% English / ~30% non-English gain on e-commerce benchmarks; ~1% general-domain NLU regression on the 70B.
- Post-training: instruction tuning + RLHF alignment.
- (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
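The reported hyperparameters can be sanity-checked with a short sketch. The base max LR and warmup length below are assumptions (the source does not state Llama 3.1's actual max LR or eBay's warmup schedule); the token budget and 10% LR fraction come from the figures above.

```python
import math

# Assumed values, mirroring the e-Llama recipe above.
BASE_MAX_LR = 3e-4           # assumption: base model's max LR
MAX_LR = 0.10 * BASE_MAX_LR  # continued-pretraining max LR = 10% of base
TOTAL_STEPS = 85_000
WARMUP_STEPS = 1_000         # assumption: warmup length not disclosed
TOKENS_PER_STEP = 11_800_000  # ~11.8M-token batches

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * MAX_LR * (1 + math.cos(math.pi * progress))

total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
print(f"total tokens: {total_tokens / 1e12:.2f}T")  # 1.00T
print(f"peak LR: {lr_at(WARMUP_STEPS):.2e}")        # 3.00e-05
```

The arithmetic confirms the source's accounting: ~11.8M tokens/step × ~85k steps ≈ 1 trillion tokens.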
Steps¶
- Pick a strong open-weights base. Llama 3.1 / Mistral / DeepSeek / Qwen — the base provides the scaffolding; continued pretraining won't rescue a weak base.
- Prepare domain data. Filter, deduplicate, and serialize for autoregressive LM training. Train a domain classifier (a small supervised model) and use it to extract domain-specific examples from a larger open-source corpus — this expands the domain-side data budget without overfitting to a narrow in-house style.
- Prepare replay data. Curated / publicly available / open-source corpora resembling the base's pretraining distribution. Include a sliver of non-English if multilingual capability matters for the target application.
- Small-scale sweep for hyperparameters. Run short jobs (probably at 1B-7B scale, not 70B) to identify:
- Max LR — usually a fraction of the base's max LR. eBay's optimum: 10%. Start in the 5-20% range and sweep.
- Data mix ratio — general-to-domain. eBay's optimum: 1:1. Sweep 4:1 → 1:1 → 1:4.
- Batch size — set largely by the hardware budget, but also a plasticity-stability knob.
- Scale to the full run on a 3D-parallel cluster. Megatron-LM / DeepSpeed / NeMo. Compose TP (within NVLink domain) + PP (across InfiniBand) + DP (fills the rest). Use distributed optimizer (ZeRO-style) + flash-attention-2 + activation checkpointing.
- Track both domain and general-domain benchmarks during training. Domain-only reporting is blind to catastrophic forgetting. eBay reports both: ~25-30% domain gain AND ~1% general regression. Tracking forgetting is the discipline.
- Post-train. Instruction tuning on domain-curated supervised data. RLHF (or DPO / IPO / KTO) alignment on preference data. Safety-evaluation harness.
- Ship to production via whatever inference stack. The continued-pretrained model is a base/aligned model — deployment is orthogonal to the training recipe.
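The replay-mix step above can be sketched as a document-level interleave. This is a minimal illustration of the 1:1 ratio, with a helper name of our own; real pipelines sample and pack at the token level, with shuffling, rather than round-robin whole documents.

```python
def replay_mix(domain_docs, general_docs, general_per_domain=1):
    """Yield a stream with `general_per_domain` general-replay docs
    per domain doc (eBay's swept optimum was 1:1).

    Stops when either side is exhausted — a sketch, not a production
    data loader.
    """
    gen_it, dom_it = iter(general_docs), iter(domain_docs)
    while True:
        try:
            for _ in range(general_per_domain):
                yield next(gen_it)
            yield next(dom_it)
        except StopIteration:
            return

mixed = list(replay_mix(["d1", "d2"], ["g1", "g2"]))
print(mixed)  # ['g1', 'd1', 'g2', 'd2']
```

Sweeping the mix ratio then amounts to re-running short training jobs with `general_per_domain` (or its inverse) varied, per the 4:1 → 1:1 → 1:4 grid above.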
Why this pattern over alternatives¶
| Alternative | When it's better | When this pattern is better |
|---|---|---|
| From-scratch pretraining | you need full control over license / vocab / architecture; you have $100M+ compute budget; eBay's sister-track LiLiuM does this | you need a capable domain-adapted model in months-not-years |
| LoRA / parameter-efficient fine-tune | you need a lightweight "flavor" of the base; you can't afford full continued pretraining | the domain requires genuine new knowledge not expressible by a small delta |
| Instruction tuning / fine-tuning only | you need to change behavior on small labeled data; the base already "knows" the domain | the domain has significant novel distributional properties the base doesn't encode |
| RAG at inference time | knowledge changes too frequently to bake into weights; you need citation transparency | you need latent fluency in the domain, not just retrieval-dependent answers; combine with RAG for best results |
In practice, continued pretraining + RAG is a common combination: continued-pretrain for latent fluency, RAG for up-to-date facts.
Caveats¶
- Recipe is corpus-dependent. eBay's optima (10% LR fraction, 1:1 mix) are specific to eBay's corpus. Biomedical / legal / financial / code domains will have different optima. Always re-sweep.
- Forgetting is not zero. Even with the best replay recipe, expect ~1-5% general-domain regression at frontier scales. Budget for that.
- Expensive. 480 H100s for ~1 month = ~340k GPU-hours on the 70B. At typical 2025 cloud GPU pricing ($2-4/H100-hour) that's $680k-$1.4M per training run. Only makes sense if domain-adaptation value > cost.
- Checkpoint cadence + failure recovery matter. 1-month 480-GPU runs will experience hardware failures. Recipes should have a checkpoint-every-N-steps cadence and a resume-from-checkpoint story; eBay's specifics are not disclosed.
- Benchmark identity matters. "~25% gain on e-commerce benchmarks" is meaningful only relative to a named benchmark. eBay doesn't name theirs in the blog post; the companion arXiv:2501.09706 presumably does. Beware of unnamed benchmarks in other papers / blog posts.
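Since eBay's checkpointing specifics are not disclosed, here is a generic sketch of the checkpoint-every-N-steps + resume pattern the caveat calls for. JSON stands in for the sharded model/optimizer state a real run would save; the atomic-rename trick is the standard way to avoid torn checkpoints on crash.

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, step: int, ckpt_dir: str) -> str:
    """Write a step-tagged checkpoint atomically (write temp, then rename),
    so a crash mid-write never leaves a corrupt 'latest' checkpoint."""
    path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, **state}, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows
    return path

def latest_step(ckpt_dir: str) -> int:
    """Resume point: highest step with a completed checkpoint, else 0."""
    steps = [int(name[5:13]) for name in os.listdir(ckpt_dir)
             if name.startswith("step_") and name.endswith(".json")]
    return max(steps, default=0)

ckpt_dir = tempfile.mkdtemp()
CKPT_EVERY = 1_000
for step in range(0, 3_000, CKPT_EVERY):  # stand-in for the training loop
    save_checkpoint({"loss": 2.0}, step, ckpt_dir)
print(latest_step(ckpt_dir))  # 2000
```

On restart, the job reads `latest_step` and fast-forwards the data loader to that step before resuming optimization.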
Contrast with related patterns¶
- patterns/teacher-student-model-compression — different mechanism: a big teacher distills knowledge into a smaller student via soft-label supervised training, not autoregressive LM training on new data. Produces smaller models from bigger ones; doesn't adapt a model to a new domain.
- patterns/prototype-before-production — small-scale sweep before the big run is a specific instance of this meta-pattern.
Seen in¶
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — canonical instance: Llama 3.1 8B + 70B → e-Llama, 1T tokens, 480 H100s; 10% LR fraction / 1:1 mix / 10% non-English hyperparameters.
Related¶
- concepts/continued-pretraining / concepts/catastrophic-forgetting / concepts/replay-training
- concepts/3d-parallelism / concepts/data-parallelism / concepts/tensor-parallelism / concepts/pipeline-parallelism
- systems/e-llama / systems/llama-3-1 / systems/megatron-lm / systems/flash-attention-2
- systems/nvidia-h100 / systems/nvlink / systems/infiniband
- patterns/teacher-student-model-compression — adjacent model-transformation pattern.
- patterns/prototype-before-production — the small-scale-sweep meta-pattern.
- companies/ebay