Continued pretraining for domain adaptation

Pattern

Take a capable open-weights foundation model and continue pretraining it on a balanced mix of domain-specific data and general-domain replay data, with a carefully tuned low learning rate and hyperparameters swept at small scale, orchestrated via 3D parallelism on a multi-node GPU cluster; then follow with instruction tuning and RLHF alignment.

This pattern is the practical recipe for enterprises that (a) have significant proprietary domain data, (b) need a model that "knows" that domain more deeply than RAG or fine-tuning alone can provide, and (c) cannot justify the cost and time of a from-scratch pretraining run.

Canonical wiki instance

eBay's e-Llama (2025-01-17, Tier 3) is the canonical example:

  • Base: Meta Llama 3.1 (8B + 70B).
  • Domain: e-commerce (listings, reviews, classifier-extracted e-commerce subset of open-source data).
  • Replay mix: 1 : 1 general-to-domain; general side = curated / publicly available / open-source + smaller high-quality sets + 10% non-English.
  • Hyperparameters: max LR = 10% of Llama 3.1's max LR, cosine schedule with warmup, batch size ~11.8M tokens, ~85k update steps = 1 trillion tokens total.
  • Training topology: 480 H100 80GB GPUs (60 nodes × 8), NVLink intra-node + InfiniBand inter-node, Megatron-LM with 3D parallelism (concepts/3d-parallelism) + distributed optimizer + flash-attention-2.
  • 70B wall-clock: ~1 month. 70B GPU-hours: ~340,000.
  • Benchmark outcome: ~25% English / ~30% non-English gain on e-commerce benchmarks; ~1% general-domain NLU regression on the 70B.
  • Post-training: instruction tuning + RLHF alignment.
  • (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
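
The budget figures in the bullets above are internally consistent, as a quick back-of-the-envelope check shows (all inputs come from the bullets; the 30-day month is an approximation):

```python
# Sanity-check of the e-Llama 70B training budget from the figures above.
batch_tokens = 11.8e6        # ~11.8M tokens per update step
steps = 85_000               # ~85k update steps
total_tokens = batch_tokens * steps
print(f"total tokens ≈ {total_tokens / 1e12:.2f}T")   # ≈ 1.00T, matching "1 trillion tokens total"

gpus = 480                   # 60 nodes × 8 H100s
hours = 30 * 24              # ~1 month wall-clock, approximated as 30 days
gpu_hours = gpus * hours
print(f"GPU-hours ≈ {gpu_hours / 1e3:.0f}k")          # ≈ 346k, matching the ~340,000 figure
```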

Steps

  1. Pick a strong open-weights base. Llama 3.1 / Mistral / DeepSeek / Qwen — the base provides the scaffolding; continued pretraining won't rescue a weak base.
  2. Prepare domain data. Filter, deduplicate, and serialize for autoregressive language modeling. Train a domain classifier (a small supervised model) and use it to extract domain-specific examples from a larger open-source corpus; this expands the domain-side data budget without over-fitting to stylistic narrowness.
  3. Prepare replay data. Curated / publicly available / open-source corpora resembling the base's pretraining distribution. Include a sliver of non-English if multilingual capability matters for the target application.
  4. Small-scale sweep for hyperparameters. Run short jobs (typically at 1B-7B scale, not 70B) to identify:
       • Max LR: usually a fraction of the base's max LR. eBay's optimum: 10%. Start in the 5-20% range and sweep.
       • Data mix ratio (general-to-domain): eBay's optimum: 1:1. Sweep 4:1 → 1:1 → 1:4.
       • Batch size: driven by the hardware budget, but also a plasticity-stability knob.
  5. Scale to the full run on a 3D-parallel cluster. Megatron-LM / DeepSpeed / NeMo. Compose TP (within the NVLink domain) + PP (across InfiniBand) + DP (fills the rest). Use a distributed optimizer (ZeRO-style) + flash-attention-2 + activation checkpointing.
  6. Track both domain and general-domain benchmarks during training. Domain-only reporting is blind to catastrophic forgetting. eBay reports both: ~25-30% domain gain AND ~1% general regression. Tracking forgetting is the discipline.
  7. Post-train. Instruction tuning on domain-curated supervised data. RLHF (or DPO / IPO / KTO) alignment on preference data. Run a safety-evaluation harness before release.
  8. Ship to production via whatever inference stack. The continued-pretrained model is a standard base/aligned model; deployment is orthogonal to the training recipe.
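
The classifier-extraction idea in the domain-data step can be sketched as a filtering loop. A real pipeline would train a small supervised classifier (eBay's architecture is not disclosed in the source); here a toy keyword-overlap score stands in so the shape of the loop is clear, and both the lexicon and the threshold are illustrative assumptions:

```python
# Toy stand-in for the domain classifier: score each document by overlap with
# an (assumed, illustrative) domain lexicon, keep documents above a threshold.
DOMAIN_LEXICON = {"shipping", "listing", "seller", "refurbished", "warranty",
                  "condition", "buyer", "auction"}  # example e-commerce terms

def domain_score(doc: str) -> float:
    tokens = [t.strip(".,:;") for t in doc.lower().split()]
    if not tokens:
        return 0.0
    return sum(t in DOMAIN_LEXICON for t in tokens) / len(tokens)

def extract_domain_subset(corpus, threshold=0.10):
    """Keep documents whose score clears the (assumed, sweepable) threshold."""
    return [d for d in corpus if domain_score(d) >= threshold]

corpus = [
    "seller ships worldwide, item condition: refurbished, 90-day warranty",
    "the committee adjourned without reaching a decision",
]
extracted = extract_domain_subset(corpus)  # keeps only the first, domain-looking document
```

In practice the stand-in score would be replaced by a trained classifier's probability, and the threshold would be tuned against a held-out labeled sample.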
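
The TP × PP × DP composition in the scaling step is constrained by the cluster shape. A minimal sketch for the 60-node × 8-GPU topology (the specific split below is an assumed example, not eBay's disclosed configuration):

```python
# 3D-parallel decomposition sketch for a 60-node × 8-GPU cluster.
# Constraint: TP × PP × DP must equal the world size.
nodes, gpus_per_node = 60, 8
world_size = nodes * gpus_per_node          # 480 GPUs

tp = 8    # tensor parallel: kept inside one node so TP traffic stays on NVLink
pp = 6    # pipeline parallel: stages span nodes over InfiniBand
dp = world_size // (tp * pp)                # data parallel fills the rest

assert tp <= gpus_per_node                  # otherwise TP crosses the NVLink domain
assert tp * pp * dp == world_size
print(f"TP={tp} × PP={pp} × DP={dp} = {tp * pp * dp} GPUs")
```

Frameworks like Megatron-LM take these degrees as launch arguments; the bandwidth-hungry TP dimension is placed on NVLink and the lighter PP/DP dimensions on InfiniBand, which is the design choice the step describes.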
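
The dual-benchmark tracking step amounts to logging deltas against the base model on both axes and flagging regression past a budget. A minimal sketch (benchmark names, scores, and the 2% forgetting budget are all illustrative, not from the source):

```python
# Track domain AND general benchmarks against base-model scores; flag any
# benchmark that regresses past a budgeted tolerance (catastrophic forgetting).
baseline = {"ecommerce_qa": 0.52, "general_nlu": 0.79}   # assumed base-model scores

def check_eval(scores: dict, forgetting_budget: float = 0.02):
    report = {bench: score - baseline[bench] for bench, score in scores.items()}
    regressed = [b for b, d in report.items() if d < -forgetting_budget]
    return report, regressed

# Mid-training eval: large domain gain, small general-domain slip,
# mirroring the ~25-30% gain / ~1% regression shape eBay reports.
report, regressed = check_eval({"ecommerce_qa": 0.66, "general_nlu": 0.78})
```

If `regressed` is non-empty, the run's replay ratio or learning rate is revisited rather than the regression being discovered after the full token budget is spent.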

Why this pattern over alternatives

  • From-scratch pretraining. Better when you need full control over license / vocab / architecture and have a $100M+ compute budget (eBay's sister track, LiLiuM, does this). This pattern is better when you need a capable domain-adapted model in months, not years.
  • LoRA / parameter-efficient fine-tuning. Better when you need a lightweight "flavor" of the base or can't afford full continued pretraining. This pattern is better when the domain requires genuine new knowledge not expressible as a small weight delta.
  • Instruction tuning / fine-tuning only. Better when you need to change behavior with small labeled data and the base already "knows" the domain. This pattern is better when the domain has significant novel distributional properties the base doesn't encode.
  • RAG at inference time. Better when knowledge changes too frequently to bake into weights, or when you need citation transparency. This pattern is better when you need latent fluency in the domain, not just retrieval-dependent answers; combine with RAG for best results.

In practice, continued pretraining + RAG is a common combination: continued-pretrain for latent fluency, RAG for up-to-date facts.

Caveats

  • Recipe is corpus-dependent. eBay's optima (10% max-LR fraction, 1:1 mix) are specific to eBay's corpus; biomedical / legal / financial / code domains will have different optima. Always re-sweep.
  • Forgetting is not zero. Even with the best replay recipe, expect ~1-5% general-domain regression at frontier scales. Budget for that.
  • Expensive. 480 H100s for ~1 month = ~340k GPU-hours on the 70B. At typical 2025 cloud GPU pricing ($2-4/H100-hour) that's $680k-$1.4M per training run. Only makes sense if domain-adaptation value > cost.
  • Checkpoint cadence + failure recovery matter. 1-month 480-GPU runs will experience hardware failures. Recipes should have checkpoint-every-N-steps + resume-from-checkpoint story; eBay's specifics are not disclosed.
  • Benchmark identity matters. "~25% gain on e-commerce benchmarks" is meaningful only relative to a named benchmark. eBay doesn't name theirs in the blog post; the companion arXiv:2501.09706 presumably does. Beware of unnamed benchmarks in other papers / blog posts.
  • patterns/teacher-student-model-compression — different mechanism: a big teacher distills knowledge into a smaller student via soft-label supervised training, not new-data autoregressive LM. Produces smaller models from bigger ones; doesn't adapt a model to a new domain.
  • patterns/prototype-before-production — small-scale sweep before the big run is a specific instance of this meta-pattern.
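
The checkpoint-cadence caveat reduces to a simple contract: write a checkpoint every N steps atomically, and resume from the latest one after a failure. A minimal sketch (eBay's actual cadence and format are undisclosed; paths and the JSON payload here are illustrative, with JSON standing in for real tensor checkpoints):

```python
# Checkpoint-every-N-steps + resume-from-checkpoint sketch.
import json
import os
import tempfile

def save_checkpoint(state: dict, step: int, ckpt_dir: str):
    """Write to a temp file, then rename into place: os.replace is atomic on
    POSIX, so a crash mid-write cannot leave a torn checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step:08d}.json")  # zero-padded so names sort
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, **state}, f)
    os.replace(tmp, path)

def latest_checkpoint(ckpt_dir: str):
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step_"))
    if not ckpts:
        return None
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        return json.load(f)

ckpt_dir = tempfile.mkdtemp()
for step in range(0, 3000, 1000):        # checkpoint every N = 1000 steps
    save_checkpoint({"loss": 2.5 - step * 1e-4}, step, ckpt_dir)
resume = latest_checkpoint(ckpt_dir)     # after a failure, resume from step 2000
```

A production run would checkpoint optimizer state and data-loader position as well, so the resumed run replays no tokens and skips none.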
