Continued pretraining for domain adaptation

Pattern

Take a capable open-weights foundation model and continue pretraining it on a balanced mix of domain-specific data and general-domain replay data, with a carefully tuned low learning rate and hyperparameters swept at small scale, orchestrated via 3D parallelism on a multi-node GPU cluster; then follow with instruction tuning and RLHF alignment.

This pattern is the practical recipe for enterprises that (a) have significant proprietary domain data, (b) need a model that "knows" that domain more deeply than RAG or fine-tuning alone can provide, and (c) cannot justify the cost and time of a from-scratch pretraining run.

Canonical wiki instance

eBay's e-Llama (2025-01-17, Tier 3) is the canonical example:

  • Base: Meta Llama 3.1 (8B + 70B).
  • Domain: e-commerce (listings, reviews, classifier-extracted e-commerce subset of open-source data).
  • Replay mix: 1 : 1 general-to-domain; general side = curated / publicly available / open-source + smaller high-quality sets + 10% non-English.
  • Hyperparameters: max LR = 10% of Llama 3.1's max LR, cosine schedule with warmup, batch size ~11.8M tokens, ~85k update steps = 1 trillion tokens total.
  • Training topology: 480 H100 80GB GPUs (60 nodes × 8), NVLink intra-node + InfiniBand inter-node, Megatron-LM with 3D parallelism (concepts/3d-parallelism) + distributed optimizer + flash-attention-2.
  • 70B wall-clock: ~1 month. 70B GPU-hours: ~340,000.
  • Benchmark outcome: ~25% English / ~30% non-English gain on e-commerce benchmarks; ~1% general-domain NLU regression on the 70B.
  • Post-training: instruction tuning + RLHF alignment.
  • (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
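
The budget figures in the bullets above are internally consistent, as a quick back-of-the-envelope check shows (all inputs come from the bullets; the 30-day month is an approximation):

```python
# Sanity-check of the e-Llama 70B training budget from the figures above.
batch_tokens = 11.8e6        # ~11.8M tokens per update step
steps = 85_000               # ~85k update steps
total_tokens = batch_tokens * steps
print(f"total tokens ≈ {total_tokens / 1e12:.2f}T")   # ≈ 1.00T, matching "1 trillion tokens total"

gpus = 480                   # 60 nodes × 8 H100s
hours = 30 * 24              # ~1 month wall-clock, approximated as 30 days
gpu_hours = gpus * hours
print(f"GPU-hours ≈ {gpu_hours / 1e3:.0f}k")          # ≈ 346k, matching the ~340,000 figure
```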

Steps

  1. Pick a strong open-weights base. Llama 3.1 / Mistral / DeepSeek / Qwen — the base provides the scaffolding; continued pretraining won't rescue a weak base.
  2. Prepare domain data. Filter, deduplicate, and serialize for autoregressive language modeling. Train a domain classifier (a small supervised model) and use it to extract domain-specific examples from a larger open-source corpus; this expands the domain-side data budget without over-fitting to stylistic narrowness.
  3. Prepare replay data. Curated / publicly available / open-source corpora resembling the base's pretraining distribution. Include a sliver of non-English if multilingual capability matters for the target application.
  4. Small-scale sweep for hyperparameters. Run short jobs (typically at 1B-7B scale, not 70B) to identify:
       • Max LR: usually a fraction of the base's max LR. eBay's optimum: 10%. Start in the 5-20% range and sweep.
       • Data mix ratio (general-to-domain): eBay's optimum: 1:1. Sweep 4:1 → 1:1 → 1:4.
       • Batch size: driven by the hardware budget, but also a plasticity-stability knob.
  5. Scale to the full run on a 3D-parallel cluster. Megatron-LM / DeepSpeed / NeMo. Compose TP (within the NVLink domain) + PP (across InfiniBand) + DP (fills the rest). Use a distributed optimizer (ZeRO-style) + flash-attention-2 + activation checkpointing.
  6. Track both domain and general-domain benchmarks during training. Domain-only reporting is blind to catastrophic forgetting. eBay reports both: ~25-30% domain gain AND ~1% general regression. Tracking forgetting is the discipline.
  7. Post-train. Instruction tuning on domain-curated supervised data. RLHF (or DPO / IPO / KTO) alignment on preference data. Run a safety-evaluation harness before release.
  8. Ship to production via whatever inference stack. The continued-pretrained model is a standard base/aligned model; deployment is orthogonal to the training recipe.
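
The classifier-extraction idea in the domain-data step can be sketched as a filtering loop. A real pipeline would train a small supervised classifier (eBay's architecture is not disclosed in the source); here a toy keyword-overlap score stands in so the shape of the loop is clear, and both the lexicon and the threshold are illustrative assumptions:

```python
# Toy stand-in for the domain classifier: score each document by overlap with
# an (assumed, illustrative) domain lexicon, keep documents above a threshold.
DOMAIN_LEXICON = {"shipping", "listing", "seller", "refurbished", "warranty",
                  "condition", "buyer", "auction"}  # example e-commerce terms

def domain_score(doc: str) -> float:
    tokens = [t.strip(".,:;") for t in doc.lower().split()]
    if not tokens:
        return 0.0
    return sum(t in DOMAIN_LEXICON for t in tokens) / len(tokens)

def extract_domain_subset(corpus, threshold=0.10):
    """Keep documents whose score clears the (assumed, sweepable) threshold."""
    return [d for d in corpus if domain_score(d) >= threshold]

corpus = [
    "seller ships worldwide, item condition: refurbished, 90-day warranty",
    "the committee adjourned without reaching a decision",
]
extracted = extract_domain_subset(corpus)  # keeps only the first, domain-looking document
```

In practice the stand-in score would be replaced by a trained classifier's probability, and the threshold would be tuned against a held-out labeled sample.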
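
The TP × PP × DP composition in the scaling step is constrained by the cluster shape. A minimal sketch for the 60-node × 8-GPU topology (the specific split below is an assumed example, not eBay's disclosed configuration):

```python
# 3D-parallel decomposition sketch for a 60-node × 8-GPU cluster.
# Constraint: TP × PP × DP must equal the world size.
nodes, gpus_per_node = 60, 8
world_size = nodes * gpus_per_node          # 480 GPUs

tp = 8    # tensor parallel: kept inside one node so TP traffic stays on NVLink
pp = 6    # pipeline parallel: stages span nodes over InfiniBand
dp = world_size // (tp * pp)                # data parallel fills the rest

assert tp <= gpus_per_node                  # otherwise TP crosses the NVLink domain
assert tp * pp * dp == world_size
print(f"TP={tp} × PP={pp} × DP={dp} = {tp * pp * dp} GPUs")
```

Frameworks like Megatron-LM take these degrees as launch arguments; the bandwidth-hungry TP dimension is placed on NVLink and the lighter PP/DP dimensions on InfiniBand, which is the design choice the step describes.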
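
The dual-benchmark tracking step amounts to logging deltas against the base model on both axes and flagging regression past a budget. A minimal sketch (benchmark names, scores, and the 2% forgetting budget are all illustrative, not from the source):

```python
# Track domain AND general benchmarks against base-model scores; flag any
# benchmark that regresses past a budgeted tolerance (catastrophic forgetting).
baseline = {"ecommerce_qa": 0.52, "general_nlu": 0.79}   # assumed base-model scores

def check_eval(scores: dict, forgetting_budget: float = 0.02):
    report = {bench: score - baseline[bench] for bench, score in scores.items()}
    regressed = [b for b, d in report.items() if d < -forgetting_budget]
    return report, regressed

# Mid-training eval: large domain gain, small general-domain slip,
# mirroring the ~25-30% gain / ~1% regression shape eBay reports.
report, regressed = check_eval({"ecommerce_qa": 0.66, "general_nlu": 0.78})
```

If `regressed` is non-empty, the run's replay ratio or learning rate is revisited rather than the regression being discovered after the full token budget is spent.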

Why this pattern over alternatives

  • From-scratch pretraining. Better when you need full control over license / vocab / architecture and have a $100M+ compute budget (eBay's sister track, LiLiuM, does this). This pattern is better when you need a capable domain-adapted model in months, not years.
  • LoRA / parameter-efficient fine-tuning. Better when you need a lightweight "flavor" of the base or can't afford full continued pretraining. This pattern is better when the domain requires genuine new knowledge not expressible as a small weight delta.
  • Instruction tuning / fine-tuning only. Better when you need to change behavior with small labeled data and the base already "knows" the domain. This pattern is better when the domain has significant novel distributional properties the base doesn't encode.
  • RAG at inference time. Better when knowledge changes too frequently to bake into weights, or when you need citation transparency. This pattern is better when you need latent fluency in the domain, not just retrieval-dependent answers; combine with RAG for best results.

In practice, continued pretraining + RAG is a common combination: continued-pretrain for latent fluency, RAG for up-to-date facts.

Caveats

  • Recipe is corpus-dependent. eBay's optima (10% max-LR fraction, 1:1 mix) are specific to eBay's corpus; biomedical / legal / financial / code domains will have different optima. Always re-sweep.
  • Forgetting is not zero. Even with the best replay recipe, expect ~1-5% general-domain regression at frontier scales. Budget for that.
  • Expensive. 480 H100s for ~1 month = ~340k GPU-hours on the 70B. At typical 2025 cloud GPU pricing ($2-4/H100-hour) that's $680k-$1.4M per training run. Only makes sense if domain-adaptation value > cost.
  • Checkpoint cadence + failure recovery matter. 1-month 480-GPU runs will experience hardware failures. Recipes should have checkpoint-every-N-steps + resume-from-checkpoint story; eBay's specifics are not disclosed.
  • Benchmark identity matters. "~25% gain on e-commerce benchmarks" is meaningful only relative to a named benchmark. eBay doesn't name theirs in the blog post; the companion arXiv:2501.09706 presumably does. Beware of unnamed benchmarks in other papers / blog posts.
  • patterns/teacher-student-model-compression — different mechanism: a big teacher distills knowledge into a smaller student via soft-label supervised training, not new-data autoregressive LM. Produces smaller models from bigger ones; doesn't adapt a model to a new domain.
  • patterns/prototype-before-production — small-scale sweep before the big run is a specific instance of this meta-pattern.
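
The checkpoint-cadence caveat reduces to a simple contract: write a checkpoint every N steps atomically, and resume from the latest one after a failure. A minimal sketch (eBay's actual cadence and format are undisclosed; paths and the JSON payload here are illustrative, with JSON standing in for real tensor checkpoints):

```python
# Checkpoint-every-N-steps + resume-from-checkpoint sketch.
import json
import os
import tempfile

def save_checkpoint(state: dict, step: int, ckpt_dir: str):
    """Write to a temp file, then rename into place: os.replace is atomic on
    POSIX, so a crash mid-write cannot leave a torn checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step:08d}.json")  # zero-padded so names sort
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, **state}, f)
    os.replace(tmp, path)

def latest_checkpoint(ckpt_dir: str):
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step_"))
    if not ckpts:
        return None
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        return json.load(f)

ckpt_dir = tempfile.mkdtemp()
for step in range(0, 3000, 1000):        # checkpoint every N = 1000 steps
    save_checkpoint({"loss": 2.5 - step * 1e-4}, step, ckpt_dir)
resume = latest_checkpoint(ckpt_dir)     # after a failure, resume from step 2000
```

A production run would checkpoint optimizer state and data-loader position as well, so the resumed run replays no tokens and skips none.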
