Scaling Large Language Models for e-Commerce: The Development of a Llama-Based Customized LLM¶
Summary¶
eBay's 2025-01-17 post describes the training-infrastructure and data-mix design behind e-Llama — 8-billion- and 70-billion-parameter LLMs adapted from Meta's Llama 3.1 base models via continued pretraining on a mix of eBay-proprietary and general-domain data. e-Llama is positioned as the adapt-an-existing-model arm of a hybrid LLM strategy that runs alongside fully-from-scratch models (the LiLiuM family, referenced out to arXiv:2406.12023). The core architectural content is the training topology at 480 GPUs: 60 nodes × 8 NVIDIA H100 80GB GPUs, connected intra-node via NVIDIA NVLink and inter-node via InfiniBand, orchestrated by Megatron-LM with 3D parallelism (data + tensor + pipeline) plus distributed optimizer states and flash-attention-2. Concrete scale: 1 trillion tokens, ~85k update steps, batch size ~11.8M tokens, ~1 month wall-clock for the 70B model, ~340k GPU-hours. Continued-pretraining hyperparameters were determined by a small-scale sweep: optimal max learning rate = 10% of the original pretraining max, optimal general-to-e-commerce sampling ratio = 1:1, cosine schedule with warmup. Outcome: ~25% improvement on English e-commerce benchmarks, ~30% on non-English, with only ~1% degradation on general-domain NLU benchmarks for the 70B variant. The final models are then instruction-tuned and RLHF-aligned.
The post's architectural signal lies in three things: the LR ratio and data-mix ratio as the load-bearing control knobs for continued pretraining at scale (the 10% / 1:1 numbers are specific enough to serve as a baseline for other teams); the 480-GPU Megatron-LM 3D-parallelism topology (training-side, distinct from the inference-side tensor/pipeline parallelism patterns already in the wiki); and replay as the architectural answer to catastrophic forgetting. The post is training-infrastructure heavy and serving-infrastructure absent — no inference backend, no per-request latency, no QPS, no cost-per-token — consistent with its framing as a training retrospective rather than a serving deep-dive.
Key takeaways¶
- Hybrid LLM strategy: build from scratch AND adapt. eBay runs two parallel tracks: (a) LiLiuM family — full-from-scratch e-commerce LLMs with end-to-end control over license, data, architecture, vocabulary (referenced out to arXiv:2406.12023); (b) e-Llama — adapt Llama 3.1 via continued pretraining to move faster and unlock value without rebuilding the base. "The combination of open and proprietary models gives us the best of both worlds to achieve fine-tuned, scalable, and cost-effective solutions that are tuned to e-commerce applications." (Source: this article)
- Continued pretraining at 1T tokens on Llama 3.1 base. 8B and 70B models adapted via autoregressive-LM continued pretraining on a mixture of filtered and serialized public eBay listings/reviews, general-domain replay data (curated open-source corpora), and 10% non-English data to strengthen multilingual capability. An e-commerce classifier is trained to extract e-commerce-relevant examples from a larger open-source dataset. "The goal is to infuse e-commerce specific knowledge into the Llama 3.1 base models, without the models forgetting information they have learned in their original pretraining." (Source: this article)
- Catastrophic forgetting is managed via replay. The central continued-pretraining trade-off: push the model toward the new domain without eroding general-domain capability. eBay's answer: include examples from the original pretraining distribution (curated/publicly-available/open-source + high-quality smaller datasets) in the new mix. "This 'replay' has been shown to help a model retain previously learned information." (Source: this article)
- 480-GPU Megatron-LM 3D-parallelism training topology. 60 nodes × 8 × NVIDIA H100 80GB = 480 H100s total. Intra-node: NVLink. Inter-node: InfiniBand. Framework: Megatron-LM. Parallelism axes used together: data parallel (DP), tensor parallel (TP), pipeline parallel (PP), plus distributed optimizer states + flash-attention-2 + "among other optimizations." "Model training at this scale requires an efficient distribution of model and optimizer states across several GPUs and sometimes even across nodes." (Source: this article)
- Hyperparameters determined by small-scale sweep before full run. Key result: max LR = 10% of the original Llama 3.1 max LR; general-to-e-commerce data sampling ratio = 1:1. These two knobs were identified as optimal via a series of small-scale experiments, then scaled to the full run. Reminder that continued-pretraining at frontier scale is still governed by empirical small-scale design-of-experiments, not by closed-form theory. (Source: this article)
- Concrete training cost: 1T tokens, ~85k steps, ~11.8M tokens/batch. Cosine LR schedule with warmup. Training the 70B model on 1 trillion tokens took ~1 month wall-clock and ~340,000 GPU-hours. "Comparing these numbers to what has been reported for the Llama 2 base model training, we find our setup to be even more efficient." (Source: this article) — Note: the comparison is against Llama 2, not Llama 3.1, and no MFU or tokens-per-GPU-per-second number is disclosed.
- Benchmark outcome: ~25% English gain, ~30% non-English gain, ~1% general-domain regression (70B). On e-commerce-specific benchmarks: ~25% improvement for English, ~30% for non-English vs the Llama 3.1 base. On general-domain NLU: only ~1% degradation for the 70B variant. Consistent with the stated replay-training objective. "At the same time, we observe only 1% degradation on general domain NLU benchmarks for the large e-Llama 70B model." (Source: this article)
- Instruction tuning + RLHF alignment post-pretraining. "After pretraining, we further instruction-tuned the models, aligning them with human feedback to ensure they generated safe and contextually appropriate content. This tuning also helped the models learn guardrails and follow explicit instructions, enhancing their practical application." (Source: this article) — No disclosure of the RLHF reward-model architecture, preference-data volume, PPO vs DPO vs another variant, or safety-evaluation harness.
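The LR recipe from the sweep takeaway above (max LR at 10% of the base model's pretraining max, cosine decay with warmup) can be sketched as a small schedule function. This is illustrative only: the base max LR, warmup length, and minimum LR below are assumptions, none of them disclosed in the post.

```python
import math

def cosine_lr_with_warmup(step, total_steps, max_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical numbers: the Llama 3.1 base max LR is NOT taken from the post.
BASE_MAX_LR = 3e-4                      # assumption, for illustration only
CONTINUED_MAX_LR = 0.1 * BASE_MAX_LR    # the post's 10%-of-base rule
TOTAL_STEPS = 85_000                    # ~85k update steps, from the post
WARMUP = 2_000                          # assumption; warmup length not disclosed

# LR peaks at exactly CONTINUED_MAX_LR when warmup ends, then decays to 0.
peak = max(cosine_lr_with_warmup(s, TOTAL_STEPS, CONTINUED_MAX_LR, WARMUP)
           for s in range(0, TOTAL_STEPS + 1, 1000))
print(f"peak LR = {peak:.1e}")  # 3.0e-05
```

The 10% rule matters because a continued-pretraining run that reuses the original max LR tends to overwrite base-model knowledge; the sweep found one order-of-magnitude-smaller peak to be the sweet spot.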
Operational numbers¶
| Metric | Value |
|---|---|
| Base model | Meta Llama 3.1 (8B + 70B) |
| Training tokens | 1 trillion total |
| Training hardware | 60 nodes × 8 × NVIDIA H100 80GB = 480 H100 GPUs |
| Intra-node interconnect | NVIDIA NVLink |
| Inter-node interconnect | InfiniBand |
| Training framework | Megatron-LM |
| Parallelism | 3D (data + tensor + pipeline) + distributed optimizer + flash-attention-2 |
| Batch size | ~11.8M tokens |
| Update steps | ~85,000 |
| Max learning rate | 10% of Llama 3.1 base max LR |
| LR schedule | cosine with warmup |
| Data mix (general : e-commerce) | 1 : 1 |
| Non-English share | 10% of general data |
| 70B wall-clock | ~1 month |
| 70B GPU-hours | ~340,000 |
| English e-commerce benchmark gain | ~25% |
| Non-English e-commerce benchmark gain | ~30% |
| General-domain NLU regression (70B) | ~1% |
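The table's scale numbers are internally consistent and can be sanity-checked in a few lines. The DP × TP × PP factorizations at the end are purely hypothetical: the post does not disclose the parallelism degrees, only that all three axes are used on 480 GPUs.

```python
# Sanity-check the disclosed scale numbers against each other.
tokens_total = 1e12          # 1 trillion training tokens
update_steps = 85_000        # ~85k update steps
gpus = 60 * 8                # 60 nodes x 8 H100s = 480 GPUs

tokens_per_step = tokens_total / update_steps
print(f"{tokens_per_step / 1e6:.1f}M tokens/step")  # 11.8M, matching the table

# ~1 month wall-clock on the full cluster:
gpu_hours = gpus * 30 * 24
print(f"{gpu_hours:,} GPU-hours")  # 345,600 -- consistent with the reported ~340k

# Hypothetical DP x TP x PP factorizations of the 480-GPU cluster
# (degrees NOT disclosed in the post; these are illustrative only).
for dp, tp, pp in [(15, 8, 4), (30, 8, 2), (60, 4, 2)]:
    assert dp * tp * pp == gpus
```

The close match between 480 GPUs × ~720 hours and the reported ~340k GPU-hours suggests the GPU-hour figure simply reflects full-cluster occupancy for the month, not an MFU-adjusted number.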
Systems / concepts / patterns extracted¶
Systems:
- systems/e-llama — eBay's Llama-3.1-derived 8B + 70B continued-pretrained e-commerce LLM.
- systems/llama-3-1 — Meta's open-weights Llama 3.1 base family; eBay's adaptation target.
- systems/megatron-lm — NVIDIA's highly-optimised LLM training framework supporting 3D parallelism.
- systems/flash-attention-2 — memory-efficient + IO-aware attention kernel used in the e-Llama training recipe.
- systems/nvidia-h100 — the 80GB SXM GPUs that make up the 480-GPU training cluster.
- systems/nvlink — NVIDIA's high-bandwidth intra-node GPU interconnect.
- systems/infiniband — the inter-node fabric connecting the 60 nodes.
Concepts:
- concepts/continued-pretraining — the core technique: keep training an existing base model on new-domain data to infuse domain knowledge without a from-scratch build.
- concepts/catastrophic-forgetting — the failure mode continued pretraining has to manage: new-domain gains eroding general-domain skill.
- concepts/3d-parallelism — DP + TP + PP composed for billion-parameter training at multi-node scale.
- concepts/data-parallelism — replicating the model across workers, sharding the data.
- concepts/tensor-parallelism — splitting weight matrices across GPUs.
- concepts/pipeline-parallelism — splitting layers across GPUs.
- concepts/replay-training — including examples from the original pretraining distribution in the new mix to resist catastrophic forgetting.
- concepts/training-serving-boundary — this post sits firmly on the training side; serving is absent.
Patterns:
- patterns/continued-pretraining-for-domain-adaptation — the full recipe: pick a capable open base, include replay data, tune LR at ~10% of base, balance general:domain ~1:1, scale on a Megatron-LM 3D-parallel cluster, validate on both domain-specific and general benchmarks to monitor regression, then instruction-tune + RLHF.
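The 1:1 general-to-e-commerce mix at the heart of this pattern can be implemented as a weighted sampler over two data streams. A minimal sketch, assuming simple per-example sampling; the post does not say whether eBay mixed at the example, shard, or token level.

```python
import itertools
import random

def mixed_stream(general_iter, domain_iter, general_weight=0.5, seed=0):
    """Yield examples from two (possibly infinite) streams at a fixed ratio.

    general_weight=0.5 gives the 1:1 general:e-commerce mix from the post.
    The general-domain half is the "replay" data that resists
    catastrophic forgetting of the base model's original knowledge.
    """
    rng = random.Random(seed)
    while True:
        source = general_iter if rng.random() < general_weight else domain_iter
        yield next(source)

# Usage with toy infinite streams standing in for the real corpora:
general = (f"general-{i}" for i in itertools.count())
domain = (f"ecom-{i}" for i in itertools.count())

batch = list(itertools.islice(mixed_stream(general, domain), 10_000))
general_share = sum(x.startswith("general") for x in batch) / len(batch)
print(f"general share = {general_share:.2f}")  # close to 0.50
```

Raising `general_weight` trades domain gains for less forgetting and vice versa; the post's contribution is the empirical finding that 0.5 was the optimum for this workload.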
Caveats¶
- No serving-infrastructure content. Inference backend, per-query latency, QPS, cost-per-token, per-user routing, multi-region serving — none disclosed. This is a training retrospective; the deployment of e-Llama into eBay buy/sell product surfaces is out of scope.
- Model FLOPs Utilisation (MFU) not disclosed. The "more efficient than Llama 2" claim is unquantified (no tokens-per-second-per-GPU number, no MFU percentage, no comparison methodology).
- Parallelism degrees not disclosed. DP, TP, PP values that partition the 480 GPUs are not stated.
- Data-volume numbers are relative, not absolute. No disclosure of how many tokens of eBay listings vs reviews vs extracted open-source examples, no disclosure of filtering/dedup/tokenization specifics.
- Instruction-tuning + RLHF details opaque. No reward-model architecture, preference-data volume, optimization algorithm (PPO / DPO / IPO / KTO), or safety-evaluation methodology.
- Evaluation benchmark identity not named. "E-commerce-specific benchmarks" and "general domain NLU benchmarks" are abstract — no named benchmarks (is it MMLU? HELM? internal? the LiLiuM eval suite?) disclosed, which limits ability to cross-compare with other domain-adapted LLMs.
- Classifier-based e-commerce data extraction is underspecified. eBay trains a classifier on "e-commerce-specific examples" to mine a larger open-source corpus, but the classifier architecture, training data, precision/recall numbers, and downstream impact are not disclosed.
- Checkpoint cadence, resume-from-failure mechanics, per-step-time, and communication-overhead numbers absent. For a 1-month 480-GPU run, failure-recovery architecture is load-bearing but not discussed.
- Detail lives in the companion paper, not here. The linked arXiv:2501.09706 paper ("Domain Adaptation of Foundation LLMs for e-Commerce") carries the detailed methodology; this blog post is an executive summary of it.
Source¶
- Original: https://innovation.ebayinc.com/stories/scaling-large-language-models-for-e-commerce-the-development-of-a-llama-based-customized-llm-for-e-commerce/
- Companion paper: arXiv:2501.09706 — "Domain Adaptation of Foundation LLMs for e-Commerce"
- LiLiuM paper (sister track): arXiv:2406.12023 — "LiLiuM: eBay's Large Language Models for e-commerce"
- Raw markdown:
raw/ebay/2025-01-17-scaling-large-language-models-for-e-commerce-the-development-8b608fab.md
Related¶
- companies/ebay
- systems/e-llama / systems/llama-3-1 / systems/megatron-lm / systems/flash-attention-2
- systems/nvidia-h100 / systems/nvlink / systems/infiniband
- concepts/continued-pretraining / concepts/catastrophic-forgetting / concepts/replay-training
- concepts/3d-parallelism / concepts/data-parallelism / concepts/tensor-parallelism / concepts/pipeline-parallelism
- concepts/training-serving-boundary
- patterns/continued-pretraining-for-domain-adaptation