
e-Llama

e-Llama is eBay's family of Llama-3.1-derived 8-billion and 70-billion parameter LLMs, adapted to the e-commerce domain via continued pretraining on a 1-trillion-token mixture of eBay-proprietary and general-domain data, followed by instruction tuning and RLHF alignment. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

Positioning

e-Llama is the adapt-existing-model arm of eBay's two-track LLM strategy; the other arm is the LiLiuM family, e-commerce LLMs trained fully from scratch. The stated rationale for running both:

"Training a large-scale LLM from scratch is a very time- and resource-intensive process. In order to move fast, one could use existing pretrained models, such as Llama 3.1, for their use cases. However, these models typically lack specific knowledge, in our case about the e-commerce domain."

e-Llama's role is the velocity lane: "move faster and unlock more value, as we do not have to develop the models completely from scratch." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

Training recipe

  • Base: Llama 3.1 8B + 70B.
  • Technique: continued pretraining (autoregressive LM objective).
  • Training tokens: 1 trillion total.
  • Data mix: a 1:1 general-to-e-commerce sampling ratio. General data combines replay from curated, public, and open-source corpora (to resist catastrophic forgetting) with 10% non-English text. E-commerce data combines filtered, serialized eBay listings and product reviews with a classifier-extracted e-commerce subset mined from a larger open-source dataset.
  • Hyperparameters (determined via small-scale sweeps): max LR = 10% of the original Llama 3.1 max LR; cosine schedule with warmup; batch size ~11.8M tokens; ~85,000 update steps.
  • Post-training: instruction tuning + RLHF alignment.
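The schedule and step/batch numbers above can be sketched in a few lines. This is a minimal illustration, not eBay's implementation: the post gives only the 10%-of-Llama-3.1 LR ratio, so the absolute base LR and warmup length below are placeholder assumptions. The final lines check that the stated step count and batch size reproduce the ~1-trillion-token budget:

```python
import math

def cosine_lr_with_warmup(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr over total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

BASE_MAX_LR = 3e-4          # placeholder: the post does not state Llama 3.1's max LR
MAX_LR = 0.1 * BASE_MAX_LR  # e-Llama max LR = 10% of the base model's max LR
WARMUP_STEPS = 1_000        # placeholder: warmup length is not disclosed
TOTAL_STEPS = 85_000        # ~85,000 update steps
TOKENS_PER_STEP = 11.8e6    # batch size ~11.8M tokens

# Sanity check: steps x tokens/step should land near the stated 1T-token budget.
total_tokens = TOTAL_STEPS * TOKENS_PER_STEP
print(f"total tokens ~ {total_tokens:.3e}")  # ~1.003e12, i.e. roughly 1 trillion
```

The check confirms the recipe's numbers are internally consistent: 85,000 steps at ~11.8M tokens each is ~1.003e12 tokens.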
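The 1:1 data mix can likewise be illustrated with a toy source sampler. The corpus names and weights below are assumptions for illustration only: the post states the 1:1 general/e-commerce ratio and a 10% non-English figure, and this sketch reads the 10% as a fraction of the general half; the real sampler implementation is not described:

```python
import random

# Toy sources standing in for the real corpora (names/weights are illustrative).
corpora = {
    "general_english":     0.45,  # replay from curated/public/open-source corpora
    "general_non_english": 0.05,  # 10% of the general half: 0.10 * 0.50
    "ecommerce":           0.50,  # eBay listings/reviews + classifier-mined subset
}

def sample_source(rng):
    """Draw a source name with probability proportional to its mixture weight."""
    r = rng.random()
    cum = 0.0
    for name, weight in corpora.items():
        cum += weight
        if r < cum:
            return name
    return name  # guard against floating-point round-off at cum ~ 1.0

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(100_000)]
ecom_share = draws.count("ecommerce") / len(draws)
print(f"e-commerce share ~ {ecom_share:.2f}")  # ~0.50, matching the 1:1 ratio
```

Over many draws the e-commerce share converges to 0.5, matching the stated 1:1 ratio.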

Training topology

The post describes a continued-pretraining run on 480 GPUs lasting roughly one month; the parallelism layout across those GPUs is not broken out (see "What's not disclosed" below).

Benchmark outcome

  • E-commerce benchmarks: ~25% improvement (English), ~30% improvement (non-English) vs the Llama 3.1 base.
  • General-domain NLU: ~1% degradation for the 70B variant (a small regression that is the accepted trade-off of continued pretraining, and the reason replay data is in the mix).

The specific benchmarks are not named in the blog post. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

What's not disclosed

  • Serving stack. No inference backend (vLLM / TGI / SGLang / Megatron / in-house?), no per-query latency, no QPS, no cost-per-token, no multi-region topology, no per-product-surface integration detail.
  • Parallelism degrees (DP, TP, PP values) across the 480 GPUs.
  • Model FLOPs Utilisation (MFU) — "more efficient than Llama 2" is unquantified.
  • Instruction-tuning + RLHF specifics (preference-data volume, PPO vs DPO vs variant, reward-model architecture, safety-eval harness).
  • Checkpoint / failure-recovery mechanics for the ~1-month 480-GPU run.
  • Classifier-based e-commerce-data extractor — architecture, training data, precision/recall unknown.
