

Replay training

Definition

Replay training (also called "rehearsal", or "experience replay" in a related reinforcement-learning sense) is the technique of including examples from a model's previous training distribution in the new training mix, to prevent the model from losing previously learned capabilities while it is trained on a new domain or task.

In the continued-pretraining-of-LLMs context, replay = mix general-domain data alongside new-domain data. The general-domain data acts as a gradient signal reinforcing the old behaviour while the new-domain data teaches new knowledge.

Canonical wiki reference

eBay's e-Llama continued-pretraining recipe names replay explicitly as the countermeasure to catastrophic forgetting:

"To achieve this, we include some examples in our training data mixture which are close to the examples the models have originally been pre-trained on. This 'replay' has been shown to help a model retain previously learned information. The aforementioned examples are drawn from a mixture of curated, publicly available and open-source datasets, and smaller but more high quality datasets. We also include 10% non-English language general domain data to further enhance the model's multilingual capabilities." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

The eBay ratio: 1:1 general-to-domain, with general-side composition = curated / publicly available / open-source + smaller high-quality sets + 10% non-English.

How replay works

Each training step updates model weights via gradient descent on the mini-batch. If the mini-batch contains only new-domain data, gradients point exclusively toward the new distribution's loss minimum — nothing reinforces the old distribution. Weights drift; previously-learned capability erodes.

With replay, each step (or each batch, depending on mixing strategy) contains a blend: part of the gradient signal pulls toward new-domain knowledge, part reinforces existing capability. The trade-off lives in the mix ratio:

  • Too little replay → catastrophic forgetting; large general-domain regression.
  • Too much replay → the domain-specific signal is diluted; the model doesn't learn enough new knowledge to justify the exercise.
  • 1:1 (eBay's optimum for e-commerce) → the balance that surfaced from a small-scale sweep on the Llama 3.1 base model.
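The per-step blending described above can be sketched as a batch sampler. A minimal sketch, assuming two example streams; the `mixed_batch` helper and stream names are hypothetical, not from the eBay recipe:

```python
import random

def mixed_batch(domain_stream, replay_stream, batch_size, replay_ratio=0.5, rng=None):
    """Draw one mini-batch with `replay_ratio` of its examples taken from the
    replay (general-domain) stream and the rest from the new-domain stream.
    At replay_ratio=0.5 this is the 1:1 mix: half of every batch's gradient
    signal reinforces old behaviour, half teaches the new domain."""
    rng = rng or random.Random(0)
    n_replay = round(batch_size * replay_ratio)
    batch = [next(replay_stream) for _ in range(n_replay)]
    batch += [next(domain_stream) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)  # interleave replay and domain examples within the batch
    return batch
```

Setting `replay_ratio` below ~0.5 moves toward the "too little replay" failure mode; above it, toward dilution of the domain signal.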

Replay data composition matters

Not all general-domain data works equally well. The key property:

"examples... which are close to the examples the models have originally been pre-trained on"

I.e. replay data should resemble the base model's pretraining distribution, not just "be general-domain." Random general-text scrapes may induce slightly different representations than the base model's own; curated / publicly available / open-source datasets chosen to resemble the base's curated corpus are what actually reinforce the existing weights.

This is also why eBay pairs replay with an e-commerce classifier that mines domain-specific examples from a larger open-source dataset — expanding the domain side of the mix without over-fitting to the stylistic narrowness of pure eBay listings/reviews.
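The eBay source does not describe the classifier's implementation; a minimal sketch of threshold-based mining, with a hypothetical stand-in scorer in place of a trained e-commerce classifier:

```python
def mine_domain_examples(corpus, score_fn, threshold=0.9):
    """Keep documents that a domain classifier scores at or above `threshold`.
    `score_fn` is a stand-in for a trained e-commerce classifier."""
    return [doc for doc in corpus if score_fn(doc) >= threshold]

def toy_score(doc):
    """Toy keyword scorer for illustration only; a real pipeline would use a
    trained classifier's probability output."""
    keywords = ("listing", "shipping", "buyer")
    return sum(k in doc.lower() for k in keywords) / len(keywords)
```

Mining from a large open-source corpus this way grows the domain side with stylistically varied text, rather than only in-house listings/reviews.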

Design knobs

  • Ratio. 1:1 is the eBay optimum at 70B / Llama 3.1 / 1T tokens / e-commerce. Other domains and bases may differ.
  • Granularity. Per-batch mixing (each batch is blended) vs per-epoch (alternating epochs of new vs replay). Per-batch is more common at LLM scale; per-epoch is lower-overhead but has sharper forgetting transients.
  • Source curation. Resemble-the-base is the principle. For Llama-class continued pretraining, corpora like RedPajama, The Pile, FineWeb or similar curated open-source re-collections serve as tractable approximations of the base's pretraining distribution.
  • Multilingual share. 10% non-English in eBay's mix preserves / enhances multilingual capability — side objective separate from the domain goal.
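The knobs above compose into a token allocation. A sketch that reads the 10% non-English share as a fraction of the general side (per the composition summary above); both the function name and that reading are assumptions:

```python
def mix_allocation(total_tokens, replay_ratio=0.5, non_english_share=0.10):
    """Split a token budget across the training mix: `replay_ratio` of all
    tokens go to general-domain replay, and `non_english_share` of that
    replay slice is non-English general-domain data."""
    general = total_tokens * replay_ratio
    return {
        "domain": total_tokens - general,
        "general_english": general * (1 - non_english_share),
        "general_non_english": general * non_english_share,
    }
```

At eBay's defaults, a 1M-token budget splits into 500k domain, 450k English general, and 50k non-English general tokens.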

Caveats

  • Replay is not a cure. Even with 1:1 replay, eBay reports ~1% general-domain regression on the 70B. Replay reduces forgetting; it does not eliminate it.
  • Replay costs steps. At a fixed training budget, every replay token is a token not spent on domain data. If the domain fine-tune needs 100B tokens of truly new signal, you need to budget 200B tokens total at a 1:1 ratio.
  • Base-distribution access is typically approximate. For a model like Llama 3.1, the exact pretraining corpus isn't public; teams approximate it with open-source curations that resemble it.
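The budgeting caveat generalizes to any replay ratio. A small helper (name hypothetical) that inverts the ratio to get the total budget:

```python
def total_budget(new_domain_tokens, replay_ratio=0.5):
    """Total tokens to train so that (1 - replay_ratio) of them are truly new
    domain tokens. At 1:1 replay (replay_ratio=0.5) the budget doubles."""
    return new_domain_tokens / (1 - replay_ratio)
```

For example, 100B domain tokens at 1:1 replay require a 200B total budget; at a lighter 1:3 replay-to-domain mix (replay_ratio=0.25), 90B domain tokens require 120B total.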
