

Catastrophic forgetting

Definition

Catastrophic forgetting (or "catastrophic interference") is the failure mode in which a neural network, while being trained on a new task or distribution, loses its ability to perform tasks it previously learned. It is the canonical manifestation of the stability-plasticity trade-off in continual-learning, transfer-learning, and continued-pretraining regimes.

In the LLM context, catastrophic forgetting shows up when you take a capable base model and continue-pretrain it on new-domain data: the resulting model gains on the new domain but regresses on general-domain benchmarks, sometimes dramatically. If the regression is large enough, the domain-adapted model is no longer a general-purpose model, just a narrow one.

Canonical wiki reference

eBay's e-Llama training (2025-01-17) names catastrophic forgetting explicitly as the central challenge of its continued-pretraining recipe built on Llama 3.1:

"The goal is to infuse e-commerce specific knowledge into the Llama 3.1 base models, without the models forgetting information they have learned in their original pretraining (an effect sometimes called 'catastrophic forgetting')." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

The reported outcome after mitigation: ~25% English / ~30% non-English e-commerce-benchmark gain, with only ~1% general-domain NLU regression for the 70B model. The fact that eBay reports the general-domain regression number at all is the point — tracking forgetting as a first-class metric is the discipline that tells you whether your recipe is working.

How forgetting happens

Gradient descent on a new data distribution shifts weights toward the new minimum. The weights that encoded previously-learned capabilities are not explicitly protected — they're just parameters that happened to be in a good spot for the old task. Nothing in the loss landscape prevents them from moving.
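The mechanism can be shown end-to-end in a toy setting. The sketch below (synthetic data, illustrative hyperparameters, nothing from the eBay recipe) fits a linear model by gradient descent on task A, then continues training on task B with no replay: nothing anchors the task-A solution, so task-A loss collapses back to roughly the untrained level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "tasks": the same inputs, but targets generated by
# different true weight vectors (a stand-in for different distributions).
X = rng.normal(size=(200, 10))
w_a = rng.normal(size=10)   # "true" weights for task A
w_b = rng.normal(size=10)   # "true" weights for task B
y_a, y_b = X @ w_a, X @ w_b

def mse(w, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, y, lr=0.05, steps=300):
    # Plain full-batch gradient descent on squared error.
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

w = train(np.zeros(10), y_a)   # learn task A
loss_a_before = mse(w, y_a)    # near zero after convergence
w = train(w, y_b)              # continue training on task B only
loss_a_after = mse(w, y_a)     # task-A performance collapses
```

Shrinking `lr` or `steps` in the second `train` call, or mixing task-A examples back into the targets, reduces the gap between `loss_a_before` and `loss_a_after` — the same levers the severity list below describes.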

Severity scales with:

  • Size of the new data (more steps = more drift).
  • Size of the learning rate (larger LR = bigger per-step weight movement).
  • Distance between old and new distributions (further domain = more pressure to shift).
  • Absence of replay (if the new data never contains examples resembling the old distribution, the old capability has no gradient reinforcing it).

This is why continued-pretraining recipes typically use a reduced learning rate (eBay: 10% of Llama 3.1's max LR) and mix in replay data from the original distribution.
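The two levers combine into a very simple recipe skeleton. The sketch below is a hypothetical illustration, not eBay's implementation: the helper name `mixed_batch` and the base-LR constant are invented, but the 1:1 general-to-domain ratio and the 10%-of-base-max-LR setting follow the numbers quoted above.

```python
import random

def mixed_batch(domain_docs, replay_docs, batch_size, replay_fraction=0.5,
                rng=random):
    """Sample a continued-pretraining batch mixing new-domain and replay data.

    replay_fraction=0.5 gives a 1:1 general-to-domain sampling ratio,
    matching the ratio reported in the eBay recipe.
    """
    n_replay = int(batch_size * replay_fraction)
    batch = (rng.choices(replay_docs, k=n_replay)
             + rng.choices(domain_docs, k=batch_size - n_replay))
    rng.shuffle(batch)
    return batch

BASE_MAX_LR = 3e-4          # illustrative placeholder, not Llama 3.1's actual value
cpt_lr = 0.1 * BASE_MAX_LR  # 10% of the base model's max LR, per the eBay sweep
```

The replay half gives the old capabilities a gradient signal every step; the reduced LR bounds how far any single step can move the weights.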

Mitigations at LLM scale

  1. Replay training — include examples from the original pretraining distribution in the new mix. eBay ships a 1:1 general-to-domain sampling ratio with the general-domain half drawn from curated / publicly-available / open-source corpora chosen to resemble Llama 3.1's pretraining.
  2. Reduced learning rate. 10% of base max LR was eBay's small-scale-sweep optimum. Lower LR = smaller weight drift per step = less forgetting.
  3. Parameter-efficient methods (LoRA, adapters). Train a small additional module and freeze the base. Avoids catastrophic forgetting by construction but caps the depth of domain adaptation.
  4. Rehearsal / experience replay with explicit previous examples (more common in RL / classical continual-learning; less common at LLM scale since the "old data" is the pretraining corpus, which is typically impractical to re-mix in bulk — curated replay is a practical approximation).
  5. Explicit regularisation toward base weights (e.g. EWC-style). Less common at frontier-LLM scale due to cost / complexity.
  6. Monitoring with general-domain benchmarks. Not a mitigation but a prerequisite — you can't tune the mix ratio if you aren't measuring the regression.
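The "by construction" claim for mitigation 3 is easy to see in miniature. The numpy sketch below (illustrative shapes and initialisation, not any particular library's API) shows the LoRA structure: the frozen base weight W is never updated, only the low-rank factors A and B are, so whatever W encodes cannot be overwritten. With B zero-initialised, the adapted model starts out exactly equal to the base.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hidden size and LoRA rank (illustrative)

W = rng.normal(size=(d, d))          # frozen base weight: never receives gradients
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialised

def lora_forward(x):
    # Base path plus a low-rank correction; real LoRA also scales
    # the correction by alpha / r.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
out = lora_forward(x)
# With B zero-initialised, the adapted model is identical to the base,
# and only B and A ever move during training.
```

The trade-off named above follows from the same structure: the update lives in a rank-r subspace, which caps how much new-domain behaviour the adapter can absorb.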

Caveats

  • Some regression is the price of adaptation. The eBay 70B model regresses ~1% on general-domain NLU; the 8B may regress more (not separately reported). Zero regression is unlikely if the domain is different enough from the base distribution to warrant adaptation at all.
  • "General-domain benchmark" is a single number that can hide uneven regression. A model may preserve MMLU-style factual QA while losing code-generation, or vice versa. eBay does not disclose per-benchmark breakdowns.
  • Catastrophic forgetting is model-size dependent. Larger models are more robust to continued pretraining in practice — the parameter count provides more "room" for coexisting representations. This is why the eBay 70B's ~1% regression may be smaller than an 8B's under the same recipe.
  • Replay data quality matters. Curated / high-quality general-domain replay is the countermeasure; noisy general-domain data can itself degrade the model.
