Catastrophic forgetting¶
Definition¶
Catastrophic forgetting (or "catastrophic interference") is the failure mode in which a neural network, while being trained on a new task or distribution, loses its ability to perform tasks it previously learned. It is the canonical manifestation of the stability-plasticity trade-off in continual-learning, transfer-learning, and continued-pretraining regimes.
In the LLM context, catastrophic forgetting shows up when you take a capable base model and continue-pretrain it on new-domain data: the resulting model gains on the new domain but regresses on general-domain benchmarks — sometimes dramatically. If the regression is large, the domain-adapted model is no longer a general-purpose model, just a narrow one.
Canonical wiki reference¶
eBay's e-Llama training (2025-01-17) names catastrophic forgetting explicitly as the central challenge of its continued-pretraining recipe on top of Llama 3.1:
"The goal is to infuse e-commerce specific knowledge into the Llama 3.1 base models, without the models forgetting information they have learned in their original pretraining (an effect sometimes called 'catastrophic forgetting')." (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
The reported outcome after mitigation: ~25% English / ~30% non-English e-commerce-benchmark gain, with only ~1% general-domain NLU regression for the 70B model. The fact that eBay reports the general-domain regression number at all is the point — tracking forgetting as a first-class metric is the discipline that tells you whether your recipe is working.
How forgetting happens¶
Gradient descent on a new data distribution shifts weights toward the new minimum. The weights that encoded previously-learned capabilities are not explicitly protected — they're just parameters that happened to be in a good spot for the old task. Nothing in the loss landscape prevents them from moving.
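The mechanism is visible even in a toy model: fit a two-parameter linear regression on task A by gradient descent, then keep training on task B alone. Nothing anchors the task-A solution, so the weights drift to the task-B minimum and task-A error climbs back up. A minimal NumPy sketch (the tasks, dimensions, and step counts are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "tasks": noiseless linear regression with different target weights.
w_task_a = np.array([1.0, -2.0])
w_task_b = np.array([3.0, 0.5])
X = rng.normal(size=(64, 2))
y_a, y_b = X @ w_task_a, X @ w_task_b

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def gd(w, X, y, lr=0.1, steps=200):
    # Plain full-batch gradient descent on squared error.
    for _ in range(steps):
        w = w - lr * (2.0 / len(y)) * X.T @ (X @ w - y)
    return w

w = gd(np.zeros(2), X, y_a)      # learn task A
loss_a_before = mse(w, X, y_a)   # near zero: task A is solved
w = gd(w, X, y_b)                # continue training on task B only
loss_a_after = mse(w, X, y_a)    # large again: task A is forgotten
```

The second training run never sees task-A data, so no gradient term pushes the weights back toward the task-A solution — exactly the situation replay training is designed to prevent.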
Severity scales with:
- Size of the new data (more steps = more drift).
- Size of the learning rate (larger LR = bigger per-step weight movement).
- Distance between old and new distributions (further domain = more pressure to shift).
- Absence of replay (if the new data never contains examples resembling the old distribution, the old capability has no gradient reinforcing it).
This is why continued-pretraining recipes typically use a reduced learning rate (eBay: 10% of Llama 3.1's max LR) and mix in replay data from the original distribution.
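The recipe boils down to two knobs: scale down the base model's max LR, and sample each batch part-domain, part-general. A sketch of the batch-mixing half (the function name, document representation, and defaults are illustrative; `replay_ratio=0.5` corresponds to eBay's 1:1 mix):

```python
import random

def mixed_batch(domain_docs, general_docs, batch_size=8,
                replay_ratio=0.5, rng=None):
    """Sample one training batch with a fixed share of general-domain
    replay documents (replay_ratio=0.5 reproduces a 1:1 mix)."""
    rng = rng or random.Random()
    n_replay = round(batch_size * replay_ratio)
    batch = (rng.sample(general_docs, n_replay)
             + rng.sample(domain_docs, batch_size - n_replay))
    rng.shuffle(batch)
    return batch
```

The other half of the recipe is a single scalar: set the continued-pretraining max LR to a fraction of the base model's (eBay's small-scale sweep landed on 0.1x).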
Mitigations at LLM scale¶
- Replay training — include examples from the original pretraining distribution in the new mix. eBay ships a 1:1 general-to-domain sampling ratio with the general-domain half drawn from curated / publicly-available / open-source corpora chosen to resemble Llama 3.1's pretraining.
- Reduced learning rate. 10% of base max LR was eBay's small-scale-sweep optimum. Lower LR = smaller weight drift per step = less forgetting.
- Parameter-efficient methods (LoRA, adapters). Train a small additional module and freeze the base. Avoids catastrophic forgetting by construction but caps the depth of domain adaptation.
- Rehearsal / experience replay with explicitly stored previous examples. More common in RL and classical continual learning; less common at LLM scale, since the "old data" is the pretraining corpus and is typically impractical to re-mix in bulk. Curated replay is the practical approximation.
- Explicit regularisation toward base weights (e.g. EWC-style). Less common at frontier-LLM scale due to cost / complexity.
- Monitoring with general-domain benchmarks. Not a mitigation but a prerequisite — you can't tune the mix ratio if you aren't measuring the regression.
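For the EWC-style option above, the idea is a quadratic penalty that anchors each parameter to its pre-adaptation value θ*, weighted by a diagonal Fisher-information estimate F of how important that parameter was for the old capability: L(θ) = L_new(θ) + (λ/2) Σᵢ Fᵢ (θᵢ − θ*ᵢ)². A minimal NumPy sketch of the penalty and its gradient contribution (function names are illustrative, and the Fisher estimate is assumed given):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC-style quadratic penalty: anchors each parameter to its
    pre-adaptation value theta_star, weighted by the (diagonal)
    Fisher estimate of its importance for the old task."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

def regularised_grad(grad_new_task, theta, theta_star, fisher, lam=1.0):
    """Gradient of (new-task loss + EWC penalty) w.r.t. theta:
    the new-task gradient plus a pull back toward theta_star."""
    return grad_new_task + lam * fisher * (theta - theta_star)
```

High-Fisher parameters resist drift; low-Fisher parameters stay free to adapt — which is the stability-plasticity trade-off made explicit, and also why the method adds cost: the Fisher diagonal has to be estimated and stored at full parameter count.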
Caveats¶
- Some regression is the price of adaptation. The eBay 70B model regresses ~1% on general-domain NLU; the 8B may regress more (not separately reported). Zero regression is unlikely if the domain differs enough from the base distribution to warrant adaptation at all.
- "General-domain benchmark" is a single number that can hide uneven regression. A model may preserve MMLU-style factual QA while losing code-generation, or vice versa. eBay does not disclose per-benchmark breakdowns.
- Catastrophic forgetting is model-size dependent. Larger models are more robust to continued pretraining in practice — the parameter count provides more "room" for coexisting representations. This is why the eBay 70B's ~1% regression may be smaller than an 8B's under the same recipe.
- Replay data quality matters. Curated / high-quality general-domain replay is the countermeasure; noisy general-domain data can itself degrade the model.
Seen in¶
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — named as the central risk continued pretraining has to manage, with 1:1 replay-mix + 10%-base-LR as the mitigation recipe.
Related¶
- concepts/continued-pretraining — the training regime where catastrophic forgetting most commonly surfaces.
- concepts/replay-training — the countermeasure.
- concepts/training-serving-boundary
- systems/e-llama / systems/llama-3-1
- patterns/continued-pretraining-for-domain-adaptation