CONCEPT Cited by 2 sources
Transfer learning¶
Definition¶
Transfer learning is the machine-learning practice of taking a model (or model components) trained on one problem in one data distribution and reusing its learned representations to solve a related problem in a different, often data-scarcer, distribution — rather than initializing the second model with random weights and training from scratch.
The reused artifact can be:
- Pre-trained weights of the entire network (e.g., fine-tuning a foundation model for a downstream task).
- A subset of layers — most commonly the early embedding / feature-extraction layers — while later task-specific heads are reinitialized and trained.
- Continued pretraining on a domain corpus — a milder form where the same objective is continued on different data (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development).
- Knowledge distillation from a teacher to a student: the student is trained to match the teacher's output probabilities (soft labels), so what transfers is behaviour rather than weights.
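The distillation case in the list above can be made concrete with a small loss function. This is a minimal sketch, assuming a plain KL divergence between temperature-softened teacher and student distributions; the logits, the temperature value, and the function names are illustrative and not taken from any cited source.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher
bad_student = [0.5, 1.0, 4.0]    # ranks the classes in the opposite order

loss_good = distillation_loss(teacher, good_student)
loss_bad = distillation_loss(teacher, bad_student)
```

The student that mimics the teacher's distribution gets the lower loss, and no teacher weights are ever copied: only output probabilities cross the boundary.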
Why it matters¶
Transfer learning is the dominant practice in modern ML because it has two structural advantages over from-scratch training:
- Data efficiency: the target task can converge with far less labeled data, because the model isn't learning the primitives of the input space from zero.
- Compute efficiency: training time is dramatically shorter when you start from a useful initialization.
In practice, the cost of training a foundation-scale model from scratch is prohibitive for most teams, so the de facto modern recipe is "start from a strong open base, transfer to your domain".
Subset: domain adaptation¶
Domain adaptive learning (or domain adaptation) is a specific sub-case of transfer learning where:
- The task is the same (e.g., CTR prediction is CTR prediction in both domains).
- The input distribution differs (e.g., Instacart Marketplace catalog vs partner O&O site catalog).
This contrasts with the broader transfer-learning case where both task and distribution may differ.
Key wiki canonicalisations¶
Instacart Carrot Ads — Domain Adaptive Learning for pCTR¶
(Source: sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning)
Instacart's framing of transfer learning, applied to retail-media ad CTR:
"At a high level, Domain Adaptive Learning is a subset of transfer learning. It focuses on transferring knowledge gained from solving a problem in a data-rich environment (source domain) to improve performance in a related, often data-scarce environment (target domain)."
Mechanism: shared shopping-context-pre-trained embedding layers, feature transfer, fine-tuning of partner-specific layers, and reuse of dense representations from the wide-and-deep pCTR backbone.
Counter-intuitive property explicitly observed: the domain-adapted model outperforms training from scratch on the target domain even when the target has enough data to converge on its own, because the source domain (Instacart Marketplace first-party data) contributes signal the target never sees.
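The general shape of "frozen shared embeddings plus a fine-tuned per-partner head" can be sketched in a few lines. This is not Instacart's implementation: the real model is a wide-and-deep pCTR network, whereas the sketch below assumes a single logistic head over a tiny hypothetical embedding table, with made-up items and click labels, purely to show which parameters move during fine-tuning.

```python
import math

# Hypothetical embeddings pre-trained on the source domain. They are read-only here.
SHARED_EMBEDDINGS = {
    "apple": [0.9, 0.1],
    "soap": [0.1, 0.8],
    "bread": [0.7, 0.3],
}

def predict(head, item):
    """Click probability from a logistic head applied to the frozen embedding."""
    w, b = head
    e = SHARED_EMBEDDINGS[item]
    z = sum(wi * ei for wi, ei in zip(w, e)) + b
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(head, data):
    """Mean binary cross-entropy over (item, label) pairs."""
    eps = 1e-12
    total = 0.0
    for item, y in data:
        p = predict(head, item)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fine_tune_head(data, epochs=200, lr=0.5):
    """SGD on the head only; the embedding table is never written to."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for item, y in data:
            err = predict((w, b), item) - y  # gradient of the log loss w.r.t. the logit
            e = SHARED_EMBEDDINGS[item]
            w = [wi - lr * err * ei for wi, ei in zip(w, e)]
            b -= lr * err
    return w, b

# Hypothetical target-domain (partner) click labels.
target_clicks = [("apple", 1), ("soap", 0), ("bread", 1)]
loss_before = log_loss(([0.0, 0.0], 0.0), target_clicks)
head = fine_tune_head(target_clicks)
loss_after = log_loss(head, target_clicks)
```

Only `w` and `b` change, so the same embedding table can back many partner-specific heads at once.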
eBay e-Llama — continued pretraining on domain data¶
(Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
A different transfer-learning recipe applied to LLMs: take Llama 3.1, continue-pretrain on a 1:1 mix of e-commerce domain data plus general replay data with carefully tuned hyperparameters, then post-train (instruction tuning + RLHF). See patterns/continued-pretraining-for-domain-adaptation for the end-to-end recipe.
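A 1:1 domain/replay mix can be illustrated with a simple interleaving generator. This is a sketch under stated assumptions: it measures the ratio per example, whereas a production LLM pipeline may measure it in tokens, and the corpora, function name, and parameters are hypothetical rather than eBay's.

```python
import itertools

def mixed_stream(domain_examples, replay_examples, domain_per_cycle=1, replay_per_cycle=1):
    """Yield (source, example) pairs, interleaving two corpora at a fixed ratio."""
    domain_iter = itertools.cycle(domain_examples)
    replay_iter = itertools.cycle(replay_examples)
    while True:
        for _ in range(domain_per_cycle):
            yield "domain", next(domain_iter)
        for _ in range(replay_per_cycle):
            yield "replay", next(replay_iter)

# Hypothetical corpora standing in for e-commerce text and general replay text.
domain = ["listing title A", "listing title B", "listing title C"]
replay = ["general text X", "general text Y"]

batch = list(itertools.islice(mixed_stream(domain, replay), 8))
counts = {"domain": 0, "replay": 0}
for source, _ in batch:
    counts[source] += 1
```

Changing `domain_per_cycle` and `replay_per_cycle` gives other ratios; the smaller corpus is simply cycled, which is a simplification a real data loader would likely replace with shuffled sampling.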
Comparison of transfer-learning recipes¶
| Recipe | Granularity | When to use |
|---|---|---|
| Continued pretraining (patterns/continued-pretraining-for-domain-adaptation) | All weights, autoregressive LM | Significant new domain knowledge; foundation-model scale; can afford weeks of compute |
| Shared-embedding warm-start + fine-tune later layers (patterns/cross-domain-warm-start-via-shared-embeddings) | Some layers reused, rest fine-tuned | Multi-tenant/partner deployment where source-domain signal helps every target |
| LoRA / parameter-efficient fine-tune (concepts/lora-low-rank-adaptation) | Small low-rank adapters | Many lightweight per-customer adapters; foundation model frozen |
| Supervised fine-tuning (concepts/supervised-fine-tuning) | All weights, supervised loss | Behaviour change from labeled examples |
| Knowledge distillation (patterns/teacher-student-model-compression) | Teacher → student | Compress capability into a smaller serving model |
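To give a sense of the "small low-rank adapters" granularity in the LoRA row, the arithmetic below compares trainable parameter counts for one weight matrix. It is a sketch assuming the standard LoRA factorisation, in which the update to a `d_out x d_in` matrix is the product of a `d_out x rank` matrix and a `rank x d_in` matrix; the dimensions and rank are hypothetical.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer shape and adapter rank.
d = 4096
rank = 8

full = full_params(d, d)            # 16_777_216
adapter = lora_params(d, d, rank)   # 65_536
ratio = adapter / full              # 0.00390625, i.e. 1/256 of the full matrix
```

At this shape each adapter trains under half a percent of the layer's parameters, which is why keeping one adapter per customer on top of a frozen base is practical.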
Failure modes¶
- Negative transfer — when source and target domains differ in ways the model can't reconcile, transferred knowledge can degrade target performance below from-scratch training. Mitigated by careful source/target alignment and (in production) human-in-the-loop schema/distribution verification.
- Catastrophic forgetting — continued training on new data causes the model to lose general-domain capabilities. Mitigated by replay-training mixes (eBay's 1:1 ratio) and lower learning rates.
- Distribution-shift drift — source-domain pre-training becomes stale relative to evolving target distributions; a detection-and-retrain loop is needed.
- Embedding incompatibility — fine-tuning later layers can drift far enough that the early shared embeddings become a poor fit, eroding the transfer benefit.
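Negative transfer is only visible if the transferred model is compared against a from-scratch baseline on target-domain holdout data. The gate below is a minimal sketch of that comparison; the function name, the `min_gain` margin, and the metric values are hypothetical, and the sources do not describe a specific gating rule.

```python
def choose_model(transfer_metric, scratch_metric, min_gain=0.0, higher_is_better=True):
    """Pick the transferred model only if it beats the from-scratch baseline by min_gain."""
    gain = transfer_metric - scratch_metric
    if not higher_is_better:
        gain = -gain  # for loss-like metrics, a lower value is an improvement
    return "transfer" if gain > min_gain else "scratch"

# Hypothetical holdout AUC values on the target domain.
helped = choose_model(transfer_metric=0.74, scratch_metric=0.71)
hurt = choose_model(transfer_metric=0.68, scratch_metric=0.71)

# Same gate applied to a loss-like metric, where lower is better.
helped_by_loss = choose_model(0.30, 0.35, higher_is_better=False)
```

A non-zero `min_gain` guards against shipping a transferred model whose advantage is within evaluation noise.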
Operating cadences (a recurring transfer-learning question)¶
A real production question for any transfer-learning system is how often each layer of reuse is refreshed:
- Pre-trained shared embeddings — refreshed rarely (months / quarters); expensive to re-train; cross-tenant amortized.
- Per-target fine-tuned heads — refreshed often (days / weeks); cheap; per-tenant.
- Per-target feature configurations — refreshed continuously as feature availability and distributions change.
Separating the slow, shared cadence from the fast, per-target cadences is itself a load-bearing property: it amortises the heavy work across all targets while keeping the target-specific tuning fast and cheap.
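The cadences above can be encoded as a small refresh check. This is a sketch only: the text gives rough orders of magnitude, so the concrete intervals, component names, and dates below are assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical refresh intervals, one per layer of reuse.
CADENCE = {
    "shared_embeddings": timedelta(days=90),  # rare, expensive, shared across tenants
    "target_head": timedelta(days=7),         # frequent, cheap, per tenant
    "feature_config": timedelta(days=1),      # treated as daily in this sketch
}

def components_due(last_refreshed, today):
    """Return the names of components whose refresh interval has elapsed, sorted."""
    return sorted(
        name
        for name, refreshed_on in last_refreshed.items()
        if today - refreshed_on >= CADENCE[name]
    )

# Hypothetical refresh history for one target.
last = {
    "shared_embeddings": date(2025, 1, 1),
    "target_head": date(2025, 2, 1),
    "feature_config": date(2025, 2, 9),
}
due = components_due(last, today=date(2025, 2, 10))
```

With these dates the shared embeddings are 40 days old and stay untouched, while the per-target head and feature configuration are both due.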
Seen in¶
- sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning: DAL = subset of transfer learning, applied to multi-partner pCTR cold-start; shared shopping-context embeddings + per-partner fine-tune; outperforms from-scratch training even when target has sufficient data.
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development: continued pretraining of Llama 3.1 on e-commerce domain data; replay-mix + hyperparameter sweep + post-training (IT + RLHF) recipe.
Related¶
- concepts/domain-adaptive-learning — the specific subset.
- concepts/source-and-target-domain — the canonical framing for transfer-learning problems.
- concepts/negative-transfer — the gating failure mode.
- concepts/continued-pretraining / concepts/catastrophic-forgetting / concepts/replay-training
- concepts/lora-low-rank-adaptation / concepts/supervised-fine-tuning
- patterns/cross-domain-warm-start-via-shared-embeddings
- patterns/continued-pretraining-for-domain-adaptation
- patterns/teacher-student-model-compression
- systems/instacart-carrot-ads / systems/instacart-carrot-ads-pctr-model / companies/instacart
- concepts/cold-start — transfer learning is the dominant tool for new-item / new-user / new-domain cold-start in recsys.