CONCEPT Cited by 2 sources

Transfer learning¶

Definition¶

Transfer learning is the machine-learning practice of taking a model (or model components) trained on one problem and data distribution and reusing its learned representations to solve a related problem on a different, often data-scarcer, distribution, rather than initializing the second model with random weights and training from scratch.

The reused artifact can be:

Pre-trained weights of the entire network (e.g., fine-tuning a foundation model for a downstream task).
A subset of layers — most commonly the early embedding / feature-extraction layers — while later task-specific heads are reinitialized and trained.
Continued pretraining on a domain corpus — a milder form where the same objective is continued on different data (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development).
Knowledge distillation from a teacher to a student — the teacher transfers learned probabilities, not weights.
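
The distillation item can be made concrete with a small sketch. It is not taken from either cited source: it assumes the common formulation in which the student is trained to match the teacher's temperature-softened output distribution through a KL divergence. The function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Only probabilities flow from teacher to student; no weights are copied.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the sense in which probabilities, not weights, are transferred.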

Why it matters¶

Transfer learning is the dominant practice in modern ML because it has two structural advantages over from-scratch training:

Data efficiency: the target task can converge with far less labeled data, because the model isn't learning the primitives of the input space from zero.
Compute efficiency: training time is dramatically shorter when you start from a useful initialization.

In practice, the cost of training a foundation-scale model from scratch is prohibitive for most teams, so the de facto modern recipe is "start from a strong open base, transfer to your domain".

Subset: domain adaptation¶

Domain adaptive learning (or domain adaptation) is a specific sub-case of transfer learning where:

The task is the same (e.g., CTR prediction is CTR prediction in both domains).
The input distribution differs (e.g., Instacart Marketplace catalog vs partner O&O site catalog).

This contrasts with the broader transfer-learning case where both task and distribution may differ.

Key wiki canonicalisations¶

Instacart Carrot Ads — Domain Adaptive Learning for pCTR¶

(Source: sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning)

Instacart's framing of transfer learning, applied to retail-media ad CTR prediction:

"At a high level, Domain Adaptive Learning is a subset of transfer learning. It focuses on transferring knowledge gained from solving a problem in a data-rich environment (source domain) to improve performance in a related, often data-scarce environment (target domain)."

Mechanism: shared shopping-context-pre-trained embedding layers, feature transfer, fine-tuning of partner-specific layers, and reuse of dense representations from the wide-and-deep pCTR backbone.
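
A toy sketch of the "frozen shared embeddings plus per-partner fine-tuned layers" idea follows. It is an illustration under loud assumptions, not Instacart's wide-and-deep model: the embedding table, its two-dimensional vectors, the item names, and the `PartnerHead` class are all hypothetical, and the partner-specific layers are reduced to a single logistic head.

```python
import math
import random

# Hypothetical frozen embeddings, standing in for layers pre-trained on the source domain.
SHARED_EMBEDDINGS = {
    "milk": [0.9, 0.1],
    "bread": [0.8, 0.2],
    "batteries": [0.1, 0.9],
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class PartnerHead:
    """Small logistic head trained per partner on top of the frozen embeddings."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.b = 0.0

    def predict(self, item):
        x = SHARED_EMBEDDINGS[item]  # looked up, never updated
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, examples, lr=0.5, epochs=200):
        """Plain SGD on log loss; only the head's weights and bias change."""
        for _ in range(epochs):
            for item, clicked in examples:
                x = SHARED_EMBEDDINGS[item]
                err = self.predict(item) - clicked  # d(log loss) / d(logit)
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err


# Usage: the partner only has click data for two items.
head = PartnerHead(dim=2)
head.fit([("milk", 1), ("batteries", 0)])
```

In this toy setup the partner never logs a click on "bread", yet the head scores it as likely to be clicked because its shared embedding sits close to "milk". That is a small-scale analogue of the source domain contributing signal the target never sees.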

Counter-intuitive property explicitly observed: the domain-adapted model outperforms training from scratch on the target domain even when the target has enough data to converge on its own — because the source-domain (Instacart Marketplace first-party data) contributes signal the target never sees.

eBay e-Llama — continued pretraining on domain data¶

(Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

A different transfer-learning recipe applied to LLMs: take Llama 3.1, continue-pretrain on a 1:1 mix of e-commerce domain data plus general replay data with carefully tuned hyperparameters, then post-train (instruction tuning + RLHF). See patterns/continued-pretraining-for-domain-adaptation for the end-to-end recipe.
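
The 1:1 replay mix can be sketched as a simple data interleaver. This is an assumption-laden illustration rather than eBay's pipeline: the source does not say here whether the ratio is counted in documents or tokens, so the sketch counts documents, and the function name and the choice to cycle the shorter stream are hypothetical.

```python
import itertools
import random


def mix_one_to_one(domain_docs, replay_docs, seed=0):
    """Yield (origin, document) pairs so domain and replay data alternate 1:1.

    Assumption for this sketch: the ratio is per document, and the shorter
    stream is cycled until the longer one has been seen once.
    """
    domain = list(domain_docs)
    replay = list(replay_docs)
    if not domain or not replay:
        raise ValueError("both streams must be non-empty")
    rng = random.Random(seed)
    rng.shuffle(domain)
    rng.shuffle(replay)
    domain_iter = itertools.cycle(domain)
    replay_iter = itertools.cycle(replay)
    for _ in range(max(len(domain), len(replay))):
        yield ("domain", next(domain_iter))
        yield ("replay", next(replay_iter))
```

Keeping general replay data in every batch is the mechanism that counters the loss of general-domain capability during continued pretraining.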

Comparison of transfer-learning recipes¶

Recipe	Granularity	When to use
Continued pretraining (patterns/continued-pretraining-for-domain-adaptation)	All weights, autoregressive LM	Significant new domain knowledge; foundation-model scale; can afford weeks of compute
Shared-embedding warm-start + fine-tune later layers (patterns/cross-domain-warm-start-via-shared-embeddings)	Some layers reused, rest fine-tuned	Multi-tenant/partner deployment where source-domain signal helps every target
LoRA / parameter-efficient fine-tune (concepts/lora-low-rank-adaptation)	Small low-rank adapters	Many lightweight per-customer adapters; foundation model frozen
Supervised fine-tuning (concepts/supervised-fine-tuning)	All weights, supervised loss	Behaviour change from labeled examples
Knowledge distillation (patterns/teacher-student-model-compression)	Teacher → student	Compress capability into a smaller serving model
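
To make the "small low-rank adapters" row concrete, the sketch below compares trainable parameter counts and shows the adapter forward pass. It follows the standard LoRA formulation (a frozen weight matrix W plus a trainable low-rank product B·A); the function names, the pure-Python matrix helper, and the example dimensions are hypothetical.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


# Example dimensions: a 4096 x 4096 projection with a rank-8 adapter.
print(full_finetune_params(4096, 4096))  # 16777216
print(lora_params(4096, 4096, 8))        # 65536
```

Because each adapter is a small fraction of the base matrix, many per-customer adapters can share one frozen foundation model. With B initialized to zeros, the adapted layer starts out identical to the base layer.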

Failure modes¶

Negative transfer — when source and target domains differ in ways the model can't reconcile, transferred knowledge can degrade target performance below from-scratch training. Mitigated by careful source/target alignment and (in production) human-in-the-loop schema/distribution verification.
Catastrophic forgetting — continued training on new data causes the model to lose general-domain capabilities. Mitigated by replay-training mixes (eBay's 1:1 ratio) and lower learning rates.
Distribution-shift drift — source-domain pre-training becomes stale relative to evolving target distributions; a detection-and-retrain loop is needed.
Embedding incompatibility — fine-tuning later layers can drift far enough that the early shared embeddings become a poor fit, eroding the transfer benefit.
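
One simple guard against negative transfer is to keep a from-scratch baseline and ship the transferred model only if it beats that baseline. The sketch below is a hypothetical gate, not a procedure described in either source; the function name, the `min_gain` margin, and the metric values in the usage lines are illustrative.

```python
def choose_model(transfer_metric, scratch_metric, min_gain=0.0, higher_is_better=True):
    """Return "transfer" only if the transferred model beats the from-scratch baseline.

    min_gain is a hypothetical safety margin on the evaluation metric.
    """
    gain = transfer_metric - scratch_metric
    if not higher_is_better:
        gain = -gain  # for loss-like metrics, a lower value is an improvement
    return "transfer" if gain > min_gain else "scratch"


# Illustrative values only, e.g. AUC on a held-out target-domain set.
print(choose_model(0.74, 0.71))  # transfer
print(choose_model(0.68, 0.71))  # scratch
```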

Operating cadences (a recurring transfer-learning question)¶

A real production question for any transfer-learning system is how often each layer of reuse is refreshed:

Pre-trained shared embeddings — refreshed rarely (months / quarters); expensive to re-train; cross-tenant amortized.
Per-target fine-tuned heads — refreshed often (days / weeks); cheap; per-tenant.
Per-target feature configurations — refreshed continuously as feature availability and distributions change.

The split between slow and fast cadences is itself a load-bearing property: it amortises the heavy work across all targets while keeping the target-specific tuning fast and cheap.
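
The refresh cadences above can be sketched as a small scheduling check. The interval values are hypothetical picks from within the rough ranges given (months or quarters versus days or weeks), and the component names are illustrative. Feature configurations are left out because they are described as refreshed continuously rather than on a schedule.

```python
from datetime import date, timedelta

# Hypothetical refresh intervals; the text gives only rough ranges.
REFRESH_INTERVAL = {
    "shared_embeddings": timedelta(days=90),  # rare, expensive, cross-tenant
    "partner_head": timedelta(days=7),        # frequent, cheap, per-tenant
}


def components_due(last_refreshed, today):
    """Return the components whose refresh interval has elapsed, in sorted order."""
    return sorted(
        name
        for name, interval in REFRESH_INTERVAL.items()
        if today - last_refreshed[name] >= interval
    )


last = {"shared_embeddings": date(2025, 1, 1), "partner_head": date(2025, 1, 1)}
print(components_due(last, date(2025, 1, 10)))  # ['partner_head']
```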

Seen in¶

sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning — DAL = subset of transfer learning, applied to multi-partner pCTR cold-start; shared shopping-context embeddings + per-partner fine-tune; outperforms from-scratch training even when target has sufficient data.
sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — continued pretraining of Llama 3.1 on e-commerce domain data; replay-mix + hyperparameter sweep + post-training (IT + RLHF) recipe.

concepts/domain-adaptive-learning — the specific subset.
concepts/source-and-target-domain — the canonical framing for transfer-learning problems.
concepts/negative-transfer — the gating failure mode.
concepts/continued-pretraining / concepts/catastrophic-forgetting / concepts/replay-training
concepts/lora-low-rank-adaptation / concepts/supervised-fine-tuning
patterns/cross-domain-warm-start-via-shared-embeddings
patterns/continued-pretraining-for-domain-adaptation
patterns/teacher-student-model-compression
systems/instacart-carrot-ads / systems/instacart-carrot-ads-pctr-model / companies/instacart
concepts/cold-start — transfer learning is the dominant tool for new-item / new-user / new-domain cold-start in recsys.