
CONCEPT Cited by 2 sources

Transfer learning

Definition

Transfer learning is the machine-learning practice of taking a model (or model components) trained on one problem in one data distribution and reusing its learned representations to solve a related problem in a different, often data-scarcer, distribution — rather than initializing the second model with random weights and training from scratch.

The reused artifact can be:

  • Pre-trained weights of the entire network (e.g., fine-tuning a foundation model for a downstream task).
  • A subset of layers — most commonly the early embedding / feature-extraction layers — while later task-specific heads are reinitialized and trained.
  • Continued pretraining on a domain corpus — a milder form where the same objective is continued on different data (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development).
  • Knowledge distillation from a teacher to a student: the student is trained to match the teacher's output probabilities, so what transfers is learned behaviour rather than weights.
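
The distillation item can be made concrete with a minimal sketch. It assumes the common formulation in which the student minimises the KL divergence between temperature-softened teacher and student distributions; the function names and the default temperature are illustrative assumptions, not taken from either cited source.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions (hypothetical helper)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is often combined with an ordinary supervised loss on labeled data.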

Why it matters

Transfer learning is the dominant practice in modern ML because it has two structural advantages over from-scratch training:

  1. Data efficiency: the target task can converge with far less labeled data, because the model isn't learning the primitives of the input space from zero.
  2. Compute efficiency: training time is dramatically shorter when you start from a useful initialization.

In practice, training a foundation-scale model from scratch is prohibitively expensive for most teams, so the de facto recipe is "start from a strong open base, transfer to your domain".

Subset: domain adaptation

Domain adaptive learning (or domain adaptation) is a specific sub-case of transfer learning where:

  • The task is the same (e.g., CTR prediction is CTR prediction in both domains).
  • The input distribution differs (e.g., Instacart Marketplace catalog vs. a partner's owned-and-operated (O&O) site catalog).

This contrasts with the broader transfer-learning case where both task and distribution may differ.

Key wiki canonicalisations

Instacart Carrot Ads — Domain Adaptive Learning for pCTR

(Source: sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning)

Instacart's framing of transfer learning, applied to retail-media ad CTR:

"At a high level, Domain Adaptive Learning is a subset of transfer learning. It focuses on transferring knowledge gained from solving a problem in a data-rich environment (source domain) to improve performance in a related, often data-scarce environment (target domain)."

Mechanism: embedding layers pre-trained on shopping context and shared across domains, feature transfer, fine-tuning of partner-specific layers, and reuse of dense representations from the wide-and-deep pCTR backbone.

Counter-intuitive property explicitly observed: the domain-adapted model outperforms training from scratch on the target domain even when the target has enough data to converge on its own, because the source domain (Instacart Marketplace first-party data) contributes signal the target never sees.
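
A minimal sketch of the warm-start idea, not Instacart's implementation: parameters are plain dicts of float lists, the names `embedding.item`, `head.ctr`, and `head.partner_ctr` are hypothetical, and freezing the shared embeddings during target training is an assumption (the source only says partner-specific layers are fine-tuned).

```python
import copy
import random

def warm_start(source_params, shared_prefixes, head_sizes, seed=0):
    """Copy shared layers from the source model and reinitialize target heads.

    Returns (target_params, frozen_names).
    """
    rng = random.Random(seed)
    params, frozen = {}, set()
    for name, value in source_params.items():
        if name.startswith(tuple(shared_prefixes)):
            params[name] = copy.deepcopy(value)  # reuse learned representation
            frozen.add(name)
    for name, size in head_sizes.items():
        params[name] = [rng.gauss(0.0, 0.01) for _ in range(size)]
    return params, frozen

def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD update, skipping frozen (shared) parameters."""
    for name, grad in grads.items():
        if name in frozen:
            continue
        params[name] = [w - lr * g for w, g in zip(params[name], grad)]

source = {"embedding.item": [0.5, -0.2, 0.1], "head.ctr": [0.3, 0.3]}
target, frozen = warm_start(source, ["embedding."], {"head.partner_ctr": 2})
grads = {"embedding.item": [1.0, 1.0, 1.0], "head.partner_ctr": [1.0, 1.0]}
sgd_step(target, grads, frozen)  # only head.partner_ctr moves
```

The source-domain head (`head.ctr`) is deliberately not carried over; only the shared representation is.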

eBay e-Llama — continued pretraining on domain data

(Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)

A different transfer-learning recipe applied to LLMs: take Llama 3.1, continue-pretrain on a 1:1 mix of e-commerce domain data plus general replay data with carefully tuned hyperparameters, then post-train (instruction tuning + RLHF). See patterns/continued-pretraining-for-domain-adaptation for the end-to-end recipe.
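
The 1:1 domain-to-replay mix can be sketched as a batch sampler. This is an illustration only: the source does not say whether the ratio is measured in tokens, documents, or batches, so mixing per document within each batch is an assumption, and `mixed_batches` is a hypothetical helper.

```python
import random

def mixed_batches(domain_docs, replay_docs, batch_size, domain_fraction=0.5, seed=0):
    """Yield shuffled batches with a fixed share of domain vs. replay documents."""
    rng = random.Random(seed)
    n_domain = round(batch_size * domain_fraction)
    n_replay = batch_size - n_domain
    while True:
        batch = [("domain", d) for d in rng.choices(domain_docs, k=n_domain)]
        batch += [("replay", d) for d in rng.choices(replay_docs, k=n_replay)]
        rng.shuffle(batch)  # avoid a fixed domain-then-replay ordering
        yield batch

batch = next(mixed_batches(["listing-1", "listing-2"], ["web-1", "web-2", "web-3"], 8))
```

Keeping general replay data in every batch is the mechanism that counters forgetting of general-domain capabilities while the domain data is learned.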

Comparison of transfer-learning recipes

| Recipe | Granularity | When to use |
| --- | --- | --- |
| Continued pretraining (patterns/continued-pretraining-for-domain-adaptation) | All weights, autoregressive LM | Significant new domain knowledge; foundation-model scale; can afford weeks of compute |
| Shared-embedding warm-start + fine-tune later layers (patterns/cross-domain-warm-start-via-shared-embeddings) | Some layers reused, rest fine-tuned | Multi-tenant/partner deployment where source-domain signal helps every target |
| LoRA / parameter-efficient fine-tune (concepts/lora-low-rank-adaptation) | Small low-rank adapters | Many lightweight per-customer adapters; foundation model frozen |
| Supervised fine-tuning (concepts/supervised-fine-tuning) | All weights, supervised loss | Behaviour change from labeled examples |
| Knowledge distillation (patterns/teacher-student-model-compression) | Teacher → student | Compress capability into a smaller serving model |
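
To show why the LoRA row suits "many lightweight adapters", here is a small sketch of the standard low-rank update, where the effective weight is W + (alpha / r) · B·A with W frozen. The layer sizes and helper names are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tune of one matrix vs. a rank-r adapter."""
    return d_out * d_in, rank * (d_out + d_in)

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, with B: d_out x r and A: r x d_in."""
    scale = alpha / len(a)  # len(a) is the rank r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

full, adapter = lora_param_counts(4096, 4096, 8)  # 16777216 vs. 65536
```

With B initialized to zeros the adapter starts as a no-op, so training begins exactly at the frozen base model's behaviour.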

Failure modes

  • Negative transfer — when source and target domains differ in ways the model can't reconcile, transferred knowledge can degrade target performance below from-scratch training. Mitigated by careful source/target alignment and (in production) human-in-the-loop schema/distribution verification.
  • Catastrophic forgetting — continued training on new data causes the model to lose general-domain capabilities. Mitigated by replay-training mixes (eBay's 1:1 ratio) and lower learning rates.
  • Distribution-shift drift — source-domain pre-training becomes stale relative to evolving target distributions; a detection-and-retrain loop is needed.
  • Embedding incompatibility — fine-tuning later layers can drift far enough that the early shared embeddings become a poor fit, eroding the transfer benefit.
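
For the distribution-shift item, one possible detection signal is sketched here. Neither source names a method; the population stability index (PSI) is swapped in as a commonly used choice, and the 0.2 retrain threshold is a rule-of-thumb assumption rather than a recommendation from the sources.

```python
import math

def psi(reference_counts, current_counts, eps=1e-6):
    """Population stability index between two histograms over the same bins."""
    ref_total = sum(reference_counts)
    cur_total = sum(current_counts)
    score = 0.0
    for ref, cur in zip(reference_counts, current_counts):
        p_ref = max(ref / ref_total, eps)  # floor avoids log(0) on empty bins
        p_cur = max(cur / cur_total, eps)
        score += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return score

def needs_retrain(reference_counts, current_counts, threshold=0.2):
    """Flag a retrain when drift exceeds the (assumed) threshold."""
    return psi(reference_counts, current_counts) > threshold
```

A check like this would run per feature on a schedule, comparing the histogram seen at pre-training time against recent target traffic.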

Operating cadences (a recurring transfer-learning question)

A real production question for any transfer-learning system is how often each layer of reuse is refreshed:

  • Pre-trained shared embeddings — refreshed rarely (months / quarters); expensive to re-train; cross-tenant amortized.
  • Per-target fine-tuned heads — refreshed often (days / weeks); cheap; per-tenant.
  • Per-target feature configurations — refreshed continuously as feature availability and distributions change.

The split between a slow shared cadence and fast per-target cadences is itself a load-bearing property: it amortises the heavy work across all targets while keeping target-specific tuning fast and cheap.
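
The cadence tiers can be expressed as a small scheduling check. This is a sketch under stated assumptions: the artifact names, the concrete intervals (90 days, 7 days, 1 day), and the dates are hypothetical stand-ins for "months / quarters", "days / weeks", and "continuously".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Artifact:
    name: str
    refresh_every: timedelta   # target cadence for this tier
    last_refreshed: datetime

def due_for_refresh(artifacts, now):
    """Return the names of artifacts whose refresh interval has elapsed."""
    return [a.name for a in artifacts if now - a.last_refreshed >= a.refresh_every]

now = datetime(2025, 1, 31)
artifacts = [
    Artifact("shared_embeddings", timedelta(days=90), datetime(2025, 1, 1)),
    Artifact("partner_head", timedelta(days=7), datetime(2025, 1, 20)),
    Artifact("feature_config", timedelta(days=1), datetime(2025, 1, 30)),
]
due = due_for_refresh(artifacts, now)
```

Here only the per-target head and feature configuration come due; the expensive shared embeddings are left alone until their longer interval elapses.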
