
PATTERN

Frozen encoder + linear probe

Intent

Train a large self-supervised encoder once, freeze it, and deploy it across many downstream tasks with each task getting its own small linear head fitted on top of the frozen features. The encoder is the shared representation layer; the per-task linear layer is cheap, fast to train, cheap to deploy, and scoped to one task at a time.

The pattern is both an evaluation methodology (linear probe evaluation) and a deployment architecture — in well-designed systems, those two are the same thing so evaluation numbers reflect production behaviour directly.

Canonical instance — Netflix MediaFM

MediaFM is the canonical wiki instance (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

  • Encoder: BERT-style Transformer pre-trained with Masked Shot Modeling on tens of millions of shot-level embeddings from the Netflix catalog.
  • Frozen after pre-training — downstream training does not update MediaFM's parameters.
  • Linear probes on top for five downstream Netflix tasks: ad relevancy, clip popularity ranking, clip tone classification, clip genre classification, clip retrieval.
  • Each probe is a single task-specific linear layer from the frozen per-shot embedding space to the task's output.
  • Deployment is identical to evaluation: the production ad-relevancy model is literally the frozen MediaFM + an AP-optimised linear layer.
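The shape of the pattern can be sketched in a few lines of numpy. Everything here is illustrative: the "encoder" is a fixed random projection standing in for the pre-trained Transformer (MediaFM's real embeddings are 2304-dim; this sketch uses 64), and the per-task heads are least-squares linear fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen encoder: a fixed mapping from raw
# inputs to feature vectors, never updated by downstream training.
W_frozen = rng.normal(size=(16, 64))

def frozen_encoder(x: np.ndarray) -> np.ndarray:
    """Shared representation layer; parameters are frozen."""
    return np.tanh(x @ W_frozen)

def fit_linear_probe(feats: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Per-task head: one linear layer (plus bias), fit by least squares."""
    F = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(F, targets, rcond=None)
    return w

def probe_predict(w: np.ndarray, feats: np.ndarray) -> np.ndarray:
    F = np.hstack([feats, np.ones((len(feats), 1))])
    return F @ w

# Shared frozen features, computed once; one cheap head per downstream task.
X = rng.normal(size=(200, 16))
feats = frozen_encoder(X)            # reused by every task below
tasks = {                            # placeholder regression targets
    "tone":  rng.normal(size=200),
    "genre": rng.normal(size=200),
}
heads = {name: fit_linear_probe(feats, y) for name, y in tasks.items()}
```

The key property is that `frozen_encoder` appears only on the feature side: adding a sixth task adds one more `fit_linear_probe` call, never a backbone update.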

Why this pattern

  • Amortise expensive pre-training across many tasks. The foundation model is expensive to train once; cheap to consume afterwards. MediaFM runs pre-training at Netflix-catalog scale, then every team building a new classifier gets a 2304-dim per-shot feature vector and trains its own linear layer.
  • Decouple team timelines. The ad team, popularity team, tone team, genre team, and retrieval team don't block on each other — they all consume the same frozen encoder independently.
  • Cheaper than per-task fine-tuning. Fine-tuning the full backbone per task multiplies GPU-memory + training-wall-clock cost by the number of tasks; linear probes multiply by ~nothing.
  • Maintainability. Upgrade the encoder once → all downstream tasks benefit (after re-running the per-task linear fits). Upgrading a per-task fine-tune requires coordinated retraining across teams.
  • Eval-production parity. When your evaluation methodology is the deployment architecture, there's no train-serving skew risk from "evaluated with linear probes, deployed with full fine-tuning" asymmetry.

Tradeoffs

  • Cost — upper-bound on task-specific quality. A linear head can only exploit linear separability of the frozen features. Some tasks with strong non-linear interactions (rare event detection, highly compositional classification) may need MLP heads or partial fine-tuning for last-mile quality.
  • Cost — encoder weaknesses propagate. If the frozen encoder is biased or missing a concept, no downstream linear layer can paper over it. Fixing requires re-pre-training + full re-probe rollout.
  • Cost — re-probe on encoder upgrade. Swapping MediaFM-v1 for MediaFM-v2 means every downstream task's linear layer is obsolete — coordinated refit required across teams.
  • Win — feature reuse. Once the per-title MediaFM inference cache is populated, every downstream task reads from the same cache; no per-task inference cost beyond the tiny linear head.
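The feature-reuse win can be made concrete with a minimal cache sketch. `encode` is a hypothetical stand-in for the expensive frozen encoder; the point is that five task reads trigger exactly one encoder invocation.

```python
import numpy as np
from functools import lru_cache

ENCODER_CALLS = 0

def encode(title_id: str) -> np.ndarray:
    """Hypothetical expensive frozen-encoder call (counted for the demo)."""
    global ENCODER_CALLS
    ENCODER_CALLS += 1
    rng = np.random.default_rng(0)   # deterministic placeholder features
    return rng.normal(size=8)

@lru_cache(maxsize=None)
def cached_features(title_id: str) -> tuple:
    # Cache keyed by input id; tuples are hashable and immutable.
    return tuple(encode(title_id))

# Five hypothetical downstream tasks all read the same cached features.
for _task in ["ads", "popularity", "tone", "genre", "retrieval"]:
    feats = np.array(cached_features("title-123"))
```

In production the cache would be a feature store keyed by title/shot rather than an in-process `lru_cache`, but the invariant is the same: per-task inference cost is only the tiny linear head.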

When to fit the probe non-linearly

Some reported cases promote linear probes to small MLP probes ("MLP probe") or attention-pooled heads — the escape valve when linear separability is insufficient. This keeps the pattern (frozen encoder, task-specific small head) while acknowledging some tasks need a little more expressivity than pure linear.

MediaFM reportedly succeeds with purely linear probes across all five of its tasks.
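The "linear separability is insufficient" failure mode has a classic minimal example: XOR. The sketch below fits the same least-squares linear head twice, once on raw features (where it is provably stuck) and once on a small fixed random-ReLU feature layer standing in for an MLP probe. All names are mine, not MediaFM's.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: no linear head on the raw 2-D inputs can fit these labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def linear_head(F: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares linear head with a bias column; returns predictions."""
    Fb = np.hstack([F, np.ones((len(F), 1))])
    w, *_ = np.linalg.lstsq(Fb, y, rcond=None)
    return Fb @ w

# Linear probe on raw features: the optimum is the constant 0.5,
# leaving a squared error of 1.0 over the four points.
lin_err = float(np.sum((linear_head(X, y) - y) ** 2))

# "MLP probe" stand-in: one fixed random hidden layer of ReLU features,
# then the same linear fit on top -- enough non-linearity to fit XOR exactly.
W = rng.normal(size=(2, 64))
b = rng.normal(size=64)
H = np.maximum(X @ W + b, 0.0)
mlp_err = float(np.sum((linear_head(H, y) - y) ** 2))
```

A real MLP probe would train its hidden layer rather than freeze a random one, but the structural point is identical: the head gains expressivity while the encoder stays frozen.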

Relationship to "full fine-tuning"

                          Frozen + linear probe                       Full fine-tune
Training cost per task    Minutes (linear fit)                        Hours-days (backbone update)
GPU memory per task       Tiny (frozen activations cached)            Full model
Per-task quality ceiling  Bounded by linear separability of features  Task-specific ceiling
Deployability             Single backbone, many heads                 Many full models
Maintainability           Central upgrade                             Distributed upgrade

For Netflix's catalogue-scale production with many clip-level downstream teams, the frozen + linear pattern is a deliberate architectural choice trading some per-task ceiling for massive operational-overhead savings.

Implementation checklist

  • Pre-train once. Large self-supervised run; freeze on completion.
  • Cache per-input features. MediaFM caches per-title contextualised shot embeddings; downstream tasks read from the cache, never re-invoke the encoder for the same input.
  • Per-task linear fit. Each downstream team trains their own linear head on their own labels with their own metric.
  • Evaluation in the deployment architecture. Don't report fine-tune numbers then ship linear-probe inference or vice versa — match evaluation methodology to production architecture to avoid train-serving skew.
  • Plan for encoder upgrades. Budget coordinated re-probe effort across teams when a new encoder version ships.
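The eval-parity item in the checklist is worth a sketch: the head you score offline should be the identical artifact production serves. This toy version fits a linear head on a train split of (synthetic) cached frozen features, reports held-out MSE, and "ships" the same object; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

feats = rng.normal(size=(300, 32))               # cached frozen features
w_true = rng.normal(size=32)
labels = feats @ w_true + 0.1 * rng.normal(size=300)

train, test = slice(0, 240), slice(240, 300)

# Per-task linear fit on the team's own labels (least squares + bias).
Fb = np.hstack([feats[train], np.ones((240, 1))])
head, *_ = np.linalg.lstsq(Fb, labels[train], rcond=None)

# Held-out evaluation runs the exact deployment architecture:
# frozen features in, the fitted linear head on top, nothing else.
Fb_test = np.hstack([feats[test], np.ones((60, 1))])
eval_mse = float(np.mean((Fb_test @ head - labels[test]) ** 2))

production_head = head    # the evaluated artifact is the shipped artifact
```

Because evaluation and serving share the same code path and the same weights, the reported `eval_mse` is a direct statement about production behaviour rather than a proxy for it.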

When this fits

  • Many downstream tasks, similar input domain. MediaFM's five Netflix tasks all consume the same shot-embedding input from the same catalog.
  • Linear separability is plausible (verifiable empirically).
  • Operational cost matters — GPU fleet + team coordination budget constrain full fine-tuning rollouts.
  • Eval-production parity matters — regulated / high-stakes domains where "evaluation numbers must reflect production behaviour" is a requirement, not a nice-to-have.

When it doesn't fit

  • One downstream task only. Full fine-tuning may win on absolute quality; the frozen-encoder investment isn't amortised.
  • Highly non-linear task structure. MLP probes at minimum, or full fine-tuning, needed.
  • Backbone needs task-specific adaptation. Some domain shifts (e.g. medical imaging on a web-image-pre-trained CLIP) need the backbone itself to learn the new distribution; no linear head on unchanged features can provide that adaptation.

Caveats

  • Netflix doesn't describe the per-task linear-layer regularisation, training regime, or held-out evaluation protocol.
  • The pattern assumes stable upstream — if MediaFM's per-shot features shift between runs (instability in upstream wav2vec2 / SeqCLIP / OpenAI API), downstream linear heads can silently degrade.
  • OpenAI API dependency in MediaFM means the "frozen encoder" is partially a moving target — text-embedding-3-large's behaviour can change if OpenAI updates it without a version change.
