
CONCEPT

Linear probe evaluation

Definition

Linear probe evaluation is a standard evaluation methodology for self-supervised foundation models where:

  1. The foundation model's backbone is frozen — its parameters are not updated during downstream training.
  2. For each downstream task, a linear layer (or sometimes a small MLP head) is trained on top of the frozen features, mapping them to the task's output (class, score, rank).
  3. The task metric achieved by this linear-head + frozen-backbone combination is taken as a proxy for the quality of the representations the foundation model has learnt.
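The three steps above can be sketched end-to-end. This is a minimal illustration on synthetic data using scikit-learn; `frozen_backbone` is a hypothetical stand-in for a real encoder, not MediaFM's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def frozen_backbone(x):
    # Hypothetical stand-in: in practice, run the frozen foundation model
    # over the downstream dataset once and cache the embeddings.
    return x

# Synthetic downstream task: 200 samples, 16-dim features, binary label.
X = rng.normal(size=(200, 16))
y = (X @ rng.normal(size=16) > 0).astype(int)

feats = frozen_backbone(X)          # step 1: backbone frozen, no gradients
probe = LogisticRegression(max_iter=1000)
probe.fit(feats, y)                 # step 2: train only the linear head
acc = accuracy_score(y, probe.predict(feats))  # step 3: metric as proxy
print(f"linear-probe accuracy: {acc:.2f}")
```

The key property is that only the probe's weight matrix is learned; the feature extractor never changes, so the metric reflects what the features already contain.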

The methodology is attractive because:

  • It isolates the backbone's representational quality from downstream architecture or fine-tuning engineering. Two models can be compared on the same linear probe; differences in probe metric reflect differences in the learned features.
  • Cheap — one linear layer per task vs fine-tuning the full backbone per task.
  • Standard — used by CLIP, SimCLR, MAE, BERT, DINOv2, VideoMAE, VideoPrism, and most self-supervised works.

(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)

Canonical wiki reference — MediaFM

Netflix's MediaFM uses linear probes as its sole evaluation posture across five downstream Netflix tasks:

"To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes)."

Five tasks, each with a task-appropriate metric:

  1. Ad Relevancy — multi-label AP.
  2. Clip Popularity Ranking — 10-fold Kendall's τ.
  3. Clip Tone — micro AP over 100 categories.
  4. Clip Genre — macro AP over 11 categories.
  5. Clip Retrieval — AP on 1:3 pos:neg binary "clip-worthy".
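The per-task metrics named above are standard and have off-the-shelf implementations. A minimal sketch on synthetic scores (these are the generic scipy/scikit-learn metrics, not Netflix's evaluation code):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Average precision (AP), as used for the classification-style tasks.
y_true = rng.integers(0, 2, size=100)
y_score = y_true * 0.7 + rng.normal(scale=0.3, size=100)
ap = average_precision_score(y_true, y_score)

# Kendall's tau, as used for popularity ranking: rank correlation
# between predicted and true popularity orderings.
true_pop = rng.normal(size=50)
pred_pop = true_pop + rng.normal(scale=0.5, size=50)
tau, _ = kendalltau(true_pop, pred_pop)

print(f"AP={ap:.2f}  Kendall tau={tau:.2f}")
```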

MediaFM's encoder is frozen; only the five linear layers are trained, one per task. The comparison to baselines (other video embeddings — VertexAI MM, Marengo, internal baselines) uses the same linear-probe harness, which enforces an apples-to-apples comparison of representations.

Why linear specifically

  • If the backbone encodes the signal, a linear classifier suffices. Linear separability of a representation is a strong indicator that the backbone has captured the task-relevant axes. If a linear probe fails but an MLP probe succeeds, the backbone has the information but not in a linearly-accessible form — which is a weaker form of "good representation".
  • Minimises confound from probe-architecture choices. MLP probes require width / depth / regularisation tuning; linear probes have essentially one knob (weight decay).
  • Fast to train. A linear layer on top of fixed features is a convex least-squares (or linear classification) problem — solvable in seconds per task.
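The "solvable in seconds" claim follows from the fact that, with features fixed, a regression-style probe has a closed-form solution. A sketch of the ridge-regression case (the "one knob" is the weight-decay strength `lam`; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 64
X = rng.normal(size=(n, d))                 # frozen features, fixed
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # the single probe hyperparameter: weight decay / ridge strength
# Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

mse = np.mean((X @ w - y) ** 2)
print(f"train MSE: {mse:.4f}")
```

Linear classification with cross-entropy loss has no closed form, but is still convex in the head weights, so it converges quickly and reliably.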

When linear probes under-sell a backbone

  • Tasks that genuinely need nonlinear combination of features — say, if the "tone" of a clip depends on interactions between audio pitch + visual colour that no single axis of the backbone captures linearly. MLP probes or fine-tuning would do better but the backbone may still be useful.
  • Tasks where the output is sequence-structured, not a single label — the linear probe is typically per-position; a structured-output head would do better but diverges from the "cheap, standard" spirit.

When linear probes over-sell a backbone

  • Features that happen to linearly separate the specific downstream label but fail to generalise — risk of overfitting the probe to the training split of the downstream data. Standard mitigation: regularised linear probes + cross-validation.
  • When the "frozen features" are actually averaged / pooled over a large input — the averaging may do most of the work, in which case the probe metric carries little signal about the backbone itself.
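The mitigation mentioned above (regularised probe plus cross-validation) can be sketched with scikit-learn; the data and regularisation strength here are illustrative, not from the source:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 32))                        # frozen features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# C is the inverse L2-regularisation strength; smaller C = stronger decay.
probe = LogisticRegression(C=0.1, max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)           # 5-fold CV
print(f"cv accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean over folds, rather than a single split, reduces the risk that the probe's number reflects a lucky train/test partition.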

Alternatives on the evaluation menu

| Method | Cost | What it measures |
| --- | --- | --- |
| Linear probe (frozen + linear head) | Cheapest | Linear separability of features |
| MLP probe (frozen + small MLP head) | Medium | Feature quality with some non-linearity |
| Full fine-tune | Highest | End-to-end performance including backbone plasticity |
| k-NN probe (no learned head) | Very cheap | Geometric separability in feature space |
| Zero-shot (no training at all) | Cheapest | Generalisation from pre-training to downstream |
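Two of these probes can be run side by side on the same frozen features to see what each measures. A sketch on synthetic, linearly separable features (hypothetical data; `n_neighbors=20` is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 16))                 # frozen features
y = (X @ rng.normal(size=16) > 0).astype(int)  # linearly separable labels
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Linear probe: learns a hyperplane over the features.
linear_acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
# k-NN probe: no learned head, only feature-space geometry.
knn_acc = KNeighborsClassifier(n_neighbors=20).fit(Xtr, ytr).score(Xte, yte)
print(f"linear probe: {linear_acc:.2f}  k-NN probe: {knn_acc:.2f}")
```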

MediaFM uses linear probes exclusively. VideoPrism and CLIP also report linear-probe, zero-shot, and fine-tune numbers across benchmarks.

Relationship to deployment

Linear probes have a second virtue in production: they are exactly the deployment architecture Netflix uses. MediaFM's downstream tasks are not deployed as fine-tuned backbones — they're deployed as the same frozen backbone + task-specific linear head that was evaluated. Evaluation posture == deployment posture, so the evaluation numbers reflect production behaviour directly.
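The "evaluation posture == deployment posture" point can be made concrete: the serving path applies the identical frozen encoder and the identical learned head. A hypothetical numpy sketch (all names and weights here are illustrative, not Netflix's serving code):

```python
import numpy as np

def frozen_backbone(x):
    # Hypothetical stand-in for the frozen encoder; identical at
    # evaluation time and at serving time.
    return np.tanh(x)

# Weights learned once by the linear probe during evaluation.
W = np.array([[0.5, -0.2], [0.1, 0.9]])
b = np.array([0.0, 0.1])

def serve(x):
    z = frozen_backbone(x)   # same features the probe was evaluated on
    return z @ W + b         # same linear head the probe trained

scores = serve(np.array([1.0, -1.0]))
print(scores)
```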

See patterns/frozen-encoder-linear-probe for the deployment pattern.

Caveats

  • Single-feature-vector input assumption. Linear probes typically consume a single vector per input (e.g. a pooled / averaged embedding). For per-token outputs like MediaFM's per-shot embeddings, choosing which embedding or how to pool for the probe is itself a design choice (the post doesn't describe the choice per task).
  • Data-split sensitivity. Linear-probe numbers swing with train/val/test splits; Netflix uses 10-fold for Kendall's τ on clip popularity, which is more robust than a single split, but not described for the other four tasks.
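The pooling design choice in the first caveat has several common options, any of which changes what the probe sees. A sketch over hypothetical per-shot embeddings (the 12-shot, 768-dim shapes are illustrative; the source does not describe MediaFM's per-task choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# A clip's frozen representation as a sequence of per-shot embeddings;
# the linear probe needs a single vector per clip.
per_shot = rng.normal(size=(12, 768))    # 12 shots, 768-dim each

mean_pooled = per_shot.mean(axis=0)      # one common choice
max_pooled = per_shot.max(axis=0)        # emphasises peak activations
first_shot = per_shot[0]                 # or a single designated embedding

print(mean_pooled.shape, max_pooled.shape, first_shot.shape)
```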
