
CONCEPT

Linear probe evaluation

Definition

Linear probe evaluation is a standard evaluation methodology for self-supervised foundation models where:

  1. The foundation model's backbone is frozen — its parameters are not updated during downstream training.
  2. For each downstream task, a linear layer (or sometimes a small MLP head) is trained on top of the frozen features, mapping them to the task's output (class, score, rank).
  3. The task metric achieved by this linear-head + frozen-backbone combination is taken as a proxy for the quality of the representations the foundation model has learnt.
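The three steps above can be sketched end-to-end. This is a minimal illustration on synthetic data using scikit-learn; `frozen_backbone` is a hypothetical stand-in for a real encoder, not MediaFM's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def frozen_backbone(x):
    # Hypothetical stand-in: in practice, run the frozen foundation model
    # over the downstream dataset once and cache the embeddings.
    return x

# Synthetic downstream task: 200 samples, 16-dim features, binary label.
X = rng.normal(size=(200, 16))
y = (X @ rng.normal(size=16) > 0).astype(int)

feats = frozen_backbone(X)          # step 1: backbone frozen, no gradients
probe = LogisticRegression(max_iter=1000)
probe.fit(feats, y)                 # step 2: train only the linear head
acc = accuracy_score(y, probe.predict(feats))  # step 3: metric as proxy
print(f"linear-probe accuracy: {acc:.2f}")
```

The key property is that only the probe's weight matrix is learned; the feature extractor never changes, so the metric reflects what the features already contain.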

The methodology is attractive because:

  • It isolates the backbone's representational quality from downstream architecture or fine-tuning engineering. Two models can be compared on the same linear probe; differences in probe metric reflect differences in the learned features.
  • Cheap — one linear layer per task vs fine-tuning the full backbone per task.
  • Standard — used by CLIP, SimCLR, MAE, BERT, DINOv2, VideoMAE, VideoPrism, and most self-supervised works.

(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)

Canonical wiki reference — MediaFM

Netflix's MediaFM uses linear probes as its sole evaluation posture across five downstream Netflix tasks:

"To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes)."

Five tasks, each with a task-appropriate metric:

  1. Ad Relevancy — multi-label AP.
  2. Clip Popularity Ranking — 10-fold Kendall's τ.
  3. Clip Tone — micro AP over 100 categories.
  4. Clip Genre — macro AP over 11 categories.
  5. Clip Retrieval — AP on 1:3 pos:neg binary "clip-worthy".
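The per-task metrics named above are standard and have off-the-shelf implementations. A minimal sketch on synthetic scores (these are the generic scipy/scikit-learn metrics, not Netflix's evaluation code):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Average precision (AP), as used for the classification-style tasks.
y_true = rng.integers(0, 2, size=100)
y_score = y_true * 0.7 + rng.normal(scale=0.3, size=100)
ap = average_precision_score(y_true, y_score)

# Kendall's tau, as used for popularity ranking: rank correlation
# between predicted and true popularity orderings.
true_pop = rng.normal(size=50)
pred_pop = true_pop + rng.normal(scale=0.5, size=50)
tau, _ = kendalltau(true_pop, pred_pop)

print(f"AP={ap:.2f}  Kendall tau={tau:.2f}")
```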

MediaFM's encoder is frozen; only the five linear layers are trained, one per task. The comparison to baselines (other video embeddings — VertexAI MM, Marengo, internal baselines) uses the same linear-probe harness, which enforces an apples-to-apples comparison of representations.

Why linear specifically

  • If the backbone encodes the signal, a linear classifier suffices. Linear separability of a representation is a strong indicator that the backbone has captured the task-relevant axes. If a linear probe fails but an MLP probe succeeds, the backbone has the information but not in a linearly-accessible form — which is a weaker form of "good representation".
  • Minimises confound from probe-architecture choices. MLP probes require width / depth / regularisation tuning; linear probes have essentially one knob (weight decay).
  • Fast to train. A linear layer on top of fixed features is a convex least-squares (or linear classification) problem — solvable in seconds per task.
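The "solvable in seconds" claim follows from the fact that, with features fixed, a regression-style probe has a closed-form solution. A sketch of the ridge-regression case (the "one knob" is the weight-decay strength `lam`; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 64
X = rng.normal(size=(n, d))                 # frozen features, fixed
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # the single probe hyperparameter: weight decay / ridge strength
# Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

mse = np.mean((X @ w - y) ** 2)
print(f"train MSE: {mse:.4f}")
```

Linear classification with cross-entropy loss has no closed form, but is still convex in the head weights, so it converges quickly and reliably.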

When linear probes under-sell a backbone

  • Tasks that genuinely need nonlinear combination of features — say, if the "tone" of a clip depends on interactions between audio pitch + visual colour that no single axis of the backbone captures linearly. MLP probes or fine-tuning would do better but the backbone may still be useful.
  • Tasks where the output is sequence-structured, not a single label — the linear probe is typically per-position; a structured-output head would do better but diverges from the "cheap, standard" spirit.

When linear probes over-sell a backbone

  • Features that happen to linearly separate the specific downstream label but fail to generalise — risk of overfitting the probe to the training split of the downstream data. Standard mitigation: regularised linear probes + cross-validation.
  • When the "frozen features" are actually averaged / pooled over a large input — the averaging may do most of the work, in which case the probe metric carries little signal about the backbone itself.
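The mitigation mentioned above (regularised probe plus cross-validation) can be sketched with scikit-learn; the data and regularisation strength here are illustrative, not from the source:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 32))                        # frozen features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# C is the inverse L2-regularisation strength; smaller C = stronger decay.
probe = LogisticRegression(C=0.1, max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)           # 5-fold CV
print(f"cv accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean over folds, rather than a single split, reduces the risk that the probe's number reflects a lucky train/test partition.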

Alternatives on the evaluation menu

| Method | Cost | What it measures |
| --- | --- | --- |
| Linear probe (frozen + linear head) | Cheapest | Linear separability of features |
| MLP probe (frozen + small MLP head) | Medium | Feature quality with some non-linearity |
| Full fine-tune | Highest | End-to-end performance including backbone plasticity |
| k-NN probe (no learned head) | Very cheap | Geometric separability in feature space |
| Zero-shot (no training at all) | Cheapest | Generalisation from pre-training to downstream |
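Two of these probes can be run side by side on the same frozen features to see what each measures. A sketch on synthetic, linearly separable features (hypothetical data; `n_neighbors=20` is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 16))                 # frozen features
y = (X @ rng.normal(size=16) > 0).astype(int)  # linearly separable labels
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Linear probe: learns a hyperplane over the features.
linear_acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
# k-NN probe: no learned head, only feature-space geometry.
knn_acc = KNeighborsClassifier(n_neighbors=20).fit(Xtr, ytr).score(Xte, yte)
print(f"linear probe: {linear_acc:.2f}  k-NN probe: {knn_acc:.2f}")
```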

MediaFM uses linear probes exclusively. VideoPrism and CLIP also report linear-probe, zero-shot, and fine-tune numbers across benchmarks.

Relationship to deployment

Linear probes have a second virtue in production: they are exactly the deployment architecture Netflix uses. MediaFM's downstream tasks are not deployed as fine-tuned backbones — they're deployed as the same frozen backbone + task-specific linear head that was evaluated. Evaluation posture == deployment posture, so the evaluation numbers reflect production behaviour directly.
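The "evaluation posture == deployment posture" point can be made concrete: the serving path applies the identical frozen encoder and the identical learned head. A hypothetical numpy sketch (all names and weights here are illustrative, not Netflix's serving code):

```python
import numpy as np

def frozen_backbone(x):
    # Hypothetical stand-in for the frozen encoder; identical at
    # evaluation time and at serving time.
    return np.tanh(x)

# Weights learned once by the linear probe during evaluation.
W = np.array([[0.5, -0.2], [0.1, 0.9]])
b = np.array([0.0, 0.1])

def serve(x):
    z = frozen_backbone(x)   # same features the probe was evaluated on
    return z @ W + b         # same linear head the probe trained

scores = serve(np.array([1.0, -1.0]))
print(scores)
```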

See patterns/frozen-encoder-linear-probe for the deployment pattern.

Caveats

  • Single-feature-vector input assumption. Linear probes typically consume a single vector per input (e.g. a pooled / averaged embedding). For per-token outputs like MediaFM's per-shot embeddings, choosing which embedding or how to pool for the probe is itself a design choice (the post doesn't describe the choice per task).
  • Data-split sensitivity. Linear-probe numbers swing with train/val/test splits; Netflix uses 10-fold for Kendall's τ on clip popularity, which is more robust than a single split, but not described for the other four tasks.
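The pooling design choice in the first caveat has several common options, any of which changes what the probe sees. A sketch over hypothetical per-shot embeddings (the 12-shot, 768-dim shapes are illustrative; the source does not describe MediaFM's per-task choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# A clip's frozen representation as a sequence of per-shot embeddings;
# the linear probe needs a single vector per clip.
per_shot = rng.normal(size=(12, 768))    # 12 shots, 768-dim each

mean_pooled = per_shot.mean(axis=0)      # one common choice
max_pooled = per_shot.max(axis=0)        # emphasises peak activations
first_shot = per_shot[0]                 # or a single designated embedding

print(mean_pooled.shape, max_pooled.shape, first_shot.shape)
```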
