Linear probe evaluation¶
Definition¶
Linear probe evaluation is a standard evaluation methodology for self-supervised foundation models where:
- The foundation model's backbone is frozen — its parameters are not updated during downstream training.
- For each downstream task, a linear layer (or sometimes a small MLP head) is trained on top of the frozen features, mapping them to the task's output (class, score, rank).
- The task metric achieved by this linear-head + frozen-backbone combination is taken as a proxy for the quality of the representations the foundation model has learnt.
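The setup above can be sketched end-to-end in a few lines of numpy. This is illustrative only — a fixed random projection stands in for the pretrained backbone, and the task is synthetic — but it shows the defining property: the backbone's weights are never touched, only the linear head's are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen backbone: a fixed nonlinear projection.
# In practice this is a pretrained encoder with gradients disabled.
W_backbone = rng.normal(size=(32, 8))          # frozen — never updated below

def backbone(x):
    return np.tanh(x @ W_backbone)             # frozen features

# Synthetic binary task whose label is linearly recoverable from features.
X = rng.normal(size=(512, 32))
F = backbone(X)
y = (F @ rng.normal(size=8) > 0).astype(float)

# Linear probe: logistic regression on frozen features, trained by GD.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(F @ w + b)))
    g = p - y                                  # log-loss gradient
    w -= 0.5 * F.T @ g / len(y)                # only the head is updated;
    b -= 0.5 * g.mean()                        # W_backbone stays frozen

acc = ((F @ w + b > 0) == (y > 0.5)).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the label here is linear in the frozen features, the probe recovers it; a probe on a backbone that had not encoded the signal would stay near chance.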
The methodology is attractive because:
- It isolates the backbone's representational quality from downstream architecture or fine-tuning engineering. Two models can be compared on the same linear probe; differences in probe metric reflect differences in the learned features.
- It is cheap — one linear layer per task, versus fine-tuning the full backbone for each task.
- It is standard — used by CLIP, SimCLR, MAE, BERT, DINOv2, VideoMAE, VideoPrism, and most self-supervised works.
(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)
Canonical wiki reference — MediaFM¶
Netflix's MediaFM uses linear probes as its sole evaluation posture across five downstream Netflix tasks:
"To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes)."
Five tasks, each with a task-appropriate metric:
- Ad Relevancy — multi-label AP.
- Clip Popularity Ranking — 10-fold Kendall's τ.
- Clip Tone — micro AP over 100 categories.
- Clip Genre — macro AP over 11 categories.
- Clip Retrieval — AP on 1:3 pos:neg binary "clip-worthy".
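The metric families above are all standard library calls. A hedged sketch (hypothetical scores, not MediaFM data) showing micro vs macro AP — the Tone/Genre distinction — plus Kendall's τ for the ranking task:

```python
import numpy as np
from sklearn.metrics import average_precision_score
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Hypothetical multi-label setup: 4 categories, 50 clips.
y_true = rng.integers(0, 2, size=(50, 4))
y_score = y_true * 0.6 + rng.random((50, 4)) * 0.8   # noisy but informative

# micro: pool all (clip, category) decisions; macro: average per-category APs.
micro_ap = average_precision_score(y_true, y_score, average="micro")  # Tone-style
macro_ap = average_precision_score(y_true, y_score, average="macro")  # Genre-style

# Ranking task: Kendall's tau between true and predicted popularity orderings.
true_pop = np.arange(20).astype(float)
pred_pop = true_pop + rng.normal(scale=2, size=20)
tau, _ = kendalltau(true_pop, pred_pop)
```

Micro AP weights frequent categories more (every prediction counts equally); macro AP weights every category equally, which matters when the 11 genres are imbalanced.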
MediaFM's encoder is frozen; only the five linear layers are trained, one per task. The comparison to baselines (other video embeddings — VertexAI MM, Marengo, internal baselines) uses the same linear-probe harness, which enforces an apples-to-apples comparison of representations.
Why linear specifically¶
- If the backbone encodes the signal, a linear classifier suffices. Linear separability of a representation is a strong indicator that the backbone has captured the task-relevant axes. If a linear probe fails but an MLP probe succeeds, the backbone has the information but not in a linearly-accessible form — which is a weaker form of "good representation".
- Minimises confound from probe-architecture choices. MLP probes require width / depth / regularisation tuning; linear probes have essentially one knob (weight decay).
- Fast to train. A linear layer on top of fixed features is a convex least-squares (or linear classification) problem — solvable in seconds per task.
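The convexity claim is what makes probing fast: for a ridge-regularised regression probe the optimum even has a closed form, w = (FᵀF + λI)⁻¹Fᵀy. A minimal sketch on synthetic frozen features (all shapes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen features for 1000 examples, 64 dims (stand-in for backbone outputs).
F = rng.normal(size=(1000, 64))
y = F @ rng.normal(size=64) + 0.1 * rng.normal(size=1000)   # linear signal + noise

# Ridge-regularised linear probe, solved in closed form — no SGD loop needed:
#   w = (F^T F + lam * I)^-1  F^T y
lam = 1.0
w = np.linalg.solve(F.T @ F + lam * np.eye(64), F.T @ y)

r2 = 1 - np.mean((F @ w - y) ** 2) / np.var(y)
```

The single knob mentioned above is `lam` (weight decay); everything else is determined by the frozen features.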
When linear probes under-sell a backbone¶
- Tasks that genuinely need nonlinear combination of features — say, if the "tone" of a clip depends on interactions between audio pitch + visual colour that no single axis of the backbone captures linearly. MLP probes or fine-tuning would do better but the backbone may still be useful.
- Tasks where the output is sequence-structured, not a single label — the linear probe is typically per-position; a structured-output head would do better but diverges from the "cheap, standard" spirit.
When linear probes over-sell a backbone¶
- Features that happen to linearly separate the specific downstream label but fail to generalise — risk of overfitting the probe to the training split of the downstream data. Standard mitigation: regularised linear probes + cross-validation.
- When the "frozen features" are actually averaged / pooled over a large input — the averaging may do most of the work and the probe may be low-value signal.
Alternatives on the evaluation menu¶
| Method | Cost | What it measures |
|---|---|---|
| Linear probe (frozen + linear head) | Cheap | Linear separability of features |
| MLP probe (frozen + small MLP head) | Medium | Feature quality with some non-linearity |
| Full fine-tune | Highest | End-to-end performance including backbone plasticity |
| k-NN probe (no learned head) | Very cheap | Geometric separability in feature space |
| Zero-shot (no training at all) | Cheapest | Generalisation from pre-training to downstream |
MediaFM uses linear probes exclusively. VideoPrism and CLIP report linear-probe, zero-shot, and fine-tune numbers across their benchmarks.
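The k-NN row in the table deserves a sketch, since it trains no head at all: classify each embedding by the majority label of its nearest neighbours in feature space. A minimal leave-one-out version on synthetic clusters (illustrative data, not any model's embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated feature clusters standing in for frozen embeddings.
F = np.concatenate([rng.normal(0, 1, (100, 16)), rng.normal(3, 1, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

def knn_probe_accuracy(F, y, k=5):
    """k-NN probe: no learned parameters, leave-one-out over the dataset."""
    d = np.linalg.norm(F[:, None] - F[None, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)                            # exclude self
    nn = np.argsort(d, axis=1)[:, :k]                      # k nearest neighbours
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)          # majority vote
    return (pred == y).mean()

acc = knn_probe_accuracy(F, y)
```

Where a linear probe measures linear separability, the k-NN probe measures whether same-label points are simply close together — an even weaker, cheaper notion of representational quality.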
Relationship to deployment¶
Linear probes have a second virtue in production: they are exactly the deployment architecture Netflix uses. MediaFM's downstream tasks are not deployed as fine-tuned backbones — they're deployed as the same frozen backbone + task-specific linear head that was evaluated. Evaluation posture == deployment posture, so the evaluation numbers reflect production behaviour directly.
See patterns/frozen-encoder-linear-probe for the deployment pattern.
Caveats¶
- Single-feature-vector input assumption. Linear probes typically consume a single vector per input (e.g. a pooled / averaged embedding). For per-token outputs like MediaFM's per-shot embeddings, choosing which embedding or how to pool for the probe is itself a design choice (the post doesn't describe the choice per task).
- Data-split sensitivity. Linear-probe numbers swing with train/val/test splits; Netflix uses 10-fold for Kendall's τ on clip popularity, which is more robust than a single split, but not described for the other four tasks.
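The first caveat is easy to make concrete: a per-shot encoder emits a variable-length sequence of vectors, and the probe needs exactly one. Each of the common reductions is a silent design choice (this sketch is hypothetical — the post does not describe MediaFM's per-task choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clip with 7 per-shot embeddings of dimension 16.
shot_embeddings = rng.normal(size=(7, 16))

mean_pooled = shot_embeddings.mean(axis=0)   # averages can wash out shot-level signal
max_pooled = shot_embeddings.max(axis=0)     # keeps per-dimension peaks instead
first_shot = shot_embeddings[0]              # or designate one shot (e.g. the first)

# All three are probe-ready (16,) vectors — but they can rank backbones differently.
```

When comparing backbones under a shared probe harness, the pooling rule should be fixed across all candidates, or it becomes a hidden confound in the comparison.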
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — canonical wiki source; explicit "we learn task-specific linear layers on top of frozen representations (i.e., linear probes)" framing applied across all five reported Netflix downstream tasks.
Related¶
- systems/netflix-mediafm — canonical wiki instance.
- patterns/frozen-encoder-linear-probe — the deployment pattern that this evaluation methodology aligns with.
- concepts/vector-embedding — general concept.