OpenAI text-embedding-3-large¶
Definition¶
text-embedding-3-large is OpenAI's large-tier text embedding
model in the text-embedding-3 family, described on the OpenAI
blog.
It produces a 3072-dim dense embedding by default (with
configurable lower dimensions via the dimensions parameter) from
arbitrary text input, served through the OpenAI embeddings API.
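Per OpenAI's documentation, the dimensions parameter shortens an embedding by truncating it and L2-renormalizing the result. A minimal sketch of that post-processing, using a random unit vector in place of a real API response:

```python
import numpy as np

def shorten_embedding(embedding, dim):
    """Truncate an embedding to `dim` components and L2-renormalize,
    matching the documented semantics of the `dimensions` parameter."""
    truncated = np.asarray(embedding, dtype=np.float64)[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 3072-dim text-embedding-3-large output vector.
full = np.random.default_rng(0).normal(size=3072)
full /= np.linalg.norm(full)

short = shorten_embedding(full, 256)
print(short.shape)  # (256,)
```

Because shortened embeddings stay unit-norm, cosine similarity remains meaningful at any chosen dimension.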
It is referenced in this wiki in the context of MediaFM, where it is used twice:
- As the timed-text-modality sub-encoder — per shot, closed captions / audio descriptions / subtitles for the shot's time range are passed through text-embedding-3-large to produce the text contribution to the shot's fused embedding.
- As the title-metadata encoder for MediaFM's [GLOBAL] token — title-level metadata ("synopses and tags") is passed through the same model, then the output is projected into the Transformer's hidden dimension and prepended as the second token of every training sequence.
(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)
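The [GLOBAL]-token wiring described above can be sketched as follows. The hidden size, the identity of the leading token, and the linear projection are illustrative assumptions, not details from the source:

```python
import numpy as np

rng = np.random.default_rng(42)
EMBED_DIM, HIDDEN_DIM = 3072, 1024  # 1024 is an assumed Transformer width

# Learned projection from text-embedding space into the Transformer width.
W_proj = rng.normal(scale=0.02, size=(EMBED_DIM, HIDDEN_DIM))

def build_sequence(title_metadata_emb, shot_tokens, leading_token):
    """Project the title-metadata embedding and prepend it as the
    [GLOBAL] token, placed second (after an unspecified leading token),
    ahead of the per-shot fused embeddings."""
    global_token = title_metadata_emb @ W_proj       # (HIDDEN_DIM,)
    return np.vstack([leading_token, global_token, shot_tokens])

title_emb = rng.normal(size=EMBED_DIM)        # stand-in for an API embedding
shots = rng.normal(size=(10, HIDDEN_DIM))     # 10 fused shot embeddings
lead = rng.normal(size=(1, HIDDEN_DIM))

seq = build_sequence(title_emb, shots, lead)
print(seq.shape)  # (12, 1024): leading token + [GLOBAL] + 10 shots
```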
Role as a multimodal building block¶
Netflix's MediaFM is a canonical instance of a tri-modal foundation model on top of frozen pre-trained unimodal encoders — text is handled via vendor API (OpenAI), not via a Netflix-trained text encoder. The trade-off Netflix accepts:
- Win: zero text-encoder training cost + access to a state-of-the-art text model by calling a public API.
- Cost: a hard vendor dependency in the hot path of every shot-level embedding computation + every new-title onboarding. The post does not describe Netflix's approach to caching embeddings, handling API outages, or versioning upgrades (re-embedding when OpenAI releases text-embedding-4).
- Ambivalence about modality source: MediaFM goes full-tilt internal for video (SeqCLIP) but full-tilt external for text, with audio coming from a third lab (Meta FAIR wav2vec2).
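Since the post leaves caching and versioning undescribed, the following is a purely illustrative sketch of one standard mitigation: keying cached embeddings on (model version, text hash), so that a model upgrade or a text change naturally forces re-embedding:

```python
import hashlib

class EmbeddingCache:
    """Illustrative in-memory cache; not Netflix's design. Keys bind
    each embedding to the model version, so upgrading the model (e.g. a
    future text-embedding-4) invalidates all old entries."""

    def __init__(self, model_version, embed_fn):
        self.model_version = model_version
        self.embed_fn = embed_fn  # would call the vendor API in real use
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = (self.model_version,
               hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(text)
        return self._store[key]

# Stand-in embedder: real code would call the embeddings API here.
cache = EmbeddingCache("text-embedding-3-large", embed_fn=lambda t: [len(t)])
cache.get("shot 1 captions")
cache.get("shot 1 captions")  # second call is served from cache
print(cache.misses)  # 1
```

A cache like this addresses cost and latency but not outages; the vendor dependency remains in the path of any uncached text.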
What's not disclosed¶
- Whether Netflix uses the full 3072-dim output or reduces dimensionality via the API's dimensions parameter.
- Caching / invalidation strategy for per-shot + per-title embeddings at Netflix catalog scale.
- Handling of non-English timed text (captions in 30+ languages for Netflix's catalog); text-embedding-3-large is marketed as multilingual but the effective quality across languages is not characterised in the post.
- How per-shot timed-text segments (potentially multiple caption lines per shot) are concatenated before embedding.
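On the last point, one plausible (hypothetical, not from the post) approach is to gather every caption line overlapping the shot's time range and join them into a single string for one embedding call. The overlap rule and the newline join here are assumptions:

```python
def timed_text_for_shot(shot_start, shot_end, captions):
    """Join caption lines overlapping [shot_start, shot_end) into one
    string for a single embedding call. `captions` is a list of
    (start, end, text) tuples; the overlap test and newline separator
    are illustrative choices."""
    lines = [text for (start, end, text) in captions
             if start < shot_end and end > shot_start]
    return "\n".join(lines)

captions = [
    (0.0, 2.0, "Previously on..."),
    (2.5, 4.0, "I never said that."),
    (6.0, 8.0, "[door slams]"),
]
print(timed_text_for_shot(2.0, 5.0, captions))
# I never said that.
```

How such choices interact with the model's token limit for long, caption-dense shots is likewise undisclosed.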
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding
— named as both the shot-level timed-text sub-encoder and the
title-level
[GLOBAL]-token metadata encoder inside MediaFM.
Related¶
- systems/netflix-mediafm — consumer (in this wiki).
- concepts/vector-embedding — general concept.
- patterns/tri-modal-embedding-fusion — consumption pattern.