OpenAI text-embedding-3-large¶
Definition¶
text-embedding-3-large is OpenAI's large-tier text embedding
model in the text-embedding-3 family, described on the OpenAI
blog.
It produces a 3072-dim dense embedding by default (with
configurable lower dimensions via the dimensions parameter) from
arbitrary text input, served through the OpenAI embeddings API.
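Per OpenAI's documentation, the dimensions parameter shortens an embedding by truncating it and L2-renormalizing the result. A minimal sketch of that post-processing, using a random unit vector in place of a real API response:

```python
import numpy as np

def shorten_embedding(embedding, dim):
    """Truncate an embedding to `dim` components and L2-renormalize,
    matching the documented semantics of the `dimensions` parameter."""
    truncated = np.asarray(embedding, dtype=np.float64)[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 3072-dim text-embedding-3-large output vector.
full = np.random.default_rng(0).normal(size=3072)
full /= np.linalg.norm(full)

short = shorten_embedding(full, 256)
print(short.shape)  # (256,)
```

Because shortened embeddings stay unit-norm, cosine similarity remains meaningful at any chosen dimension.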
It is referenced in this wiki in the context of MediaFM, where it is used twice:
- As the timed-text-modality sub-encoder — per shot, closed captions / audio descriptions / subtitles for the shot's time range are passed through text-embedding-3-large to produce the text contribution to the shot's fused embedding.
- As the title-metadata encoder for MediaFM's [GLOBAL] token — title-level metadata ("synopses and tags") is passed through the same model, then the output is projected into the Transformer's hidden dimension and prepended as the second token of every training sequence.
(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)
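The [GLOBAL]-token wiring described above can be sketched as follows. The hidden size, the identity of the leading token, and the linear projection are illustrative assumptions, not details from the source:

```python
import numpy as np

rng = np.random.default_rng(42)
EMBED_DIM, HIDDEN_DIM = 3072, 1024  # 1024 is an assumed Transformer width

# Learned projection from text-embedding space into the Transformer width.
W_proj = rng.normal(scale=0.02, size=(EMBED_DIM, HIDDEN_DIM))

def build_sequence(title_metadata_emb, shot_tokens, leading_token):
    """Project the title-metadata embedding and prepend it as the
    [GLOBAL] token, placed second (after an unspecified leading token),
    ahead of the per-shot fused embeddings."""
    global_token = title_metadata_emb @ W_proj       # (HIDDEN_DIM,)
    return np.vstack([leading_token, global_token, shot_tokens])

title_emb = rng.normal(size=EMBED_DIM)        # stand-in for an API embedding
shots = rng.normal(size=(10, HIDDEN_DIM))     # 10 fused shot embeddings
lead = rng.normal(size=(1, HIDDEN_DIM))

seq = build_sequence(title_emb, shots, lead)
print(seq.shape)  # (12, 1024): leading token + [GLOBAL] + 10 shots
```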
Role as a multimodal building block¶
Netflix's MediaFM is a canonical instance of a tri-modal foundation model on top of frozen pre-trained unimodal encoders — text is handled via vendor API (OpenAI), not via a Netflix-trained text encoder. The trade-off Netflix accepts:
- Win: zero text-encoder training cost + access to a state-of-the-art text model by calling a public API.
- Cost: a hard vendor dependency in the hot path of every shot-level embedding computation + every new-title onboarding. The post does not describe Netflix's approach to caching embeddings, handling API outages, or versioning upgrades (re-embedding when OpenAI releases text-embedding-4).
- Ambivalence about modality source: MediaFM goes full-tilt internal for video (SeqCLIP) but full-tilt external for text, with audio coming from a third lab (Meta FAIR wav2vec2).
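Since the post leaves caching and versioning undescribed, the following is a purely illustrative sketch of one standard mitigation: keying cached embeddings on (model version, text hash), so that a model upgrade or a text change naturally forces re-embedding:

```python
import hashlib

class EmbeddingCache:
    """Illustrative in-memory cache; not Netflix's design. Keys bind
    each embedding to the model version, so upgrading the model (e.g. a
    future text-embedding-4) invalidates all old entries."""

    def __init__(self, model_version, embed_fn):
        self.model_version = model_version
        self.embed_fn = embed_fn  # would call the vendor API in real use
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = (self.model_version,
               hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(text)
        return self._store[key]

# Stand-in embedder: real code would call the embeddings API here.
cache = EmbeddingCache("text-embedding-3-large", embed_fn=lambda t: [len(t)])
cache.get("shot 1 captions")
cache.get("shot 1 captions")  # second call is served from cache
print(cache.misses)  # 1
```

A cache like this addresses cost and latency but not outages; the vendor dependency remains in the path of any uncached text.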
What's not disclosed¶
- Whether Netflix uses the full 3072-dim output or reduces dimensionality via the API's dimensions parameter.
- Caching / invalidation strategy for per-shot + per-title embeddings at Netflix catalog scale.
- Handling of non-English timed text (captions in 30+ languages for Netflix's catalog); text-embedding-3-large is marketed as multilingual but the effective quality across languages is not characterised in the post.
- How per-shot timed-text segments (potentially multiple caption lines per shot) are concatenated before embedding.
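On the last point, one plausible (hypothetical, not from the post) approach is to gather every caption line overlapping the shot's time range and join them into a single string for one embedding call. The overlap rule and the newline join here are assumptions:

```python
def timed_text_for_shot(shot_start, shot_end, captions):
    """Join caption lines overlapping [shot_start, shot_end) into one
    string for a single embedding call. `captions` is a list of
    (start, end, text) tuples; the overlap test and newline separator
    are illustrative choices."""
    lines = [text for (start, end, text) in captions
             if start < shot_end and end > shot_start]
    return "\n".join(lines)

captions = [
    (0.0, 2.0, "Previously on..."),
    (2.5, 4.0, "I never said that."),
    (6.0, 8.0, "[door slams]"),
]
print(timed_text_for_shot(2.0, 5.0, captions))
# I never said that.
```

How such choices interact with the model's token limit for long, caption-dense shots is likewise undisclosed.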
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding
— named as both the shot-level timed-text sub-encoder and the
title-level
[GLOBAL]-token metadata encoder inside MediaFM.
Related¶
- systems/netflix-mediafm — consumer (in this wiki).
- concepts/vector-embedding — general concept.
- patterns/tri-modal-embedding-fusion — consumption pattern.