CLIP (Contrastive Language-Image Pretraining)¶
Definition¶
CLIP (Contrastive Language-Image Pretraining) is OpenAI's open-source multimodal embedding model (paper, arXiv:2103.00020) that learns to map both images and text into a shared vector space, such that a caption's embedding and the matching image's embedding are close under cosine similarity. It was trained via contrastive learning on ~400M (image, text) pairs scraped from the public web.
Because both modalities produce vectors in the same space, CLIP collapses what would otherwise be two separate retrieval systems (text-over-text + image-over-image) into one vector index that serves both text and image queries. A text query "cat" can retrieve image vectors; an image query can retrieve text vectors or other image vectors; both use the same nearest-neighbour lookup.
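The unified lookup can be sketched in a few lines; the vectors below are hypothetical stand-ins (4 dims for readability, not real CLIP outputs):

```python
import numpy as np

# Hypothetical, pre-computed CLIP embeddings for indexed images.
index = np.array([
    [0.9, 0.1, 0.0, 0.1],   # photo of a cat
    [0.1, 0.8, 0.2, 0.0],   # photo of a dog
    [0.0, 0.1, 0.9, 0.2],   # UI screenshot
], dtype=np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def nearest(query: np.ndarray) -> int:
    """Cosine nearest-neighbour: normalise, then argmax of dot products."""
    q = query / np.linalg.norm(query)
    return int(np.argmax(index @ q))

# A text query ("cat") and an image query (another cat photo) both land
# in the same space, so both hit the same lookup.
text_query = np.array([0.85, 0.15, 0.05, 0.1], dtype=np.float32)
image_query = np.array([0.8, 0.2, 0.1, 0.05], dtype=np.float32)
assert nearest(text_query) == nearest(image_query) == 0
```

The point is that `nearest` never needs to know which modality produced the query vector.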
Why it matters architecturally¶
Most embedding models are unimodal — a text model embeds only text, an image model embeds only images. Building multimodal retrieval on unimodal models requires either:
- Two indexes — one per modality, with a routing layer choosing which index a query hits (brittle when queries mix modalities).
- Modality translation — convert one modality to the other first (image → caption via a VLM, then text-embed the caption). Expensive, lossy, and adds an additional model to the hot path.
A multimodal model like CLIP sidesteps both — one index, one query path, no per-modality routing. For a product where the user query can be a screenshot, a selected frame, or a text string, this is the decisive architectural property.
Figma AI Search instance¶
"Figma currently uses the open source CLIP model, which is what is known as a multimodal embedding model. The model can take multiple forms of inputs (image and text) and output embeddings that are in the same space. This means that an embedding for the string 'cat' will be numerically similar to the embedding above, even though the first was generated with an image as input." (Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)
Shape of Figma's use:
- Designs search model. CLIP, with fine-tuning performed on images of user interfaces from public, free Community files. No training on private Figma files or customer data.
- Components search model. A very similar model to the designs model, fine-tuned specifically on publicly available Community UI kits. Also no training on private data.
- Deployed in AWS SageMaker for batched inference (systems/aws-sagemaker-endpoint).
- Input at inference time: a set of thumbnail image URLs (rendered by a headless server-side C++ Figma editor, thumbnailed via llvmpipe CPU rendering on newer EC2 instances, uploaded to S3). Inside the container, image download + resize + normalise are parallelised; batch size has a sweet spot, past which latency grows roughly linearly.
- Query-side: a screenshot / selection is embedded on-the-fly, and a text query is embedded on-the-fly; the same model consumes both.
- Figma initially tried embedding a textual JSON representation of the user's selection instead of rendering it to an image — image embeddings produced better results and shared the code path with screenshot queries, so the JSON route was dropped.
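The container-side flow described above (parallel download + resize + normalise, then fixed-size batches into the model) can be sketched roughly; every name and the placeholder `preprocess` body here are hypothetical, not Figma's code:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(url: str) -> list[float]:
    """Hypothetical stand-in for download + resize + normalise of one thumbnail."""
    # A real pipeline would fetch the image from S3 and produce a pixel tensor.
    return [float(len(url))]  # placeholder

def embed_batch(urls: list[str], batch_size: int = 32) -> list[list[float]]:
    # Per-image preprocessing is I/O-bound, so it parallelises well with threads.
    with ThreadPoolExecutor(max_workers=8) as pool:
        tensors = list(pool.map(preprocess, urls))
    # Inference runs over fixed-size batches; past the sweet spot, per-request
    # latency grows roughly linearly with batch size.
    batches = [tensors[i:i + batch_size] for i in range(0, len(tensors), batch_size)]
    return [t for batch in batches for t in batch]  # stand-in for model(batch)
```

The sweet-spot observation is why `batch_size` is a tuned constant rather than "as large as fits".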
Contrastive pretraining (mechanics, one level down)¶
CLIP was trained on a batch-level contrastive objective:
- For a batch of N (image, caption) pairs, compute the N × N similarity matrix of image-embeddings against text-embeddings.
- Optimize the N matching pairs (diagonal) to have high similarity and the N² - N non-matching pairs (off-diagonal) to have low similarity.
- Symmetric cross-entropy loss in both directions.
Net effect: the image encoder and text encoder learn a shared embedding space where semantic alignment across modalities is preserved.
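The three steps above can be written directly; a minimal numpy sketch of the symmetric loss (temperature value and array shapes are illustrative assumptions):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the N x N similarity matrix.

    img_emb, txt_emb: (N, D) arrays; row i of each comes from the
    same (image, caption) pair.
    """
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature  # (N, N): rows = images, cols = texts

    def xent(l: np.ndarray) -> float:
        # Correct "class" for row i is column i: the matching (diagonal) pair.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

# Perfectly aligned pairs drive the loss toward zero; mismatched pairs don't.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
assert clip_loss(emb, emb) < clip_loss(emb, rng.normal(size=(8, 16)))
```

Training minimises this loss jointly over both encoders, which is what forces the two modalities into one space.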
Operational profile¶
- Dimensionality. The original OpenAI CLIP releases produce 512- or 768-dim embeddings depending on the variant (ResNet / ViT backbone, in several sizes). Figma does not disclose which variant or dimension it uses. See concepts/vector-embedding on typical dimensions.
- Distance metric. Cosine similarity — CLIP is trained against a dot-product-with-normalization objective, so cosine is the correct pairing. See concepts/vector-similarity-search.
- Fine-tuning vs zero-shot. CLIP out of the box is famously strong zero-shot, but domain-specific fine-tuning (as Figma does for UI screenshots) meaningfully improves in-domain retrieval.
- Open-source. Figma's choice of CLIP is partly that OpenAI released weights + architecture; the model is reproducible and self-hostable (on SageMaker, in Figma's case), avoiding per-call vendor inference fees.
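On the distance-metric bullet above: cosine similarity is just the dot product of L2-normalised vectors, which is why normalise-then-dot-product indexes pair correctly with CLIP. A quick check with hypothetical vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalised = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

assert np.isclose(cosine, dot_of_normalised)
assert np.isclose(cosine, 24 / 25)  # (3*4 + 4*3) / (5 * 5)
```

In practice this means you can normalise vectors once at index time and serve cosine queries with plain inner-product search.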
When CLIP fits¶
- Products with mixed text + image queries where unifying the index is architecturally important (Figma AI Search, content-moderation systems, image-to-image similarity with textual augmentation).
- Teams that want to self-host an embedding model rather than calling a vendor API per embedding.
- Domains where zero-shot performance is competitive or where fine-tuning a reasonably sized checkpoint on domain-specific pairs is feasible.
When CLIP might not fit¶
- Text-only workloads. Overkill; a dedicated text embedding model (Titan V2, Cohere Embed, OpenAI text-embedding-3-*, BGE) may beat CLIP on text-to-text retrieval because it is optimized for exactly that.
- Very fine-grained visual distinctions outside CLIP's training distribution (medical imaging, technical schematics): fine-tuning or a domain-specific model is typically required.
- Embedding throughput bottlenecks — CLIP models are larger than typical text embedders; batched GPU/CPU inference economics need to pencil out.
See also¶
- concepts/vector-embedding — the general concept.
- concepts/vector-similarity-search — the retrieval primitive CLIP embeddings feed.
- systems/figma-ai-search — canonical production instance.
- systems/aws-sagemaker-endpoint — Figma's CLIP deployment surface.
Seen in¶
- sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma — CLIP named, arXiv cited, multimodal-same-space property explicitly called out as the architectural enabler; deployed in SageMaker, fine-tuned separately for designs and components on public Community data.