CONCEPT

Vector Embedding

A vector embedding is a dense numerical representation of a piece of unstructured data — text, image, audio, video, or document — produced by an embedding model, such that semantically similar inputs map to vectors that are close under a chosen distance metric (cosine, Euclidean, dot-product). The embedding is a fixed-length array of floats (commonly float32), with length (dimensionality) determined by the model (e.g. 1024 for Amazon Titan Text Embeddings V2 at its default config).

(Source: sources/2025-07-16-aws-amazon-s3-vectors-preview-launch)
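The "close under a chosen distance metric" property can be made concrete with a toy sketch. The 4-dim vectors below are hand-picked stand-ins, not real model outputs (a real model emits hundreds or thousands of float32 values), but the geometry is the same: semantically related inputs point in similar directions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = identical direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" -- real models emit e.g. 1024 float32 values.
cat     = np.array([0.9, 0.1, 0.0, 0.1], dtype=np.float32)
kitten  = np.array([0.8, 0.2, 0.1, 0.1], dtype=np.float32)
invoice = np.array([0.0, 0.1, 0.9, 0.4], dtype=np.float32)

# Semantically related inputs score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```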

Canonical framing (Channy Yun, AWS, 2025)

"Vectors are numerical representation of unstructured data created from embedding models. You use embedding models to generate vector embeddings of your data and store them in S3 Vectors to perform semantic searches."

"Vector search is an emerging technique used in generative AI applications to find similar data points to given data by comparing their vector representations using distance or similarity metrics."

What the embedding enables

  1. Semantic search — find documents about the same concept, not just those containing the same keywords.
  2. Retrieval-Augmented Generation (RAG) — given a user query, embed it, find nearest-neighbour embeddings in a corpus, retrieve those documents, feed them as context to an LLM.
  3. Recommendation / similarity — "more like this" over items without a hand-curated similarity function.
  4. Clustering / deduplication — group near-duplicate content.
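The RAG retrieval step (embed the query, rank the corpus by similarity, take the top-k) can be sketched end to end. The `embed` function here is a hypothetical stand-in, a deterministic bag-of-words hash, not a real embedding model; only the retrieval shape is the point.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: hashes each word
    (by character-code sum) into a fixed-size vector, then L2-normalizes."""
    v = np.zeros(dim, dtype=np.float32)
    for word in text.lower().split():
        v[sum(ord(c) for c in word) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

corpus = [
    "how to bake sourdough bread",
    "tuning garbage collection in the JVM",
    "sourdough starter feeding schedule",
]
corpus_vecs = np.stack([embed(d) for d in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Vectors are unit-norm, so the dot product IS cosine similarity.
    scores = corpus_vecs @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

# The retrieved documents would then be fed to an LLM as context.
print(retrieve("sourdough bread recipe"))
```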

Dimensionality

Each model produces vectors at a fixed dimensionality. All vectors stored in a single S3 Vectors index must share dimensionality — this is the model's output shape, pinned at index creation time.

Common dimensionalities: 384 (MiniLM), 768 (BERT-base, Titan V1), 1024 (Titan Text Embeddings V2 default, Cohere), 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large).
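The "pinned at index creation time" behaviour amounts to a shape check on every write. A minimal sketch, this `VectorIndex` class is an illustration and not the S3 Vectors API:

```python
import numpy as np

class VectorIndex:
    """Toy index with dimensionality fixed at creation time
    (S3 Vectors behaves analogously; this is not its API)."""
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: dict[str, np.ndarray] = {}

    def put(self, key: str, vec: np.ndarray) -> None:
        if vec.shape != (self.dim,):
            raise ValueError(f"expected ({self.dim},) vector, got {vec.shape}")
        self.vectors[key] = vec.astype(np.float32)

index = VectorIndex(dim=1024)                           # Titan V2 default shape
index.put("doc-1", np.zeros(1024, dtype=np.float32))    # accepted
try:
    index.put("doc-2", np.zeros(768, dtype=np.float32)) # BERT-base shape: rejected
except ValueError as e:
    print(e)
```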

Byte cost

The launch post flags the structural cost issue: for text-heavy corpora like code or PDFs, "the vectors themselves were often more bytes than the data being indexed". A 4 KB document chunk embedded at 1024 dimensions in float32 yields 1024 × 4 bytes = 4 KB of vector, as many bytes as the chunk itself. This is the motivator for storage-tier pricing: vectors demand cheap bulk storage as much as or more than their source data. (Source: sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3)
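The arithmetic is worth spelling out, since it compounds with chunking (chunk sizes and counts below are illustrative, not from the source):

```python
def embedding_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one embedding: dimensions x bytes per float32 value."""
    return dim * bytes_per_value

# One 1024-dim float32 vector is 4 KB -- the size of a 4 KB source chunk.
print(embedding_bytes(1024))        # 4096 bytes

# Chunking compounds it: a 100 KB PDF split into 25 chunks of 4 KB
# carries 25 embeddings = 100 KB of vectors on top of the source bytes.
print(25 * embedding_bytes(1024))   # 102400 bytes
```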

Pairing with distance metrics

Embedding models are trained against a specific distance metric (most often cosine or inner-product). Querying with the wrong metric can materially reduce recall:

"When creating vector embeddings, select your embedding model's recommended distance metric for more accurate results."

See concepts/vector-similarity-search for metric choices.
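One way the wrong metric hurts: dot product rewards vector magnitude while cosine rewards direction only, so on unnormalized vectors the two can rank the same candidates in opposite orders. A contrived 2-dim illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
a = np.array([0.9, 0.1])   # nearly the query's direction, small magnitude
b = np.array([3.0, 3.0])   # 45 degrees off, large magnitude

print(np.dot(query, a), np.dot(query, b))   # dot product prefers b (magnitude)
print(cosine(query, a), cosine(query, b))   # cosine prefers a (direction)
```

A model trained for cosine typically emits vectors whose magnitudes carry no meaning, which is exactly why querying such vectors with raw inner product degrades recall.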

Summed-attribute embeddings in sequence modeling

Recommendation systems often build a per-action embedding by summing embeddings of that action's attributes rather than allocating a token per attribute. Airbnb's destination recommender sums embedding(city) + embedding(region) + embedding(days-to-today) to get a single per-action token that a transformer attention layer then aggregates across the sequence. Summation (vs concatenation) keeps dimensionality fixed and shares gradients across attributes, effectively letting the model learn joint attribute geometry. (Source: sources/2026-03-12-airbnb-destination-recommendation-transformer)
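The composition pattern described above can be sketched with random embedding tables. Everything here is illustrative (the dimension, tables, and attribute names are hypothetical, not Airbnb's actual model); the point is that summation requires all tables to share one width, which concatenation would not.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # shared embedding width (hypothetical)

# One embedding table per attribute; all share DIM, which is what
# makes summation (unlike concatenation) well-defined.
city_emb   = {"paris": rng.normal(size=DIM), "kyoto": rng.normal(size=DIM)}
region_emb = {"europe": rng.normal(size=DIM), "asia": rng.normal(size=DIM)}
days_emb   = {d: rng.normal(size=DIM) for d in range(30)}

def action_token(city: str, region: str, days_to_today: int) -> np.ndarray:
    """Per-action token = sum of attribute embeddings. Fixed DIM regardless
    of attribute count; gradients flow to every table through the sum."""
    return city_emb[city] + region_emb[region] + days_emb[days_to_today]

# A user's action history becomes a sequence of per-action tokens,
# which a transformer attention layer then aggregates.
seq = np.stack([action_token("paris", "europe", 3),
                action_token("kyoto", "asia", 10)])
print(seq.shape)
```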

See concepts/user-action-as-token for the full sequence-modeling framing this composition serves.

Multimodal: text and images in one space

Some embedding models are multimodal — a single model embeds multiple input types (text + image, or text + audio) into the same vector space such that semantically matched pairs (a caption and its image) embed closely. The architectural implication is large: one vector index serves queries in either modality, no routing or translation layer needed.

OpenAI CLIP is the canonical open-source multimodal text+image model. Figma AI Search uses CLIP precisely for this property: users query by screenshot, by a frame selection rendered to a screenshot, or by text, and all three hit the same OpenSearch k-NN index.

"The model can take multiple forms of inputs (image and text) and output embeddings that are in the same space. This means that an embedding for the string 'cat' will be numerically similar to the embedding above, even though the first was generated with an image as input." (Figma Engineering, 2026-04-21)

A text-only or image-only embedding model cannot be substituted for a multimodal model here: each would produce vectors in its own space, and cross-modal nearest-neighbour queries across those spaces would be meaningless.
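The architectural payoff, one index serving both modalities, reduces to the sketch below. The vectors are hand-picked stand-ins (not real CLIP outputs): in a trained multimodal model, the text "cat" and an image of a cat would embed near each other, so a text query retrieves the image with no routing layer.

```python
import numpy as np

# Toy shared space; keys record (modality, item) only for display.
index = {
    ("text",  "cat"):       np.array([0.90, 0.10, 0.00]),
    ("image", "cat.png"):   np.array([0.85, 0.15, 0.05]),
    ("image", "chart.png"): np.array([0.00, 0.20, 0.90]),
}

def nearest(query_vec: np.ndarray, k: int = 2) -> list:
    """k-NN over ONE index holding both text- and image-derived vectors."""
    def cos(v):
        return np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
    return sorted(index, key=lambda key: -cos(index[key]))[:k]

# A text-side query vector pulls back the matching image, same index,
# no per-modality routing or translation step.
print(nearest(np.array([0.90, 0.10, 0.00])))
```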

Image vs JSON-text embeddings — a Figma datapoint

Figma initially tried embedding a textual JSON representation of the user's Figma-layer selection rather than rendering it to an image first. Image-derived embeddings produced better results and let screenshot-based queries share the same code path, so the JSON route was dropped (Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma). For multimodal models like CLIP trained heavily on image inputs, the "render to an image first" preprocessing is a stronger signal than any structural textual proxy.
