Skip to content

CONCEPT Cited by 1 source

Embedding Collection

Definition

An embedding collection is the organizational unit of a vector database: a named, schema-pinned bucket of vector embeddings that share

  • dimensionality (fixed by the producing embedding model),
  • distance metric (cosine / Euclidean / inner product — also determined by the producing model),
  • index structure (HNSW / IVF / DiskANN / flat — determined at collection-creation time),
  • and, in platform-grade deployments, metadata pinning the collection to a specific producing model / model version, consuming service, and schema.

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

Why the collection is a first-class unit

Vectors on their own are opaque floats. The collection is the boundary at which "these vectors are comparable to each other and meaningful to this consumer" becomes enforceable:

  1. Comparability. Similarity search only makes sense between vectors produced by the same model version under the same distance metric. The collection pins both, so a query vector and the corpus it's queried against always agree.
  2. Governance and discovery. When different teams want to know whether a corpus-of-interest already has embeddings, the collection is the searchable unit — keyed by (service, model, version, schema).
  3. Evolution without breakage. Changing embedding model, dimensionality, distance metric, or index is a new collection — not an in-place upgrade. Two versions coexist; consumers cut over when ready.
  4. Lifecycle primitive. Creation, backfill, re-index, deletion, and retention are all collection-scoped operations.

Platform realization: Expedia

Expedia's Embedding Store Service uses systems/feast to register each collection with structured metadata: associated service (system that generates / consumes the embeddings) + embedding model that produced them. Three named payoffs:

  1. Data consistency"the collection definition guarantees that all embeddings in a collection are linked to consistent metadata, such as the model and service they are associated with."
  2. Search and discoverability"users can easily locate collections based on components of its metadata, such as a specific model or version."
  3. Version management"multiple versions of the same dataset, tailored to different needs and scenarios, can be created based on various factors such as different embedding models or model versions, modifying the indexing algorithms to suit various use cases or modifying the schema."

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

Typical collection metadata

Platform-grade collections carry:

Field Role
name Human-addressable handle
model + model_version Pins comparability and distance metric
dimensionality Validated on every write
distance_metric Cosine / Euclidean / dot product
index_type HNSW / IVF / DiskANN / flat — trade off recall vs cost
schema Additional columns the vectors carry for hybrid search filters
associated_service Consumer / producer for discovery + ownership
created_at, version Lineage for restore / rollback

Why this is a concept and not just a data-model detail

The collection structurally determines:

  • What vectors are comparable — preventing cross-model queries that would silently give bad recall.
  • What hybrid filters are possible — the schema defines the attributes a hybrid search can filter on.
  • How migrations are staged — new collection + backfill from the offline store + dual serving during rollout + cutover.
  • Who owns what — the associated_service field is the ownership anchor in a multi-team ML platform.

Seen in

Last updated · 200 distilled / 1,178 read