
Semantic ID

Definition

A Semantic ID is a hierarchical, discrete content representation derived through coarse-to-fine discretization of continuous content embeddings. Each item is assigned a tuple of integer codes — e.g. (c₁, c₂, c₃, ...) — where each code comes from a separate codebook at a different granularity, and shared prefixes indicate shared semantic category. Popularised by the "Recommender Systems with Generative Retrieval" line of work (arXiv:2305.05065).

Structurally: take a content embedding, quantise it with a learned vector quantiser at level 1 (broad classes — say 256 clusters), then at level 2 (finer clusters, conditional on the level-1 assignment), and so on to a chosen depth. The resulting prefix hierarchy gives a stable, category-like notion of semantics that plain embeddings lack.
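The assignment step can be sketched with residual quantisation, one common way to make each level conditional on the previous ones. This is a minimal numpy sketch with randomly initialised codebooks standing in for learned ones (in practice they would come from training, e.g. an RQ-VAE); the function name and shapes are illustrative.

```python
import numpy as np

def assign_semantic_id(embedding, codebooks):
    """Assign a coarse-to-fine Semantic ID via residual quantisation.

    codebooks: list of (K_l, d) arrays, one per level. Each level
    quantises the residual left by the previous level, so later
    codes refine earlier ones.
    """
    codes = []
    residual = embedding.copy()
    for centroids in codebooks:
        # Nearest centroid at this level of granularity
        dists = np.linalg.norm(centroids - residual, axis=1)
        c = int(np.argmin(dists))
        codes.append(c)
        # Subtract the chosen centroid; the next level quantises the rest
        residual = residual - centroids[c]
    return tuple(codes)

rng = np.random.default_rng(0)
d = 16
# Hypothetical setup: 3 levels, 256 codes each
codebooks = [rng.normal(size=(256, d)) for _ in range(3)]
item = rng.normal(size=d)
sid = assign_semantic_id(item, codebooks)  # a 3-tuple of integer codes
```

Two items whose embeddings fall in the same coarse cluster share the first code; agreement on deeper codes indicates progressively finer semantic similarity.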

Why Semantic ID alongside embeddings

Canonical framing (Source: sources/2026-04-07-pinterest-evolution-of-multi-objective-optimization-at-pinterest-home):

"While embeddings are excellent at capturing how close two Pins are, they do not always provide a stable, category-like notion of semantics that is useful for controlling diversity. Semantic IDs provide a hierarchical representation derived through coarse-to-fine discretization of content representations, enabling us to reason more explicitly about semantic overlap between items."

Embeddings give you a continuous similarity score. Semantic IDs give you categorical prefix overlap: two items are "similar at level 2" if they share their first two codes. This enables prefix-overlap-based penalties in feed diversification without needing a classifier taxonomy.
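The categorical comparison reduces to counting shared leading codes — a small sketch:

```python
def prefix_overlap(sid_a, sid_b):
    """Number of leading levels on which two Semantic IDs agree."""
    n = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        n += 1
    return n

# Two items that agree on levels 1 and 2 are "similar at level 2":
prefix_overlap((3, 17, 42), (3, 17, 99))  # → 2
```

Unlike cosine similarity, this is a discrete, thresholdable signal: overlap ≥ 2 can be treated as "same fine category" without picking a similarity cutoff.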

Use at Pinterest — Home Feed Blender (Q4 2025)

Pinterest added Semantic ID as an SSD diversification signal in Q4 2025. Canonical application: discourage recommending too many Pins with high Semantic ID prefix overlap via a penalty term. Pinterest's stated outcome: "improves both perceived diversity and engagement by reducing repeated content clusters."

Operationally:

  • Layered on top of embedding similarity — not a replacement for continuous similarity signals; complementary.
  • Prefix-overlap penalty — higher prefix overlap = stronger penalty.
  • Stable across re-embedding — unlike raw embedding similarity, Semantic ID prefixes are stable ID-like artefacts; two Pins with the same level-2 prefix are "of the same fine category" in a durable sense.
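A prefix-overlap penalty of this kind could be wired into a greedy reranker along these lines. This is a hypothetical sketch: Pinterest does not disclose its penalty function, so the linear form and the `alpha` weight are illustrative only.

```python
def prefix_overlap(sid_a, sid_b):
    """Number of leading levels on which two Semantic IDs agree."""
    n = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        n += 1
    return n

def diversified_rerank(candidates, alpha=0.1):
    """Greedy rerank: relevance score minus a prefix-overlap penalty.

    candidates: list of (score, semantic_id) pairs. Items with high
    prefix overlap against already-selected items are pushed down.
    """
    remaining = list(candidates)
    selected = []
    while remaining:
        def adjusted(item):
            score, sid = item
            penalty = sum(prefix_overlap(sid, s) for _, s in selected)
            return score - alpha * penalty
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        selected.append(best)
    return selected
```

With `alpha=0.1`, a slightly lower-scored item from a different cluster can outrank a near-duplicate of something already selected — the mechanism behind "reducing repeated content clusters".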

Production applications beyond diversification

Semantic IDs originate in generative retrieval (retrieve by generating the target item's ID autoregressively) but have broader recsys utility:

  • Diversification penalties (Pinterest Home Feed Blender — this post).
  • Generative retrieval (retrieve by producing the target item's Semantic ID token sequence).
  • Stable category surface for logging, monitoring, cold-start handling.
  • Hybrid retrieval keys — compose Semantic ID prefix filters with continuous-embedding ANN lookups.
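The hybrid-key pattern in the last bullet can be sketched as: restrict the candidate pool by Semantic ID prefix, then rank the survivors by continuous similarity. Brute-force cosine stands in for a real ANN index here; the function and parameter names are illustrative.

```python
import numpy as np

def hybrid_retrieve(query_emb, query_sid, index, prefix_len=1, k=5):
    """Filter candidates by Semantic ID prefix, then rank by cosine.

    index: list of (semantic_id, embedding) pairs — a brute-force
    stand-in for an ANN index partitioned by prefix.
    """
    # Categorical stage: keep only items sharing the coarse prefix
    pool = [(sid, emb) for sid, emb in index
            if sid[:prefix_len] == query_sid[:prefix_len]]

    # Continuous stage: rank the survivors by cosine similarity
    def cos(e):
        return float(np.dot(query_emb, e) /
                     (np.linalg.norm(query_emb) * np.linalg.norm(e) + 1e-9))
    pool.sort(key=lambda p: cos(p[1]), reverse=True)
    return pool[:k]
```

The prefix filter bounds the search to one semantic neighbourhood, which is also what makes the prefix a usable logging and monitoring key.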

Caveats

  • Codebook quality depends on embedding quality — garbage in, garbage out: the levels are only as semantically meaningful as the embedding model that produced the input vectors.
  • Cardinality budget — the product of the level cardinalities gives the total Semantic ID space (e.g. three levels of 256 codes yield 256³ ≈ 16.8M possible IDs). This must be tuned to the corpus: too fine and prefix overlap becomes rare and useless; too coarse and unrelated items are forced together.
  • Retraining / re-quantisation — codebook drift on retrain is a real operational problem; Semantic IDs aren't automatically stable across re-training runs without explicit stability constraints.
  • Pinterest discloses the integration but not the codebook sizes, depth, quantiser variant (e.g. RQ-VAE vs. VQ-VAE), or training corpus.
  • Not a human-interpretable taxonomy — codes are learned integers; they don't map to human-readable categories without post-hoc labeling.
