Semantic ID¶
Definition¶
Semantic ID is a hierarchical, discrete content representation derived through coarse-to-fine discretization of continuous content embeddings. Each item is assigned a tuple of integer codes — e.g. (c₁, c₂, c₃, ...) — where each code comes from a separate codebook at a different granularity, and a shared prefix indicates a shared semantic category. Popularised by the "Recommender Systems with Generative Retrieval" line of work (arXiv:2305.05065).
Structurally: take a content embedding, apply a learned vector quantiser at level 1 (broad classes — say 256 clusters), then at level 2 (finer clusters conditional on level-1 assignment), continuing to a chosen depth. The resulting prefix hierarchy gives a stable, category-like notion of semantics that plain embeddings lack.
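The coarse-to-fine procedure above can be sketched with residual quantisation, one common way to realise it (the RQ-VAE family). This is a minimal illustration, not Pinterest's quantiser: the codebooks here are random stand-ins for trained ones, and `assign_semantic_id` and the level sizes are assumptions made for the example.

```python
import numpy as np

def assign_semantic_id(embedding, codebooks):
    """Assign a coarse-to-fine code tuple via residual quantisation.

    codebooks: list of (K_level, dim) arrays, one per level.
    At each level, pick the nearest centroid, then quantise the residual,
    so later levels refine whatever the earlier levels could not capture.
    """
    codes = []
    residual = embedding.astype(np.float64)
    for book in codebooks:
        dists = np.linalg.norm(book - residual, axis=1)
        c = int(np.argmin(dists))          # nearest centroid at this level
        codes.append(c)
        residual = residual - book[c]      # pass the remainder to the next, finer level
    return tuple(codes)

rng = np.random.default_rng(0)
dim = 8
# three levels of 256 codes each -> up to 256^3 distinct Semantic IDs
codebooks = [rng.normal(size=(256, dim)) for _ in range(3)]
item = rng.normal(size=dim)
print(assign_semantic_id(item, codebooks))  # a 3-tuple of codes in [0, 256)
```

The assignment is deterministic for a fixed set of codebooks, which is what makes the resulting tuple usable as an ID.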
Why Semantic ID alongside embeddings¶
Canonical framing (Source: sources/2026-04-07-pinterest-evolution-of-multi-objective-optimization-at-pinterest-home):
"While embeddings are excellent at capturing how close two Pins are, they do not always provide a stable, category-like notion of semantics that is useful for controlling diversity. Semantic IDs provide a hierarchical representation derived through coarse-to-fine discretization of content representations, enabling us to reason more explicitly about semantic overlap between items."
Embeddings give you a continuous similarity score. Semantic IDs give you categorical prefix overlap: two items are "similar at level 2" if they share their first two codes. This enables prefix-overlap-based penalties in feed diversification without needing a classifier taxonomy.
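The "similar at level 2" notion reduces to counting shared leading codes. A minimal sketch (`prefix_overlap` is a name invented here, not from the source):

```python
def prefix_overlap(sid_a, sid_b):
    """Number of leading code positions two Semantic IDs share."""
    depth = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        depth += 1
    return depth

print(prefix_overlap((3, 41, 7), (3, 41, 200)))  # → 2: same fine (level-2) cluster
print(prefix_overlap((3, 41, 7), (9, 41, 7)))    # → 0: different coarse class
```

Note the second pair shares codes at positions 2 and 3, but overlap is prefix-based, so a mismatch at level 1 means no semantic overlap at all.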
Use at Pinterest — Home Feed Blender (Q4 2025)¶
Pinterest added Semantic ID as an SSD diversification signal in Q4 2025. Canonical application: discourage recommending too many Pins with high Semantic ID prefix overlap via a penalty term. Pinterest's stated outcome: "improves both perceived diversity and engagement by reducing repeated content clusters."
Operationally:
- Layered on top of embedding similarity — not a replacement for continuous similarity signals; complementary.
- Prefix-overlap penalty — higher prefix overlap = stronger penalty.
- Stable across re-embedding — unlike raw embedding similarity, which shifts whenever items are re-embedded, Semantic ID prefixes are stable ID-like artefacts within a fixed codebook version; two Pins with the same level-2 prefix are "of the same fine category" in a durable sense.
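A hedged sketch of how such a prefix-overlap penalty might compose during blending. The per-depth weight table and the additive form are assumptions made for illustration; Pinterest does not disclose the actual functional form or weights.

```python
def diversity_penalty(candidate_sid, selected_sids, weights=(0.2, 0.5, 1.0)):
    """Penalise a candidate by its prefix overlap with already-selected items.

    weights[d-1] is the (hypothetical) penalty for sharing a depth-d prefix,
    so deeper overlap means a stronger penalty, summed over selected items.
    """
    penalty = 0.0
    for sid in selected_sids:
        depth = 0
        for a, b in zip(candidate_sid, sid):
            if a != b:
                break
            depth += 1
        if depth > 0:
            penalty += weights[depth - 1]
    return penalty

selected = [(3, 41, 7), (9, 2, 5)]
print(diversity_penalty((3, 41, 200), selected))  # shares a depth-2 prefix -> 0.5
print(diversity_penalty((7, 7, 7), selected))     # no overlap -> 0.0
```

In a blender, this term would be subtracted (with some weight) from the candidate's relevance score, discouraging a run of same-cluster Pins without a hard filter.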
Production applications beyond diversification¶
Semantic IDs originate in generative retrieval (retrieve by generating the target item's ID autoregressively) but have broader recsys utility:
- Diversification penalties (Pinterest Home Feed Blender — this post).
- Generative retrieval (retrieve by producing the target item's Semantic ID token sequence).
- Stable category surface for logging, monitoring, cold-start handling.
- Hybrid retrieval keys — compose Semantic ID prefix filters with continuous-embedding ANN lookups.
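The hybrid-key idea in the last bullet can be sketched as a prefix filter followed by similarity ranking. Everything here is an assumption for illustration: the function name, the toy corpus, and the brute-force cosine ranking (a production system would use an ANN index for the second stage).

```python
import numpy as np

def hybrid_retrieve(query_emb, query_prefix, items, k=3):
    """Hypothetical hybrid lookup: restrict to items whose Semantic ID shares
    the query prefix, then rank the survivors by cosine similarity."""
    depth = len(query_prefix)
    pool = [(sid, emb) for sid, emb in items if sid[:depth] == query_prefix]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pool.sort(key=lambda it: cosine(query_emb, it[1]), reverse=True)
    return [sid for sid, _ in pool[:k]]

rng = np.random.default_rng(1)
# toy corpus of (Semantic ID, embedding) pairs
items = [((3, i % 4, i), rng.normal(size=8)) for i in range(20)]
query = rng.normal(size=8)
hits = hybrid_retrieve(query, (3, 1), items, k=3)
print(hits)  # every hit shares the (3, 1) prefix
```

The prefix filter acts as a cheap categorical pre-filter, with continuous similarity doing the fine-grained ranking inside the category.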
Caveats¶
- Codebook quality depends on embedding quality — garbage in, garbage out: the levels are only as semantically meaningful as the embedding model that produced the inputs.
- Cardinality budget — the product of the per-level cardinalities is the total Semantic ID space, and it must be tuned to the corpus: too fine and prefix overlap becomes rare and useless, too coarse and unrelated items are forced together.
- Retraining / re-quantisation — codebook drift on retrain is a real operational problem; Semantic IDs aren't automatically stable across re-training runs without explicit stability constraints.
- Pinterest discloses the integration but not the codebook sizes, depth, quantiser variant (e.g. RQ-VAE vs. VQ-VAE), or training corpus.
- Not a human-interpretable taxonomy — codes are learned integers; they don't map to human-readable categories without post-hoc labeling.
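The cardinality-budget caveat is a simple product. A quick check of the numbers, with hypothetical level sizes (Pinterest's are undisclosed):

```python
import math

def id_space(level_sizes):
    """Total Semantic ID space = product of per-level codebook sizes."""
    return math.prod(level_sizes)

# three levels of 256 codes each -> ~16.8M distinct Semantic IDs
print(id_space([256, 256, 256]))  # 16777216
```

Against a corpus of, say, a few hundred thousand items, a 16.8M-ID space leaves most codes unused and level-3 prefix collisions rare; against billions of items, many unrelated items would share full IDs. That trade-off is the tuning problem the caveat describes.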
Seen in¶
- sources/2026-04-07-pinterest-evolution-of-multi-objective-optimization-at-pinterest-home — canonical wiki instance. Semantic ID added to SSD diversification in Q4 2025 as stable category-like signal complementing embedding similarity.
Related¶
- concepts/vector-embedding — continuous-representation companion.
- concepts/feed-diversification — use case.
- concepts/soft-spacing-penalty — composes with Semantic-ID-prefix-overlap penalty in SSD.
- systems/pinterest-home-feed-blender — canonical wiki instance.
- sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference — related vector-embedding production work (non-Semantic-ID).