CONCEPT Cited by 1 source

Atomic product ID vs Semantic ID

Definition

A vocabulary-substrate trade-off in recsys retrieval: should the recommender's tokens be atomic product IDs (one token per item, opaque) or Semantic IDs (short codeword sequences, prefix-shared for semantically similar items)?

The trade-off is structural: each substrate makes different downstream architectures possible or economical.

The two substrates side-by-side

| Property | Atomic product ID | Semantic ID |
|---|---|---|
| One token per | Item | Codeword (multiple per item) |
| Sequence length per item | 1 | K (typically 4) |
| Vocabulary size | Catalog-bounded (millions) | Codebook-bounded (thousands) |
| Encoding determinism | Trivial (item.id is the token) | Requires RQ-VAE |
| Semantic-similarity property | None: IDs are opaque | Prefix-sharing: similar items share early codewords |
| Cold-start (new items) | Hard: needs training history | Easy: codebook covers new items from day 1 |
| Embedding-table size | Scales with catalog | Scales with codebook (Instacart reports 125× smaller) |
| Compatible with scoring retrieval | Yes (canonical) | Yes, but loses some benefit |
| Compatible with generative retrieval | Faces vocabulary bottleneck | Designed for it |
| Substrate stability | Stable (item.id doesn't change) | Requires codebook-versioning discipline |
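
The prefix-sharing property in the comparison above can be illustrated with a toy residual quantizer. This is a minimal sketch, not Instacart's implementation: it assumes the codebooks have already been learned (for example by RQ-VAE, whose training is not shown), and the codebook values, item names, and 2-D embeddings are all hypothetical. It uses two levels instead of the typical four to keep the arithmetic easy to follow.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[Vector]  # one quantization level: a list of codeword vectors


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codeword closest to x (squared Euclidean distance)."""
    def dist2(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def semantic_id(codebooks: List[Codebook], embedding: Vector) -> List[int]:
    """Residual quantization: pick one codeword per level, then quantize what is left."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


# Toy codebooks with hypothetical values; a real system would learn these.
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],             # level 0: coarse clusters
    [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],  # level 1: fine offsets
]

# Hypothetical item embeddings: two similar items and one unrelated item.
sid_oat_milk = semantic_id(CODEBOOKS, (10.9, 10.1))     # [1, 1]
sid_almond_milk = semantic_id(CODEBOOKS, (10.1, 10.9))  # [1, 2]
sid_batteries = semantic_id(CODEBOOKS, (0.8, 0.1))      # [0, 1]
```

The two similar items share the first codeword and differ only at the second level, while the unrelated item diverges at the first level. An atomic-ID vocabulary has no equivalent: each item would simply be its own opaque token.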

Why this is a load-bearing decision

The Instacart 2026-06-02 source frames the substrate change as the most structurally consequential part of the rebuild — the generative paradigm follows from the substrate, not the other way around:

"Before we could build this new retrieval system, we needed to change how we represented products."

The post then discloses the three benefits that follow from the substrate change (systems/instacart-semantic-ids):

  1. Coverage of every item, regardless of purchase history. "A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one."
  2. Generalisation over memorisation.
  3. Embedding-parameter compression — 125×.
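
The compression benefit follows from what each embedding table is indexed by. A rough sketch of the arithmetic, assuming one embedding table per codeword level and the same embedding dimension for both substrates (both assumptions; the source reports only the 125× figure, not the configuration behind it). The catalog size, codebook size, and dimension below are hypothetical and are not Instacart's numbers.

```python
def atomic_embedding_params(catalog_size: int, dim: int) -> int:
    """One embedding row per catalog item."""
    return catalog_size * dim


def semantic_embedding_params(levels: int, codebook_size: int, dim: int) -> int:
    """One embedding row per codeword, per level (assumes separate tables per level)."""
    return levels * codebook_size * dim


def compression_ratio(catalog_size: int, levels: int, codebook_size: int, dim: int) -> float:
    """How many times smaller the Semantic-ID embedding table is."""
    atomic = atomic_embedding_params(catalog_size, dim)
    semantic = semantic_embedding_params(levels, codebook_size, dim)
    return atomic / semantic


# Hypothetical configuration, for illustration only.
ratio = compression_ratio(catalog_size=2_000_000, levels=4, codebook_size=1024, dim=64)
```

Under these assumptions the dimension cancels, so the ratio reduces to catalog size divided by (levels × codebook size): the atomic table grows with every new item, while the Semantic-ID table stays fixed until the codebook itself changes.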

When atomic IDs remain right

Despite the structural advantages of Semantic IDs for generative retrieval, atomic IDs remain right under several conditions:

  • The model is downstream of retrieval, not retrieval itself. Ranking models (e.g. Carrot Ads pCTR) often consume atomic IDs because the candidate set is already small and per-item features are richer than codeword embeddings.
  • No suitable codebook can be learned. Items without learnable feature representations (sparse, low-cardinality, abstract) don't have meaningful semantic neighbourhoods, so RQ-VAE produces arbitrary codebooks. In that case it is better to stay with atomic IDs.
  • Stable IDs are required across model versions. Atomic IDs are fixed by definition; Semantic IDs require codebook-versioning discipline to avoid version skew when retrained.
  • The catalog is small and stationary. The vocabulary bottleneck is acute when catalogs are large and non-stationary. A small stationary catalog (e.g. a fixed product set with infrequent updates) doesn't trigger the failure mode.

When Semantic IDs become inevitable

Three conditions where Semantic IDs are essentially required:

  1. Generative retrieval is the goal. Atomic IDs as the generation vocabulary face the vocabulary bottleneck — you'd be generating tokens from a catalog-sized vocabulary, defeating the purpose.
  2. Cold-start coverage of new catalog items is critical. Atomic IDs require historical sessions; Semantic IDs cover everything from day 1.
  3. Embedding-parameter budget is a binding constraint. The 125× compression Instacart reports is what makes the GPU serving stack economically viable.
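
The conditions in this section and the previous one can be read as a rough checklist. The sketch below encodes them as a small helper; the field names, the function, and especially the order in which the checks are applied are assumptions of this sketch, since the source does not rank the conditions against each other. The stable-ID requirement is left out because it is a cost to weigh rather than a yes/no gate.

```python
from dataclasses import dataclass


@dataclass
class RetrievalContext:
    """Hypothetical checklist inputs; field names are illustrative, not from the source."""
    is_retrieval_stage: bool        # False for downstream rankers
    codebook_learnable: bool        # item features support a meaningful codebook
    generative_retrieval: bool      # the model generates item tokens
    cold_start_critical: bool       # new items must be retrievable from day 1
    embedding_budget_binding: bool  # embedding-table size is the limiting cost


def suggest_substrate(ctx: RetrievalContext) -> str:
    """Return 'atomic' or 'semantic'. The check order is an assumption of this sketch."""
    # Conditions under which atomic IDs remain right regardless of other pressures.
    if not ctx.is_retrieval_stage or not ctx.codebook_learnable:
        return "atomic"
    # Any one of the three conditions makes Semantic IDs essentially required.
    if ctx.generative_retrieval or ctx.cold_start_critical or ctx.embedding_budget_binding:
        return "semantic"
    # Otherwise (e.g. a small, stationary catalog) the bottleneck is not triggered.
    return "atomic"
```

For example, `suggest_substrate(RetrievalContext(True, True, True, False, False))` returns `"semantic"`, while a downstream ranker (`is_retrieval_stage=False`) gets `"atomic"` whatever the other flags say.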

Caveats

  • The substrate change is necessary but not sufficient for the benefits — Instacart paired it with the generative retrieval model and the GPU serving stack (patterns/gpu-serving-stack-tensorrt-llm-triton). Substrate alone, without consuming-model and serving-stack changes, would not have produced the operational outcomes.
  • Substrate stability across codebook re-training is a non-trivial concern for Semantic IDs and does not have a well-published industry solution.
  • Some hybrid architectures (atomic IDs with embeddings derived from item features) sit between the two substrates and capture some Semantic-ID benefits without the codebook discipline cost; these aren't covered in the 2026-06 source.
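
One simple form of codebook-versioning discipline is to carry the codebook version alongside every Semantic ID and reject mismatches at the model boundary. The sketch below is only a guard against version skew, not a solution to remapping IDs across retrained codebooks, which the caveat above notes has no well-published answer. All names here (`VersionedSemanticId`, `VersionSkewError`, `check_compatible`, the version strings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VersionedSemanticId:
    """A Semantic ID tagged with the codebook version that produced it."""
    codebook_version: str
    codes: Tuple[int, ...]


class VersionSkewError(ValueError):
    """Raised when an ID was encoded with a different codebook than the model expects."""


def check_compatible(model_codebook_version: str, sid: VersionedSemanticId) -> VersionedSemanticId:
    """Pass the ID through only if it matches the model's codebook version."""
    if sid.codebook_version != model_codebook_version:
        raise VersionSkewError(
            f"ID encoded with codebook {sid.codebook_version!r}, "
            f"model expects {model_codebook_version!r}"
        )
    return sid


# Hypothetical usage: the same codes mean different things under different codebooks.
current = VersionedSemanticId(codebook_version="v2", codes=(1, 7, 3, 0))
stale = VersionedSemanticId(codebook_version="v1", codes=(1, 7, 3, 0))
```

Atomic IDs need no such guard, since `item.id` means the same thing under every model version.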

Seen in

Last updated · 542 distilled / 1,571 read