CONCEPT Cited by 1 source

Atomic product ID vs Semantic ID

Definition

A vocabulary-substrate trade-off in recsys retrieval: should the recommender's tokens be atomic product IDs (one token per item, opaque) or Semantic IDs (short codeword sequences, prefix-shared for semantically similar items)?

The trade-off is structural: each substrate makes different downstream architectures possible or economical.

The two substrates side-by-side

Property	Atomic product ID	Semantic ID
One token per	Item	Codeword (multiple per item)
Sequence length per item	1	K (typically 4)
Vocabulary size	Catalog-bounded (millions)	Codebook-bounded (thousands)
Encoding determinism	Trivial (item.id is the token)	Requires RQ-VAE
Semantic-similarity property	None — IDs are opaque	Prefix-sharing — similar items share early codewords
Cold-start (new items)	Hard — needs training history	Easy — codebook covers new items from day 1
Embedding-table size	Scales with catalog	Scales with codebook (Instacart reports 125× smaller)
Compatible with scoring retrieval	Yes (canonical)	Yes, but loses some benefit
Compatible with generative retrieval	Faces vocabulary bottleneck	Designed for it
Substrate stability	Stable (item.id doesn't change)	Requires codebook-versioning discipline
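
The vocabulary-size and embedding-table rows can be made concrete with a little arithmetic. The sketch below computes embedding-table parameter counts for each substrate. Every number in it (catalog size, codebook size, number of levels, embedding width) and the `Substrate` class are hypothetical and chosen only for illustration; they are not Instacart's figures, and the resulting ratio is not meant to reproduce the reported 125×. It also assumes one separate codebook per level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substrate:
    name: str
    vocab_size: int        # number of distinct tokens the model must embed
    tokens_per_item: int   # sequence length contributed by one item

    def embedding_params(self, dim: int) -> int:
        """Parameters in a plain vocab_size x dim embedding table."""
        return self.vocab_size * dim


# Hypothetical values, for illustration only (not from the source).
CATALOG_ITEMS = 2_000_000   # assumed catalog size
CODEBOOK_SIZE = 1024        # assumed codewords per level
LEVELS = 4                  # K in the table above
DIM = 64                    # assumed embedding width

atomic = Substrate("atomic", vocab_size=CATALOG_ITEMS, tokens_per_item=1)
# Assumption: one codebook per level, so the vocabulary is LEVELS * CODEBOOK_SIZE.
semantic = Substrate(
    "semantic", vocab_size=LEVELS * CODEBOOK_SIZE, tokens_per_item=LEVELS
)

atomic_params = atomic.embedding_params(DIM)
semantic_params = semantic.embedding_params(DIM)
ratio = atomic_params / semantic_params

print(f"atomic:   {atomic_params:,} params, {atomic.tokens_per_item} token/item")
print(f"semantic: {semantic_params:,} params, {semantic.tokens_per_item} tokens/item")
print(f"ratio:    {ratio:.1f}x")
```

The atomic table grows linearly with the catalog, while the Semantic ID table depends only on the codebook configuration. The price is visible in `tokens_per_item`: each item costs K tokens of sequence length instead of one.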

Why this is a load-bearing decision

The Instacart 2026-06-02 source frames the substrate change as the most structurally consequential part of the rebuild — the generative paradigm follows from the substrate, not the other way around:

"Before we could build this new retrieval system, we needed to change how we represented products."

The post then discloses the three benefits that follow from the substrate change (systems/instacart-semantic-ids):

Coverage to every item, regardless of purchase history. "A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one."
Generalisation over memorisation.
Embedding-parameter compression — 125×.
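
The coverage benefit can be illustrated with a minimal residual-quantization sketch: an item's embedding is mapped to the nearest codeword at each level, and the remainder is quantized at the next level. This is a toy under stated assumptions, not Instacart's implementation. The codebooks are hand-written rather than learned (an RQ-VAE would learn them jointly with an encoder), the item embeddings are made up, and the names `nearest`, `encode` and `CODEBOOKS` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def encode(codebooks: Sequence[Sequence[Vector]], embedding: Vector) -> Tuple[int, ...]:
    """Residual quantization: pick the nearest codeword at each level,
    subtract it, and quantize what is left at the next level."""
    residual = embedding
    sid: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return tuple(sid)


# Toy, hand-written codebooks (hypothetical); a real system learns them.
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],   # level 1: coarse region
    [(-1.0, 0.0), (1.0, 0.0)],    # level 2: finer offset within the region
]

# Made-up item embeddings.
oat_milk = (9.0, 10.2)        # existing item
almond_milk = (10.9, 9.9)     # existing item, similar to oat_milk
batteries = (0.8, 0.1)        # existing item, unrelated
new_soy_milk = (8.9, 10.1)    # new item with no purchase history

for name, emb in [("oat_milk", oat_milk), ("almond_milk", almond_milk),
                  ("batteries", batteries), ("new_soy_milk", new_soy_milk)]:
    print(name, encode(CODEBOOKS, emb))
```

In this toy, the two milk items share the first codeword and differ in the second, the unrelated item starts with a different codeword, and the new item lands on an SID that already exists. No purchase history is consulted at any point; only the item's embedding and the fixed codebooks are used.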

When atomic IDs remain right

Despite the structural advantages of Semantic IDs for generative retrieval, atomic IDs remain right under several conditions:

The model is downstream of retrieval, not retrieval itself. Ranking models (e.g. Carrot Ads pCTR) often consume atomic IDs because the candidate set is already small and per-item features are richer than codeword embeddings.
No suitable codebook can be learned. Items without learnable feature representations (sparse, low-cardinality, abstract) don't have meaningful semantic neighbourhoods, so RQ-VAE produces arbitrary codebooks. In that case it is better to stay with atomic IDs.
Stable IDs are required across model versions. Atomic IDs are fixed by definition; Semantic IDs require codebook-versioning discipline to avoid version skew when retrained.
Catalog is small and stationary. The vocabulary bottleneck is acute when catalogs are large and non-stationary. A small stationary catalog (e.g. a fixed product catalog with infrequent updates) doesn't trigger the failure mode.
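
The version-skew risk mentioned in the stable-IDs condition can be sketched as a guard that refuses to mix Semantic IDs from different codebook versions. This is one possible shape for such a guard, not an established practice and not something the source describes; `VersionedSID`, `SIDIndex` and `VersionSkewError` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class VersionedSID:
    """A Semantic ID tagged with the codebook version that produced it."""
    codebook_version: str
    codes: Tuple[int, ...]


class VersionSkewError(RuntimeError):
    """Raised when an SID from one codebook version meets another version."""


class SIDIndex:
    """Maps item IDs to Semantic IDs for exactly one codebook version."""

    def __init__(self, codebook_version: str) -> None:
        self.codebook_version = codebook_version
        self._by_item: Dict[str, VersionedSID] = {}

    def put(self, item_id: str, sid: VersionedSID) -> None:
        # Reject SIDs produced by a different codebook: the same codes
        # may denote a different semantic region after retraining.
        if sid.codebook_version != self.codebook_version:
            raise VersionSkewError(
                f"index is {self.codebook_version}, got {sid.codebook_version}"
            )
        self._by_item[item_id] = sid

    def get(self, item_id: str) -> VersionedSID:
        return self._by_item[item_id]


index = SIDIndex("v2")
index.put("item-1", VersionedSID("v2", (1, 0, 3, 7)))

try:
    index.put("item-2", VersionedSID("v1", (1, 0, 3, 7)))
except VersionSkewError as err:
    print("rejected:", err)
```

An atomic item ID needs no such tag, because its meaning does not depend on any learned artifact. That asymmetry is the stability advantage the condition above refers to.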

When Semantic IDs become inevitable

Three conditions where Semantic IDs are essentially required:

Generative retrieval is the goal. Atomic IDs as the generation vocabulary face the vocabulary bottleneck — you'd be generating tokens from a catalog-sized vocabulary, defeating the purpose.
Cold-start coverage of new catalog items is critical. Atomic IDs require historical sessions; Semantic IDs cover everything from day 1.
Embedding-parameter budget is a binding constraint. The 125× compression Instacart reports is what makes the GPU serving stack economically viable.

Caveats

The substrate change is necessary but not sufficient for the benefits — Instacart paired it with the generative retrieval model and the GPU serving stack (patterns/gpu-serving-stack-tensorrt-llm-triton). Substrate alone, without consuming-model and serving-stack changes, would not have produced the operational outcomes.
Substrate stability across codebook re-training is a non-trivial concern for Semantic IDs and does not have a well-published industry solution.
Some hybrid architectures (atomic IDs with embeddings derived from item features) sit between the two substrates and capture some Semantic-ID benefits without the cost of codebook discipline; these aren't covered in the 2026-06-02 source.

Seen in

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — Instacart's substrate change from atomic product IDs (in CR) to Semantic IDs (in generative ads retrieval).

concepts/semantic-id — the alternative substrate.
concepts/generative-retrieval — paradigm enabled by Semantic IDs.
concepts/vocabulary-bottleneck — failure mode of atomic IDs at non-stationary catalog scale.
concepts/cold-start — recsys cold-start axis.
systems/instacart-semantic-ids / systems/rq-vae / systems/tiger-generative-retrieval — the algorithmic substrate.