CONCEPT Cited by 1 source
Atomic product ID vs Semantic ID¶
Definition¶
A vocabulary-substrate trade-off in recsys retrieval: should the recommender's tokens be atomic product IDs (one token per item, opaque) or Semantic IDs (short codeword sequences, prefix-shared for semantically similar items)?
The trade-off is structural: each substrate makes a different set of downstream architectures possible or economical.
The two substrates side-by-side¶
| Property | Atomic product ID | Semantic ID |
|---|---|---|
| One token per | Item | Codeword (multiple per item) |
| Sequence length per item | 1 | K (typically 4) |
| Vocabulary size | Catalog-bounded (millions) | Codebook-bounded (thousands) |
| Encoding step | Trivial (item.id is the token) | Requires a learned quantiser (RQ-VAE) |
| Semantic-similarity property | None — IDs are opaque | Prefix-sharing — similar items share early codewords |
| Cold-start (new items) | Hard — needs training history | Easy — codebook covers new items from day 1 |
| Embedding-table size | Scales with catalog | Scales with codebook (Instacart reports 125× smaller) |
| Compatible with scoring retrieval | Yes (canonical) | Yes, but loses some benefit |
| Compatible with generative retrieval | Faces vocabulary bottleneck | Designed for it |
| Substrate stability | Stable (item.id doesn't change) | Requires codebook-versioning discipline |
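The "semantic-similarity property" row can be made concrete with a toy example. This is a minimal sketch with invented item names, atomic IDs and codewords (none of them come from the source); it only illustrates that prefix overlap is meaningful for Semantic IDs while numeric closeness of atomic IDs is not.

```python
from typing import Dict, Sequence, Tuple

SemanticId = Tuple[int, ...]


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading codewords two Semantic IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical catalog entries. Atomic IDs are opaque integers:
# 48213 and 48214 are numerically adjacent but the items are unrelated.
atomic_ids: Dict[str, int] = {
    "oat_milk_1l": 48213,
    "oat_milk_2l": 907,
    "dish_soap": 48214,
}

# Hypothetical Semantic IDs with K = 4 codewords per item.
semantic_ids: Dict[str, SemanticId] = {
    "oat_milk_1l": (12, 7, 3, 0),
    "oat_milk_2l": (12, 7, 3, 5),
    "dish_soap": (40, 2, 9, 1),
}

similar = shared_prefix_len(semantic_ids["oat_milk_1l"], semantic_ids["oat_milk_2l"])
unrelated = shared_prefix_len(semantic_ids["oat_milk_1l"], semantic_ids["dish_soap"])
```

Here the two oat-milk items share their first three codewords and differ only in the last one, while the dish soap shares none.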
Why this is a load-bearing decision¶
The Instacart 2026-06-02 source frames the substrate change as the most structurally consequential part of the rebuild — the generative paradigm follows from the substrate, not the other way around:
"Before we could build this new retrieval system, we needed to change how we represented products."
The post then lists the three benefits that follow from the substrate change (systems/instacart-semantic-ids):
- Coverage to every item, regardless of purchase history. "A new product entering the catalog is added to one of the existing SIDs and is visible to the model from day one."
- Generalisation over memorisation.
- Embedding-parameter compression — 125×.
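The day-one coverage claim follows from the fact that an existing codebook can encode any item that has a content embedding, with no interaction history needed. The sketch below shows greedy residual quantisation against frozen codebooks in plain Python. It is an assumption-laden illustration: the source does not describe Instacart's assignment procedure, a real RQ-VAE learns the codebooks jointly with an encoder and decoder, and the codebooks, the embedding and the function names here are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def encode(codebooks: Sequence[Sequence[Vector]], embedding: Vector) -> Tuple[int, ...]:
    """Greedy residual quantisation with frozen codebooks.

    Each level picks the codeword nearest to what the previous levels
    left unexplained, then subtracts it from the residual.
    """
    residual: List[float] = list(embedding)
    sid: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid)


# Toy, hand-written codebooks (hypothetical): a coarse level and a fine level.
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
]

# A brand-new item with a content embedding but no purchase history.
new_item_sid = encode(CODEBOOKS, (10.9, 10.1))
```

The new item lands on an existing codeword sequence, so a model trained on that codebook can already emit it.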
When atomic IDs remain right¶
Despite the structural advantages of Semantic IDs for generative retrieval, atomic IDs remain right under several conditions:
- The model is downstream of retrieval, not retrieval itself. Ranking models (e.g. Carrot Ads pCTR) often consume atomic IDs because the candidate set is already small and per-item features are richer than codeword embeddings.
- No suitable codebook can be learned. Items without learnable feature representations (sparse, low-cardinality, abstract) have no meaningful semantic neighbourhoods, so RQ-VAE produces arbitrary codebooks. In that case it is better to stay with atomic IDs.
- Stable IDs are required across model versions. Atomic IDs are fixed by definition; Semantic IDs require codebook-versioning discipline to avoid version skew when retrained.
- Catalog is small and stationary. The vocabulary bottleneck is acute when catalogs are large and non-stationary. A small, stationary catalog (e.g. a fixed product set with infrequent updates) doesn't trigger the failure mode.
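The version-skew risk in the stable-IDs condition can be guarded against by carrying the codebook version alongside the codewords. The following is one possible sketch of such a guard, not a practice described in the source; the class, exception and version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VersionedSemanticId:
    """A Semantic ID tagged with the codebook version that produced it."""
    codebook_version: str
    codewords: Tuple[int, ...]


class CodebookVersionSkew(ValueError):
    """Raised when an ID and its consumer were built from different codebooks."""


def check_version(sid: VersionedSemanticId, model_codebook_version: str) -> VersionedSemanticId:
    """Return the ID unchanged if versions match, otherwise refuse to use it."""
    if sid.codebook_version != model_codebook_version:
        raise CodebookVersionSkew(
            f"ID from codebook {sid.codebook_version!r} "
            f"cannot be used with model trained on {model_codebook_version!r}"
        )
    return sid


sid_v1 = VersionedSemanticId(codebook_version="v1", codewords=(12, 7, 3, 0))

# Same version: the ID passes through untouched.
accepted = check_version(sid_v1, "v1")

# Different version: the same codewords may mean a different item, so reject.
try:
    check_version(sid_v1, "v2")
    skew_detected = False
except CodebookVersionSkew:
    skew_detected = True
```

An atomic ID needs no such tag, which is the stability advantage this section describes.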
When Semantic IDs become inevitable¶
Semantic IDs are essentially required under three conditions:
- Generative retrieval is the goal. Using atomic IDs as the generation vocabulary runs into the vocabulary bottleneck: the model would have to generate tokens from a catalog-sized vocabulary, which defeats the purpose.
- Cold-start coverage of new catalog items is critical. Atomic IDs require historical sessions; Semantic IDs cover everything from day 1.
- Embedding-parameter budget is a binding constraint. The 125× compression Instacart reports is what makes the GPU serving stack economically viable.
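The embedding-budget argument is simple arithmetic: an atomic-ID table has one row per item, while a Semantic-ID table has one row per codeword. The sketch below assumes a separate codebook per level and uses invented sizes. It is not Instacart's configuration and does not attempt to reproduce the reported 125× figure.

```python
def atomic_embedding_params(catalog_size: int, dim: int) -> int:
    """One embedding row per catalog item."""
    return catalog_size * dim


def semantic_embedding_params(levels: int, codebook_size: int, dim: int) -> int:
    """One embedding row per codeword, assuming one codebook per level."""
    return levels * codebook_size * dim


def compression_ratio(catalog_size: int, levels: int, codebook_size: int, dim: int) -> float:
    """How many times smaller the Semantic-ID table is than the atomic one."""
    return atomic_embedding_params(catalog_size, dim) / semantic_embedding_params(
        levels, codebook_size, dim
    )


# Hypothetical sizes, chosen only for illustration.
CATALOG_SIZE = 2_000_000   # items
LEVELS = 4                 # K codewords per item
CODEBOOK_SIZE = 1024       # codewords per level
DIM = 64                   # embedding width

atomic = atomic_embedding_params(CATALOG_SIZE, DIM)
semantic = semantic_embedding_params(LEVELS, CODEBOOK_SIZE, DIM)
ratio = compression_ratio(CATALOG_SIZE, LEVELS, CODEBOOK_SIZE, DIM)
```

The embedding width cancels out of the ratio, so the compression depends only on catalog size relative to the total number of codewords.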
Caveats¶
- The substrate change is necessary but not sufficient for the benefits — Instacart paired it with the generative retrieval model and the GPU serving stack (patterns/gpu-serving-stack-tensorrt-llm-triton). Substrate alone, without consuming-model and serving-stack changes, would not have produced the operational outcomes.
- Substrate stability across codebook re-training is a non-trivial concern for Semantic IDs and has no widely published industry solution.
- Some hybrid architectures (atomic IDs with embeddings derived from item features) sit between the two substrates and capture some Semantic-ID benefits without the cost of codebook discipline; these aren't covered in the 2026-06 source.
Seen in¶
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — Instacart's substrate change from atomic product IDs (in CR) to Semantic IDs (in generative ads retrieval).
Related¶
- concepts/semantic-id — the alternative substrate.
- concepts/generative-retrieval — paradigm enabled by Semantic IDs.
- concepts/vocabulary-bottleneck — failure mode of atomic IDs at non-stationary catalog scale.
- concepts/cold-start — recsys cold-start axis.
- systems/instacart-semantic-ids / systems/rq-vae / systems/tiger-generative-retrieval — the algorithmic substrate.