CONCEPT Cited by 1 source

Vocabulary bottleneck¶

Definition¶

The vocabulary bottleneck is a structural failure mode of recsys retrieval architectures that score, or compute a probability distribution, over a fixed vocabulary of atomic item IDs. As the catalog grows, three costs grow with it, and none of them can be tuned away:

Model size — embedding tables sized to the vocabulary balloon.
Inference latency — the per-request scoring cost grows with vocabulary size.
Data sparsity for tail items — items with few historical sessions become noisy / undertrained tokens.
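A rough back-of-the-envelope sketch of the first two costs, assuming a dense embedding table with one row per atomic ID and a full dot-product scoring pass over the vocabulary. The embedding width of 128, the float32 parameters, and the function names are illustrative assumptions, not figures from the source; sparsity is not modeled here.

```python
def embedding_table_bytes(vocab_size: int, dim: int, bytes_per_param: int = 4) -> int:
    """Memory for one embedding table with one row per atomic item ID."""
    return vocab_size * dim * bytes_per_param


def output_scoring_flops(vocab_size: int, dim: int) -> int:
    """FLOPs to score one hidden state against every item ID.

    Each item costs one dot product of width `dim`, counted as
    `dim` multiplies plus `dim` adds.
    """
    return 2 * vocab_size * dim


if __name__ == "__main__":
    dim = 128  # assumed embedding width, not from the source
    for vocab in (1_000_000, 10_000_000, 50_000_000):
        gib = embedding_table_bytes(vocab, dim) / 2**30
        gflops = output_scoring_flops(vocab, dim) / 1e9
        print(f"{vocab:>11,} ids: table {gib:.2f} GiB, scoring {gflops:.2f} GFLOPs per request")
```

Both quantities are linear in vocabulary size under these assumptions, which is why shrinking the embedding width or quantizing parameters only changes the constant, not the growth.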

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"The CR model relies on atomic product IDs as distinct tokens, which establishes the boundaries of what the model can interpret and predict. While expanding this vocabulary enhances the model's ability to understand the detailed context of a user's session, it simultaneously increases model size and latency while creating data sparsity for less common items. Additionally this catalog is non-stationary. As new products are added to the catalog, the coverage gap keeps expanding."

The non-stationarity twist¶

What makes this a structural rather than tunable problem: the catalog is non-stationary. New products arrive faster than the training cycle can re-incorporate them. Even a model that currently covers the catalog will find its coverage shrinking between training runs — "the coverage gap keeps expanding". There is no static vocabulary size that solves this.
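A toy model of that shrinking coverage, assuming a vocabulary frozen at the last training snapshot and a constant arrival rate of new products (removals are ignored). The function name, the arrival rate, and the catalog size are hypothetical and chosen only for illustration.

```python
def coverage(trained_vocab: int, arrivals_per_day: float, days_since_snapshot: float) -> float:
    """Fraction of the live catalog that a frozen vocabulary can represent.

    Assumes every item in the training snapshot is still live and that
    new items arrive at a constant rate.
    """
    live_catalog = trained_vocab + arrivals_per_day * days_since_snapshot
    return trained_vocab / live_catalog


if __name__ == "__main__":
    # Hypothetical numbers: 1M items at snapshot time, 10k new items per day.
    for day in (0, 7, 30, 100):
        print(f"day {day:>3}: coverage {coverage(1_000_000, 10_000, day):.3f}")
```

Enlarging the vocabulary raises the starting point of this curve but does not change its direction, which is the sense in which no static vocabulary size solves the problem.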

Three escape routes¶

Three structural responses to the vocabulary bottleneck appear on the wiki:

1. Hash to bucketed embeddings¶

Map atomic product IDs to a fixed-size hashed embedding space. This bounds model size but loses specificity, because colliding products share one embedding. Used in classical recsys, but rarely in modern sequence-model retrieval, where collisions dominate.
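A minimal sketch of the hashing idea, assuming string product IDs and a stable hash (SHA-256 here, since Python's built-in `hash` is salted per process for strings). The helper names `bucket_for` and `collision_rate` are hypothetical.

```python
import hashlib
from typing import Iterable


def bucket_for(product_id: str, num_buckets: int) -> int:
    """Map a product ID to one of `num_buckets` embedding rows, deterministically."""
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def collision_rate(product_ids: Iterable[str], num_buckets: int) -> float:
    """Fraction of products that share a bucket with at least one other product."""
    ids = list(product_ids)
    if not ids:
        return 0.0
    counts: dict[int, int] = {}
    for pid in ids:
        bucket = bucket_for(pid, num_buckets)
        counts[bucket] = counts.get(bucket, 0) + 1
    collided = sum(c for c in counts.values() if c > 1)
    return collided / len(ids)


if __name__ == "__main__":
    catalog = [f"product-{i}" for i in range(100_000)]
    for buckets in (10_000, 100_000, 1_000_000):
        print(f"{buckets:>9,} buckets: {collision_rate(catalog, buckets):.1%} of products collide")
```

The table size is now fixed by `num_buckets` rather than by the catalog, at the price that a colliding pair of products is indistinguishable to the model.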

2. Two-tower with embeddings¶

Replace per-ID scoring with embedding-space ANN search. The embedding table still scales with the catalog, so this does not fully escape the bottleneck, but scoring moves out of the model graph into an ANN service. New items can be added to the index without retraining the query tower (assuming feature-derived item embeddings).
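A minimal sketch of the retrieval side, with exact brute-force search standing in for a real ANN service. The `ItemIndex` class and the two-dimensional vectors are hypothetical; in practice the item vectors would come from an item tower applied to item features, and the query vector from a query tower.

```python
import math
from typing import Sequence


def normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ItemIndex:
    """Exact nearest-neighbour index; a stand-in for an ANN service."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, item_id: str, vector: Sequence[float]) -> None:
        # Adding an item only touches the index, not the query model.
        self._vectors[item_id] = normalize(vector)

    def search(self, query: Sequence[float], k: int) -> list[str]:
        ranked = sorted(
            self._vectors.items(),
            key=lambda pair: dot(query, pair[1]),
            reverse=True,
        )
        return [item_id for item_id, _ in ranked[:k]]


if __name__ == "__main__":
    index = ItemIndex()
    index.add("milk", [1.0, 0.0])
    index.add("bread", [0.0, 1.0])
    query = [0.9, 0.1]  # hypothetical query-tower output
    print(index.search(query, k=1))
    index.add("oat_milk", [0.9, 0.1])  # new item, no retraining
    print(index.search(query, k=2))
```

The index still holds one vector per item, which is the sense in which this route is only a partial escape: storage grows with the catalog even though the model's output layer no longer does.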

3. Semantic IDs over a learned codebook¶

Replace atomic IDs with discrete codeword sequences from a fixed codebook. Vocabulary becomes codebook-bounded (~tens of thousands of codewords), not catalog-bounded (~tens of millions of products). This is the substrate change in Instacart's generative ads retrieval (2026-06-02) and TIGER (Google DeepMind, NeurIPS 2023).
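A minimal sketch of how an item vector can be turned into a codeword sequence by residual quantization. The codebooks below are tiny and hand-written purely for illustration; in a real system they would be learned, and nothing here reproduces Instacart's or TIGER's actual implementation. All function names are hypothetical.

```python
import math
from typing import Sequence

Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, vector: Sequence[float]) -> int:
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vector)),
    )


def semantic_id(vector: Sequence[float], codebooks: Sequence[Codebook]) -> tuple[int, ...]:
    """Quantize `vector` level by level, each level encoding the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


def vocabulary_size(codebooks: Sequence[Codebook]) -> int:
    """Number of distinct tokens the model must be able to emit."""
    return sum(len(cb) for cb in codebooks)


def addressable_ids(codebooks: Sequence[Codebook]) -> int:
    """Number of distinct codeword sequences those tokens can spell."""
    return math.prod(len(cb) for cb in codebooks)


if __name__ == "__main__":
    codebooks = [
        [[0.0, 0.0], [10.0, 10.0]],            # coarse level
        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # fine level
    ]
    print(semantic_id([10.9, 10.1], codebooks))
    print(vocabulary_size(codebooks), addressable_ids(codebooks))
```

The token vocabulary grows with the sum of the codebook sizes while the number of spellable IDs grows with their product, so a new product can be assigned a sequence of existing codewords without adding tokens.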

The vocabulary bottleneck is a sibling of, but not identical to:

Recsys cold-start — both involve new items being undertrained. Cold-start is broader (covers new-user, new-domain, new-partner cases too); the vocabulary bottleneck is specifically the vocabulary-as-fixed-set structural property that makes cold-start unsolvable in scoring architectures.
Data sparsity — broader category; the vocabulary bottleneck is one specific structural cause of data sparsity (tail items in a catalog-sized vocabulary).
Model-capacity vs vocabulary trade-off — the "increases model size and latency" arm of the bottleneck. Instacart's CR could in principle have served a larger vocabulary, but only at higher cost. That is what makes the trade-off structural: there is no free point on the curve.

Caveats¶

The bottleneck applies most sharply to scoring retrieval (probability-distribution-over-vocabulary). Generative retrieval over codebooks dissolves it; generative retrieval over atomic IDs would still face it (which is why TIGER pairs generation with RQ-VAE codebooks).
The bottleneck applies most sharply to non-stationary catalogs (e-commerce, news, social media). Stationary catalogs (legacy product catalogs with infrequent updates) face the bottleneck less acutely.
"Bottleneck" implies a single chokepoint; in practice the three costs (size, latency, sparsity) compound one another, so fixing any one of them without the others doesn't help much.

Seen in¶

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — canonicalised as the first of three structural ceilings of Instacart's prior CR scoring model that motivated the move to generative retrieval.

concepts/generative-retrieval — the paradigm that escapes the bottleneck.
concepts/semantic-id / concepts/atomic-product-id-vs-semantic-id — the vocabulary-substrate alternatives.
concepts/cold-start — sibling failure mode.
concepts/two-tower-architecture — partial-escape architecture.
systems/instacart-contextual-recommendations — production example that hit the bottleneck.
systems/instacart-generative-ads-retrieval — successor that dissolves it.