CONCEPT Cited by 1 source

Precision-vs-discovery codebook flavor¶

Definition¶

Precision-vs-discovery codebook flavor is a design axis for Semantic ID codebooks. The same RQ-VAE quantizer and contrastive-loss machinery can be trained on different upstream embeddings to produce codebooks with different cluster character: precision (tight substitute clusters) or discovery (broader thematic clusters).

The choice is per surface, not universal. A single recsys platform can run two parallel codebooks and route different consumers to different flavors based on which serving surface needs which character.

Quote (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"Neither is universally better. The key is matching the right flavor to the right surface."

The two flavors at Instacart¶

Instacart ships two codebook flavors:

Flavor	Upstream embedding	Cluster character	Use cases
ESCI (precision)	Raw product text → in-house ESCI search-relevance model (trained on query-product matching, Exact / Substitute / Complementary / Irrelevant)	Tight substitute clusters; e.g. Whole Bean Coffee (`0_8_55_72`) where every item is a medium roast from a different brand, interchangeable for any customer who wants whole bean coffee	Substitution, search, reordering
ESCI+Gemma (discovery)	Gemini Flash extracts structured attributes (product type, key ingredients, dietary tags, format) and strips marketing copy → Gemma (off-the-shelf) embeds the cleaned representation	Broader clusters that capture lifestyle and usage patterns	Homepage feeds, cross-selling, exploration

The architectural insight:

"The embedding is the decision. The RQ-VAE compresses whatever structure the embedding space gives it. Choose your embedding based on the business problem."

The downstream RQ-VAE + contrastive regularization machinery is identical between the two flavors. The flavor distinction lives entirely upstream — at the embedding-substrate choice.
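The point that the quantizer only compresses whatever it is given can be sketched with the residual-quantization step alone. This is a minimal illustration, not Instacart's implementation: it omits the RQ-VAE encoder, decoder, and contrastive loss, and the function name `semantic_id` and the toy two-level codebooks are assumptions made for the example.

```python
import numpy as np

def semantic_id(embedding, codebooks):
    """Greedy residual quantization: pick the nearest code at each level,
    subtract it, and quantize what is left at the next level."""
    residual = np.asarray(embedding, dtype=float)
    codes = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]
    return "_".join(str(code) for code in codes)

# Toy, hand-written codebooks (hypothetical; real ones are learned).
toy_codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),            # coarse level
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),  # fine level
]

# The same function serves either flavor; only the embedding passed in
# (and the codebooks trained on that embedding space) would differ.
print(semantic_id([10.0, 11.0], toy_codebooks))  # 1_2
```

Nothing in `semantic_id` knows which flavor it is serving, which is the sense in which the flavor distinction lives upstream.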

Why one substrate isn't enough¶

Recsys surfaces have structurally different needs:

Substitution (cart replacement, out-of-stock fallback, reordering) needs tight similarity: a customer wants Pecorino Romano; the substitute pool should be other hard Italian cheeses, not Italian-style spreads. The match has to feel like "the same thing, different brand".
Discovery (homepage feeds, cross-sell, exploration) needs broader, more associative clusters: a customer who bought Parmigiano shouldn't only see other parmesan cheeses; they should see olives, tapenade, deli trays — products that "inhabit the same culinary universe" even if they're functionally different.

A single codebook would force a compromise: one tight enough for substitution loses discovery breadth, and one broad enough for discovery blurs the boundaries between substitutes. The two-flavor design avoids the compromise by maintaining two parallel codebooks and routing each surface to the flavor it needs.
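Per-surface routing can be pictured as a plain lookup from surface to flavor. The sketch below is hypothetical: the post does not disclose how routing is configured, so the names `Flavor`, `SURFACE_FLAVOR`, and `flavor_for` are invented here, and the surface keys are taken from the use cases listed in the table above.

```python
from enum import Enum

class Flavor(Enum):
    PRECISION = "esci"
    DISCOVERY = "esci_gemma"

# Hypothetical routing table built from the use cases named in this page.
SURFACE_FLAVOR = {
    "substitution": Flavor.PRECISION,
    "search": Flavor.PRECISION,
    "reordering": Flavor.PRECISION,
    "homepage_feed": Flavor.DISCOVERY,
    "cross_sell": Flavor.DISCOVERY,
    "exploration": Flavor.DISCOVERY,
}

def flavor_for(surface: str) -> Flavor:
    """Return the codebook flavor configured for a serving surface."""
    try:
        return SURFACE_FLAVOR[surface]
    except KeyError:
        # No default is assumed: the source does not state one.
        raise ValueError(f"no codebook flavor configured for {surface!r}") from None

print(flavor_for("substitution").value)  # esci
```

Failing loudly on an unknown surface is a choice made for this sketch; a silent default would hide exactly the routing decision the design asks each surface to make.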

How the two flavors get measured¶

The Instacart post evaluates both flavors via LLM-based cluster evaluation on three dimensions, two of which are summarized here:

Dimension	ESCI	ESCI+Gemma
Functional coherence (substitute-axis)	Higher	Lower
Customer journey relevance (thematic)	Lower	Higher

Quote: "ESCI scores higher on substitutability; ESCI+Gemma excels at thematic coherence, matching their intended use cases."

This is the load-bearing validation that the design axis works as intended — the two flavors aren't just different in implementation, they produce measurably different cluster character that maps to the intended use cases.
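The comparison in the table amounts to averaging per-cluster scores per dimension and checking which flavor comes out ahead. The sketch below assumes the LLM judge emits a numeric score per cluster and dimension, which the post does not specify; `winner_per_dimension` is a hypothetical helper, and the numbers are made-up placeholders chosen only to exercise the function, not figures from the post.

```python
from statistics import mean

def winner_per_dimension(scores):
    """scores: {flavor: {dimension: [per-cluster scores]}}.
    Returns {dimension: flavor with the higher mean score}."""
    dimensions = next(iter(scores.values())).keys()
    return {
        dim: max(scores, key=lambda flavor: mean(scores[flavor][dim]))
        for dim in dimensions
    }

# Placeholder scores (invented for illustration only).
placeholder_scores = {
    "ESCI": {"functional_coherence": [4, 5], "journey_relevance": [2, 3]},
    "ESCI+Gemma": {"functional_coherence": [3, 3], "journey_relevance": [4, 5]},
}

print(winner_per_dimension(placeholder_scores))
```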

The LLM-attribute-extraction step (discovery flavor)¶

The discovery flavor's distinctive upstream step is the LLM attribute-extraction preprocessing:

Run the product through Gemini Flash (~10× faster, ~5× cheaper than full-size Gemini).
Extract structured attributes — product type, key ingredients, dietary tags, format.
Strip marketing copy and ESCI-style metadata.
Embed the cleaned representation with Gemma (off-the-shelf).

The hypothesis the post tests:

"The goal is to test whether a general-purpose model, given cleaner inputs, can capture nuances that a domain-specific model misses."

The LLM-attribute step is what makes the off-the-shelf embedding model competitive: rather than asking Gemma to make sense of noisy raw catalog text, the LLM does that work upstream and hands Gemma clean inputs. The added preprocessing cost is kept down by using Gemini Flash (~10× faster, ~5× cheaper than full-size Gemini).
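The four steps above can be sketched as a small pipeline. This is a sketch under stated assumptions: `extract` and `embed` are injected stand-ins for the Gemini Flash call and the Gemma embedder (no real client API is shown), and `ProductAttributes`, `to_embedding_input`, and the serialization format are hypothetical. Only the four attribute fields come from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProductAttributes:
    product_type: str
    key_ingredients: List[str] = field(default_factory=list)
    dietary_tags: List[str] = field(default_factory=list)
    format: str = ""

def to_embedding_input(attrs: ProductAttributes) -> str:
    """Serialize only the structured attributes; marketing copy never
    reaches the embedder because it is not part of ProductAttributes."""
    parts = [f"product type: {attrs.product_type}"]
    if attrs.key_ingredients:
        parts.append("key ingredients: " + ", ".join(attrs.key_ingredients))
    if attrs.dietary_tags:
        parts.append("dietary tags: " + ", ".join(attrs.dietary_tags))
    if attrs.format:
        parts.append(f"format: {attrs.format}")
    return "; ".join(parts)

def discovery_embedding(
    raw_text: str,
    extract: Callable[[str], ProductAttributes],  # stand-in for the LLM step
    embed: Callable[[str], List[float]],          # stand-in for the embedder
) -> List[float]:
    """Extract attributes from raw text, then embed the cleaned string."""
    return embed(to_embedding_input(extract(raw_text)))

# Hypothetical product, used only to show the cleaned representation.
example = ProductAttributes(
    "whole bean coffee", ["arabica beans"], ["organic"], "12 oz bag"
)
print(to_embedding_input(example))
```

Passing `extract` and `embed` as callables keeps the sketch independent of any particular model client and makes the cleaning step testable on its own.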

Generalization¶

The two-flavor design generalizes to any recsys / retrieval substrate with surfaces that need different similarity character:

Music — substitute (similar artist) vs discovery (mood / occasion / playlist context).
Video — substitute (next episode / similar film) vs discovery (themed collection / cross-genre).
Text (Q&A retrieval) — exact-answer vs related-question.

What stays constant: the codebook training algorithm, the contrastive loss, the catalog-structure supervision. What changes: the upstream embedding, meaning the model's training objective and how its inputs are prepared.

Caveats¶

Running two codebooks roughly doubles the codebook-maintenance cost. Training cadence, eval cadence, and version-stability discipline must all be duplicated.
Per-surface flavor routing is a config decision with no disclosed tooling — how Instacart's surfaces declare flavor preference, default behavior, and migration between flavors is not stated.
Hybrid use cases (precision + discovery) need explicit reconciliation. A homepage feed that occasionally wants substitution-quality recommendations would need both codebooks loaded; the post doesn't address mixed routing.
Beyond two flavors? The post stops at two. Whether more flavors (e.g. occasion-aware, dietary-constrained, brand-tier specific) would compose or require a different design isn't addressed.
Is cold-start coverage equivalent across flavors? Both flavors inherit codebook coverage for new items, but the post does not say whether they handle sparse-text products equally well. The Riesling and t-shirt failure cases were noted in general, not stratified by flavor.
Production routing strategy not specified. Does Instacart run two RQ-VAEs in parallel and switch per surface, or run one pipeline that produces both codebooks side by side?

Seen in¶

sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — first canonical wiki disclosure: ESCI (precision) and ESCI+Gemma (discovery) as Instacart's two-flavor codebook design. Same RQ-VAE + contrastive loss; different upstream embedding. Validated by LLM-cluster-evaluation showing flavor character matches intended use cases.

concepts/semantic-id — the substrate the design axis applies to.
concepts/llm-based-cluster-evaluation — the metric that validates the flavor distinction.
concepts/contrastive-regularization-with-catalog-structure — the training-time mechanism shared across flavors.
systems/instacart-semantic-ids — production instance running both flavors.
systems/instacart-esci-model — the precision-flavor upstream embedding.
systems/gemma — the discovery-flavor embedding model.
systems/gemini — Gemini Flash for attribute extraction.
patterns/two-flavor-codebook-precision-vs-discovery — canonical pattern.
patterns/llm-attribute-extraction-before-embedding — the preprocessing the discovery flavor depends on.
patterns/rq-vae-codebook-as-product-vocabulary — the broader pattern this design axis fits within.