PATTERN Cited by 1 source

Intrinsic evaluation of discrete codes¶

Pattern¶

Evaluate discrete-code substrates (e.g. Semantic IDs, BPE token spaces, audio codebooks, learned vector quantizations) directly on the codes themselves — not just via downstream task metrics. Intrinsic evaluation catches systematic substrate-quality problems that downstream metrics mask.

Quote (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound."

Why intrinsic evaluation is necessary¶

Discrete-code substrates have an evaluation gap. Downstream metrics (recall@k, add-to-cart rate, perplexity, BLEU) measure end-to-end performance but mask substrate-quality problems:

A codebook can produce bad codes that the consuming model partially compensates for, hiding the substrate issue.
The compensation isn't free — it costs model capacity that could have been used for the actual prediction task.
The hidden substrate problem can resurface later: a model retrain, a downstream change, or a shift in input distribution exposes the masked failure.

The Instacart post is also explicit that a single quantitative check is not enough:

"Quantitative metrics tell us whether the hierarchy is structurally sound, but not whether the clusters make functional sense."

Intrinsic evaluation has two purposes:

Catch substrate failures early — before the consuming model masks them.
Discriminate between substrate variants — pick the right codebook for the right surface.

The Instacart triplet¶

Instacart runs three intrinsic metrics in parallel:

1. Quantitative: similarity-depth correlation¶

Spearman correlation between embedding similarity and shared-codebook-prefix depth. Measures hierarchy faithfulness — does the codebook respect the upstream embedding's neighborhood structure?

Instacart's production codebooks: Spearman 0.69–0.84 across pairs. Pairs with cosine similarity ≥ 0.9 share an L1 code 98–99% of the time, declining to 18–37% at L4 (the expected hierarchy shape). Outliers in the correlation reveal sparse-text, divergent-code failures, where highly similar products share only L1 (Riesling wines at 0.86 similarity; team apparel at 0.95 similarity).
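The correlation described above can be sketched in a few lines. This is a minimal, standard-library illustration, not Instacart's implementation: the function names, the toy embeddings, and the three-level codes are all assumptions made for the example, and a production run would sample pairs rather than enumerate all of them.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def shared_prefix_depth(code_a, code_b):
    """Number of leading code levels (L1, L2, ...) two items share."""
    depth = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        depth += 1
    return depth


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def similarity_depth_correlation(embeddings, codes):
    """Correlate pairwise embedding similarity with shared-prefix depth."""
    sims, depths = [], []
    for i, j in combinations(range(len(codes)), 2):
        sims.append(cosine(embeddings[i], embeddings[j]))
        depths.append(shared_prefix_depth(codes[i], codes[j]))
    return spearman(sims, depths)


# Toy data (hypothetical): two tight pairs of products in a 2-D space.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
codes = [(1, 4, 2), (1, 4, 7), (3, 0, 5), (3, 0, 5)]
rho = similarity_depth_correlation(embeddings, codes)
```

A pair with high similarity but a shallow shared prefix, like the Riesling example, would show up as a large gap between its similarity rank and its depth rank, which makes outliers easy to list alongside the aggregate score.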

2. Qualitative: LLM-based cluster evaluation¶

LLM judges score each leaf cluster on three dimensions:

Functional coherence (substitute-axis: do these products serve similar purposes?)
Purchase likelihood (co-purchase axis: would a customer buy these together?)
Customer journey relevance (context axis: do they fit the same shopping context?)

These scores are used to discriminate between codebook flavors: ESCI scores higher on substitutability, while ESCI+Gemma scores higher on thematic coherence, matching their intended use cases.
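The scoring loop behind this kind of evaluation might look like the sketch below. Everything here is assumed for illustration: the dimension keys, the 1–5 scale, the cluster IDs, and especially `stub_judge`, which is a deterministic placeholder standing in for a real LLM call so the example runs offline. The source does not describe Instacart's prompts or scoring scale.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical keys for the three dimensions listed above.
DIMENSIONS = ("functional_coherence", "purchase_likelihood", "journey_relevance")

# A judge maps a cluster's product names to one score per dimension.
Judge = Callable[[List[str]], Dict[str, float]]


def evaluate_clusters(clusters: Dict[str, List[str]], judge: Judge) -> Dict[str, Dict[str, float]]:
    """Score every leaf cluster on all dimensions, rejecting partial answers."""
    report = {}
    for cluster_id, products in clusters.items():
        scores = judge(products)
        missing = [d for d in DIMENSIONS if d not in scores]
        if missing:
            raise ValueError(f"judge omitted {missing} for cluster {cluster_id}")
        report[cluster_id] = {d: float(scores[d]) for d in DIMENSIONS}
    return report


def summarize(report: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Mean score per dimension, used to compare codebook variants."""
    return {d: mean(r[d] for r in report.values()) for d in DIMENSIONS}


def stub_judge(products: List[str]) -> Dict[str, float]:
    # Placeholder heuristic, NOT an LLM: products whose names end in the
    # same word are treated as substitutes. Replace with a real judge call.
    same_kind = len({p.split()[-1].lower() for p in products}) == 1
    return {
        "functional_coherence": 5.0 if same_kind else 2.0,
        "purchase_likelihood": 3.0,
        "journey_relevance": 4.0,
    }


# Hypothetical leaf clusters keyed by their code path.
clusters = {
    "12-4-9": ["whole milk", "oat milk"],
    "12-4-3": ["tortilla chips", "salsa"],
}
report = evaluate_clusters(clusters, stub_judge)
summary = summarize(report)
```

Running `summarize` once per codebook variant gives a per-dimension profile, which is the shape of comparison needed to say that one variant favors substitutability and another favors thematic coherence.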

3. Structural: taxonomy alignment¶

Compare codes to the existing taxonomy. Products sharing L1 should usually share a top-level category. Disagreements are investigated as either a codebook-failure signal (the code is wrong) or a catalog-audit signal (the codebook is right and the label is wrong).
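One simple way to operationalize this check is a majority-category purity score per L1 code, with the dissenting products queued for review. The sketch below assumes that framing; the function name, the tuple layout, and the toy catalog are hypothetical, and the source does not say which alignment statistic Instacart uses.

```python
from collections import Counter, defaultdict


def taxonomy_alignment(items):
    """items: iterable of (product_id, l1_code, top_level_category).

    Returns the share of products whose category matches the majority
    category of their L1 code, plus the products that disagree.
    """
    by_code = defaultdict(list)
    for product_id, l1_code, category in items:
        by_code[l1_code].append((product_id, category))

    agreeing = 0
    total = 0
    disagreements = []
    for l1_code, members in by_code.items():
        majority, _ = Counter(cat for _, cat in members).most_common(1)[0]
        for product_id, category in members:
            total += 1
            if category == majority:
                agreeing += 1
            else:
                # Either the code is wrong or the catalog label is wrong;
                # a human or downstream audit has to decide which.
                disagreements.append((product_id, l1_code, category, majority))
    return agreeing / total, disagreements


# Toy catalog (hypothetical IDs, codes, and categories).
catalog = [
    ("p1", 7, "Beverages"),
    ("p2", 7, "Beverages"),
    ("p3", 7, "Snacks"),
    ("p4", 2, "Produce"),
    ("p5", 2, "Produce"),
]
purity, to_review = taxonomy_alignment(catalog)
```

The `to_review` list is deliberately neutral about blame: each row records both the assigned category and the majority category so the same output can feed a codebook investigation or a catalog audit.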

Why three angles, not one¶

The three metrics measure complementary properties:

Metric	Catches	Misses
Similarity-depth correlation	Substrate hierarchy faithfulness; codebook collapse; dead codewords	Business-meaning failures (codes can be hierarchy-faithful but functionally wrong)
LLM cluster evaluation	Cluster business meaning; flavor character	Quantitative substrate-collapse cases; also subject to LLM-judge bias
Taxonomy alignment	Cross-validation against external structure; catalog mislabels	Errors in the taxonomy itself become signal noise

Together they form a defense-in-depth evaluation: a problem that slips past one metric is likely to be caught by another. A substrate that passes all three intrinsic metrics is much more likely to also work downstream, although this is not a guarantee.

Generalization beyond Instacart¶

The pattern applies wherever discrete-code substrates are used:

Substrate	Quantitative intrinsic	Qualitative intrinsic	Structural intrinsic
Semantic IDs (recsys)	Similarity-depth correlation	LLM cluster eval	Taxonomy alignment
BPE / tokenizer vocabulary	Compression ratio; token entropy	Human linguistic-validity check	Morphological alignment
Audio codebooks (VQ-VAE for speech)	Reconstruction loss; perplexity	Listening tests	Phoneme alignment
Image codebooks (VQ-VAE-2, dVAE)	Reconstruction FID; codebook usage	Human eval of generated images	Object-category alignment
Knowledge graph entity codes	Distance preservation	LLM relationship scoring	Class-hierarchy alignment

The substrate-agnostic insight: discrete-code substrates need direct evaluation along quantitative + qualitative + structural axes, not just downstream task metrics.
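Several of the quantitative entries in the table reduce to statistics over code assignments. As one substrate-agnostic example, the sketch below computes codebook utilization, dead codewords, and codebook perplexity (the exponentiated entropy of code usage, as commonly reported for vector-quantized models). The function name and the toy assignments are assumptions for illustration.

```python
import math
from collections import Counter


def codebook_usage(assignments, codebook_size):
    """Usage statistics for one codebook level.

    assignments: the code index chosen for each encoded item.
    codebook_size: total number of codewords available at this level.
    """
    counts = Counter(assignments)
    total = len(assignments)

    # Entropy of the empirical code distribution, in nats.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())

    return {
        # Fraction of codewords that were used at least once.
        "utilization": len(counts) / codebook_size,
        # Effective number of codewords in use; equals codebook_size
        # only when usage is perfectly uniform.
        "perplexity": math.exp(entropy),
        # Codewords never selected, a common symptom of collapse.
        "dead_codewords": sorted(set(range(codebook_size)) - set(counts)),
    }


# Toy example (hypothetical): 4 codewords, only two ever used.
stats = codebook_usage([0, 0, 1, 1], codebook_size=4)
```

A perplexity far below the codebook size suggests that much of the vocabulary is carrying no information, regardless of whether the codes represent products, audio frames, or image patches.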

When this pattern doesn't fit¶

Embedding-based recsys (no discrete codes) — there's no code substrate to evaluate intrinsically; downstream metrics are the only available signal.
Black-box external substrates — when the codebook is provided by an external service with no inspection access.
Domain without business-meaning ground truth — qualitative evaluation (LLM-as-judge or human review) requires some way of judging "right" clusters; in pure-engineering substrates this may be unavailable.

Caveats¶

Intrinsic ≠ downstream. Strong intrinsic-eval scores don't guarantee strong downstream results. The pattern catches one class of substrate failure, not all classes.
LLM-judge cost scales linearly with cluster count; sampling protocols matter for codebooks with very many clusters.
Domain-specific dimension design — Instacart's three LLM-eval dimensions are grocery-recsys-specific. Other domains need domain-specific dimension design.
Taxonomy-alignment is itself imperfect. The taxonomy can be wrong (the Instacart audit pipeline finds catalog mislabels); treating taxonomy alignment as a quality metric requires acknowledging the alignment-failures-as-audit duality.
No publicly disclosed thresholds — Instacart reports similarity-depth correlations of 0.69–0.84 as evidence of good hierarchy, without a baseline (random codebook? VQ-VAE baseline?). Calibrating thresholds for new applications is domain-specific.
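One way to supply the missing baseline is a shuffled-codebook null: reassign the same codes to items at random and recompute the metric, so an observed score can be read against what chance alone produces. The sketch below is an assumption about how such a calibration could be done, not something the source describes. `shuffled_baseline`, `neighbor_l1_agreement`, and the toy data are hypothetical, and the demo metric is a simple nearest-neighbor L1 agreement rather than the Spearman correlation itself; any function with the same `(embeddings, codes)` signature can be passed in.

```python
import random
from statistics import mean


def shuffled_baseline(metric, embeddings, codes, n_shuffles=200, seed=0):
    """Compare a metric on the real codebook against shuffled assignments.

    Returns (observed score, mean score under shuffling, empirical p-value).
    """
    rng = random.Random(seed)
    observed = metric(embeddings, codes)
    null_scores = []
    for _ in range(n_shuffles):
        permuted = list(codes)
        rng.shuffle(permuted)  # breaks the link between items and codes
        null_scores.append(metric(embeddings, permuted))
    # Share of shuffles that match or beat the real codebook (add-one smoothed).
    p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + n_shuffles)
    return observed, mean(null_scores), p_value


def neighbor_l1_agreement(embeddings, codes):
    """Fraction of items whose nearest neighbor (by dot product) shares L1."""
    hits = 0
    for i, u in enumerate(embeddings):
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: sum(a * b for a, b in zip(u, embeddings[j])),
        )
        hits += codes[i][0] == codes[best_j][0]
    return hits / len(embeddings)


# Toy data (hypothetical): two groups of three items each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
codes = [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
observed, null_mean, p_value = shuffled_baseline(
    neighbor_l1_agreement, embeddings, codes
)
```

Reporting the observed score next to the shuffled mean turns a bare number into a comparison, which is the piece the 0.69–0.84 range lacks.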

Seen in¶

sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — canonical wiki instance: Instacart's three-metric intrinsic evaluation suite for SID codebooks (similarity-depth correlation + LLM cluster eval + taxonomy alignment). Quote: "Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound."

concepts/similarity-depth-correlation — quantitative axis.
concepts/llm-based-cluster-evaluation — qualitative axis.
concepts/code-vs-label-mismatch-as-catalog-audit — structural axis (and its dual: code-vs-label disagreements as audit signal).
concepts/semantic-id — the substrate this evaluates.
concepts/llm-as-judge — broader pattern family for the qualitative axis.
systems/instacart-semantic-ids / systems/rq-vae — production substrate + algorithm.
patterns/rq-vae-codebook-as-product-vocabulary — broader vocabulary-substrate pattern this evaluates.
patterns/semantic-code-as-catalog-audit — the structural-axis dual.