CONCEPT Cited by 1 source

Similarity-depth correlation

Definition

Similarity-depth correlation is an intrinsic-evaluation metric for a hierarchical-codebook Semantic ID system: the Spearman correlation between the continuous embedding similarity of a pair of items and the number of shared codebook levels in their discrete codes. High correlation means the codebook hierarchy faithfully reflects the underlying embedding similarity structure; low correlation means the discretization has corrupted the similarity signal.

Quote (Source: sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale):

"We measure the relationship between embedding similarity and shared semantic ID levels. Correlations of 0.69–0.84 confirm the semantic id hierarchy captures meaningful structure. Among highly similar pairs (≥0.9 cosine similarity), 98–99% share Level 1, declining to 18–37% at Level 4. This is expected, since Level 4 distinguishes between very similar products."

What it measures

For each pair of items (a, b) in a sampled set:

  1. Compute continuous embedding cosine similarity cos(e_a, e_b) from the upstream embedding (e.g. ESCI or ESCI+Gemma).
  2. Compute shared SID prefix depth = the number of leading codewords that match between SID(a) and SID(b) (0–K, where K is the codebook depth, typically 4).
  3. Compute Spearman correlation between the two columns across pairs.
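
The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Instacart's implementation: the function names, the toy embeddings, and the toy SIDs are all hypothetical, and Spearman is computed as the Pearson correlation of average ranks so that ties in the depth column are handled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shared_prefix_depth(sid_a, sid_b):
    """Number of leading codewords that match (0..K)."""
    depth = 0
    for x, y in zip(sid_a, sid_b):
        if x != y:
            break
        depth += 1
    return depth

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def similarity_depth_correlation(embeddings, sids, pairs):
    sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]      # step 1
    depths = [shared_prefix_depth(sids[a], sids[b]) for a, b in pairs]   # step 2
    return spearman(sims, depths)                                        # step 3

# Toy data (hypothetical): four items with 2-d embeddings and 4-level SIDs.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
sids = {"a": (0, 1, 2, 3), "b": (0, 1, 2, 4), "c": (1, 5, 6, 7), "d": (0, 2, 3, 1)}
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

rho = similarity_depth_correlation(embeddings, sids, pairs)
print(f"similarity-depth correlation: {rho:.2f}")
```

In practice the pair list would be a sample rather than all pairs, since the number of pairs grows quadratically with catalog size.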

What good values look like

Instacart reports Spearman correlations of 0.69–0.84 across its production codebooks. The post does not attribute individual values to the ESCI or ESCI+Gemma embedding; it gives a single range covering both variants.

A stratified breakdown for very-similar pairs (≥0.9 cosine similarity):

Codebook level    % of pairs sharing this level
L1                98–99%
L2                not reported (between the L1 and L4 values)
L3                not reported (between the L1 and L4 values)
L4                18–37%

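The expected shape is a fraction that decreases with depth: most similar pairs land in the same coarse neighborhood (L1) but are separated at the finest level (L4), which is the point of the hierarchy. The warning signs are low L1 sharing among highly similar pairs (a curve that is flat near zero) or, if levels are matched independently rather than as a prefix, an inverted curve (high L4 sharing despite low L1 sharing). Either suggests the codebook is producing pathological codes. Under strict prefix matching the fraction cannot increase with depth, so only the first symptom can occur.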
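
The stratified breakdown can be computed as below. This is a sketch under assumptions: the function name `level_sharing`, the `(similarity, sid_a, sid_b)` tuple layout, and the toy pairs are hypothetical. "Sharing level k" is taken to mean a shared prefix of at least k codewords.

```python
def shared_prefix_depth(sid_a, sid_b):
    """Number of leading codewords that match (0..K)."""
    depth = 0
    for x, y in zip(sid_a, sid_b):
        if x != y:
            break
        depth += 1
    return depth

def level_sharing(pairs, num_levels=4, min_sim=0.9):
    """Fraction of pairs with similarity >= min_sim whose SIDs share each level.

    pairs: iterable of (cosine_similarity, sid_a, sid_b).
    Returns {level: fraction}, or {} if no pair passes the similarity filter.
    """
    depths = [shared_prefix_depth(a, b) for sim, a, b in pairs if sim >= min_sim]
    if not depths:
        return {}
    return {
        level: sum(d >= level for d in depths) / len(depths)
        for level in range(1, num_levels + 1)
    }

# Toy pairs (hypothetical values); the last one is filtered out by min_sim.
toy_pairs = [
    (0.95, (0, 1, 2, 3), (0, 1, 2, 3)),
    (0.93, (0, 1, 2, 3), (0, 1, 2, 9)),
    (0.91, (0, 1, 2, 3), (0, 1, 7, 9)),
    (0.92, (0, 1, 2, 3), (0, 5, 7, 9)),
    (0.40, (0, 1, 2, 3), (4, 5, 6, 7)),
]

fractions = level_sharing(toy_pairs)
for level, frac in fractions.items():
    print(f"L{level}: {frac:.0%}")
```

With this prefix-based definition the fractions are non-increasing in depth by construction, so the informative signals are the absolute L1 value and how quickly the curve falls.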

Why this metric exists

Discrete-code substrates (Semantic IDs, BPE tokens, audio codes) have an evaluation gap: downstream task metrics (recall@k, add-to-cart rate, perplexity) measure end-to-end performance but mask systematic quality problems in the code substrate itself. A codebook can be producing bad codes that the consuming model partially compensates for, hiding the substrate quality issue until some downstream change exposes it.

The Instacart post is explicit:

"Evaluate codes directly. Downstream metrics can mask systematic quality problems. Intrinsic evaluation catches issues before they compound."

Similarity-depth correlation catches a specific class of failure: the codebook losing the embedding-space neighborhood structure during quantization (e.g. due to dead codewords, codebook collapse, or insufficient capacity).

Failure mode it surfaces

The Instacart post documents two divergent-code failure cases that intrinsic evaluation surfaced:

Pair                                   SIDs                        Cosine sim   Shared prefix
Two Riesling wines                     0_19_52_63 vs 0_31_52_88    0.86         L1 only (mismatched at L2)
Team t-shirt vs generic team apparel   1_19_21_20 vs 1_7_41_59     0.95         L1 only

In both cases the products had high embedding similarity but diverged at L2, immediately below the top of the codebook hierarchy. The stated root cause for both is sparse text input (one product had a detailed description, the other only four words): the upstream embedding placed them near each other, but the codebook quantization split them. Quote:

"sparse or inconsistent text leads to degraded embeddings, which lead to divergent codes."

The metric does not point at the root cause directly, but divergent pairs stand out as outliers in the similarity-versus-depth data (high cosine similarity, shallow shared prefix), which gives human investigation a starting point.
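
One way to surface such pairs for review is a simple filter on similarity and shared depth. The sketch below is an assumption about how this could be done, not the method described in the post: the thresholds `min_sim` and `max_depth`, the function names, and the third record are hypothetical, while the first two records reuse the SIDs and similarities from the table above.

```python
def shared_prefix_depth(sid_a, sid_b):
    """Number of leading codewords that match (0..K)."""
    depth = 0
    for x, y in zip(sid_a, sid_b):
        if x != y:
            break
        depth += 1
    return depth

def parse_sid(sid):
    """Split an underscore-delimited SID string such as '0_19_52_63'."""
    return tuple(sid.split("_"))

def divergent_pairs(records, min_sim=0.85, max_depth=1):
    """Return (name, similarity, depth) for pairs that are similar but split early.

    Both thresholds are illustrative and would need tuning per codebook.
    """
    flagged = []
    for name, sim, sid_a, sid_b in records:
        depth = shared_prefix_depth(parse_sid(sid_a), parse_sid(sid_b))
        if sim >= min_sim and depth <= max_depth:
            flagged.append((name, sim, depth))
    # Most similar pairs first: these are the most surprising splits.
    return sorted(flagged, key=lambda r: -r[1])

records = [
    ("riesling", 0.86, "0_19_52_63", "0_31_52_88"),
    ("team apparel", 0.95, "1_19_21_20", "1_7_41_59"),
    ("hypothetical healthy pair", 0.97, "2_4_8_15", "2_4_8_16"),
]

flagged = divergent_pairs(records)
for name, sim, depth in flagged:
    print(f"{name}: cosine={sim}, shared depth={depth}")
```

The flagged list is only a queue for manual inspection; deciding whether the cause is sparse text, a dead codeword, or something else still requires looking at the items.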

Relationship to other metrics

Intrinsic codebook evaluation has at least three distinct flavors; similarity-depth correlation is one. The Instacart post discloses all three running in their evaluation suite:

  • Similarity-depth correlation (this concept) — quantitative faithfulness of the hierarchy to the upstream embedding.
  • LLM-based cluster evaluation — qualitative scoring of leaf groups on functional coherence + purchase likelihood + customer journey relevance.
  • Taxonomy alignment — whether products sharing L1 share a top-level category (and the disagreements are studied as potential catalog audit signals via concepts/code-vs-label-mismatch-as-catalog-audit).

The three are complementary: one measures hierarchy faithfulness, one measures business semantics, one measures catalog-tree alignment. Together they form the intrinsic evaluation of discrete codes pattern.

Caveats

  • Sampling methodology is critical. Pair sampling skewed toward popular categories will give different correlations than balanced sampling. The post does not disclose the sampling protocol.
  • Spearman vs Pearson choice. Spearman is rank-based, so it captures any monotonic relationship, linear or not, which suits the discrete prefix-depth output (a step function of similarity with many ties). Pearson assumes a linear relationship and would be sensitive to the step-function shape.
  • Interpretation of 0.69–0.84. The post asserts these values "confirm the semantic id hierarchy captures meaningful structure" but doesn't supply baselines (random codebook? VQ-VAE baseline? prior Instacart system?).
  • Intrinsic ≠ downstream. A high similarity-depth correlation doesn't guarantee good downstream retrieval performance — it just rules out one class of substrate failure.
  • Doesn't catch label-vs-code disagreement. If the upstream embedding already encodes a wrong taxonomy (e.g. sparse text → bad placement), similarity-depth correlation can be high while the codes are still functionally wrong. Cluster-LLM-eval and taxonomy-alignment cover that gap.
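
The Spearman vs Pearson caveat can be illustrated with a small experiment. The data below are invented for illustration. The sketch applies a monotonic rescaling (cubing) to the similarity column: Spearman is unchanged because the ranks are unchanged, while Pearson moves because it depends on the actual spacing of the values.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Invented sample: similarities and the shared prefix depth of each pair.
sims = [0.1, 0.3, 0.5, 0.7, 0.85, 0.9, 0.95, 0.99]
depths = [0, 0, 1, 1, 2, 3, 3, 4]

# Cubing is strictly increasing on non-negative values, so ranks are preserved.
sims_cubed = [s ** 3 for s in sims]

spearman_raw = spearman(sims, depths)
spearman_cubed = spearman(sims_cubed, depths)
pearson_raw = pearson(sims, depths)
pearson_cubed = pearson(sims_cubed, depths)

print(f"Spearman: raw={spearman_raw:.4f}  cubed={spearman_cubed:.4f}")
print(f"Pearson:  raw={pearson_raw:.4f}  cubed={pearson_cubed:.4f}")
```

Note that ties in the depth column keep Spearman below 1 even for a perfectly monotonic relationship, which is one reason absolute values are hard to interpret without a baseline.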

Seen in

  • sources/2026-06-02-instacart-semantic-ids-product-understanding-at-scale — first canonical wiki disclosure: Spearman 0.69–0.84 between embedding cosine similarity and shared SID levels for Instacart's production RQ-VAE codebooks; ≥0.9-cosine pairs share L1 at 98–99% declining to 18–37% at L4; surfaced two divergent-code failure cases (Riesling, team apparel) with sparse-text root cause.