PATTERN Cited by 1 source

LLM extraction cache by similarity

Intent

Avoid re-running expensive LLM extractions for items that are "similar enough" to ones already extracted, by maintaining a cache keyed by a similarity function (not an exact-match hash) and returning the cached extraction when a new item hits.

This is one variant of LLM approximation — trading some quality risk (the similar item might legitimately have a different attribute value) for large cost savings on catalogs where many items share attribute values (same brand + flavor, same pack size, same "organic" status).

When to use

  • Large catalog / document corpus with significant redundancy — many items share the same attribute value because they're variants of the same underlying thing (pack sizes, colors, flavors of one product family).
  • Per-extraction LLM cost is high relative to per-hit cache cost.
  • Occasional cache-false-positive (wrong attribute served from cache) is tolerable OR can be corrected via the production random-sample HITL loop.

Mechanism

extract_with_cache(item, attribute):
    neighbor = similarity_search(item, cache[attribute])
    if neighbor and similarity(item, neighbor) > τ:
        return cache_value(neighbor)
    else:
        v = LLM.extract(item, attribute)
        cache[attribute].insert(item, v)
        return v

The similarity function is the load-bearing component. Candidates:

  • Embedding-nearest-neighbor on product title / description + image embeddings.
  • Duplicate-product-detection classifier (Instacart explicitly flags "ongoing work in duplicate product detection" as the similarity signal).
  • Schema-aware similarity: same brand + same product family + same flavor → same sheet_count (these attributes are deterministic under that equivalence).
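The mechanism above can be sketched as runnable Python. The similarity function here is a deliberately crude stand-in (token Jaccard overlap on titles) so the sketch is self-contained; a production system would use embedding nearest-neighbor or the duplicate-product classifier. `SimilarityCache` and `llm_extract` are hypothetical names, not from the post.

```python
from dataclasses import dataclass, field

TAU = 0.8  # similarity threshold τ; tune per attribute


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: token Jaccard overlap on titles.
    Production would use embeddings or a duplicate-product classifier."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class SimilarityCache:
    # attribute -> list of (cached item, extracted value)
    entries: dict = field(default_factory=dict)
    llm_calls: int = 0

    def extract(self, item: str, attribute: str, llm_extract) -> str:
        # Find the nearest cached neighbor for this attribute.
        best_value, best_sim = None, 0.0
        for cached_item, value in self.entries.get(attribute, []):
            s = similarity(item, cached_item)
            if s > best_sim:
                best_value, best_sim = value, s
        if best_value is not None and best_sim > TAU:
            return best_value  # cache hit: no LLM call
        # Cache miss: pay for the LLM extraction and remember the result.
        self.llm_calls += 1
        value = llm_extract(item, attribute)
        self.entries.setdefault(attribute, []).append((item, value))
        return value
```

With this sketch, two near-identical pack-size variants collide in the cache and only the first pays for an LLM call.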

Why it's hard

The correctness bar is "same attribute value for similar items", not "similar items" on general features. Two products can be embedding-similar (same category, same brand family) and have different attribute values (one is sugar-free, one isn't) — the cache would serve wrong values. The post acknowledges this:

"For this approach to succeed, we will need to define a similarity function that is able to help determine if two products have the same attribute values. This will be a challenging problem but there is ongoing work in duplicate product detection that we can take advantage of."

Generic similarity (e.g. product embedding cosine similarity) is unsafe on attribute-dependent axes. You need a similarity function that's attribute-aware — or a similarity function so strict that the hit rate drops to near-exact-duplicates only.
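One way to make the "so strict it's safe" end of that spectrum concrete is to replace fuzzy similarity with a schema-aware equivalence key: attributes that are deterministic given (brand, family, variant) get an exact-match key over those fields, and everything else falls back to near-exact title identity. Field names (`brand`, `family`, `variant`) and the `sheet_count` routing are hypothetical, chosen to mirror the example in the list above.

```python
def attribute_key(product: dict, attribute: str) -> tuple:
    """Schema-aware equivalence key: two products with the same key are
    assumed to share this attribute's value (hypothetical field names)."""
    if attribute == "sheet_count":
        # sheet_count is deterministic given brand + family + variant,
        # so pack-size siblings collide safely.
        return (product["brand"].lower(), product["family"].lower(),
                product["variant"].lower())
    # No safe equivalence known: fall back to near-exact identity.
    return (product["title"].lower(),)


exact_cache: dict = {}


def extract(product: dict, attribute: str, llm_extract):
    key = (attribute, attribute_key(product, attribute))
    if key not in exact_cache:
        exact_cache[key] = llm_extract(product, attribute)
    return exact_cache[key]
```

This trades hit rate for safety: only attributes with a known deterministic key benefit, but false positives on those attributes drop to schema errors rather than embedding noise.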

Cost ceiling: what fraction of the catalog can the cache hit?

Hit rate is bounded by the catalog's attribute-value redundancy — how many SKUs share an attribute value? For many e-commerce attributes this is high (thousands of SKUs are "organic: true"), but the similarity axis that identifies them isn't available a priori. You're trading cache-miss LLM cost for cache-build-and-maintenance cost (embedding pipeline, similarity index, cache freshness).
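The cost tradeoff can be put on the back of an envelope. Both unit costs below are invented for illustration; the structure is the point: every item pays the lookup cost, only misses pay the LLM cost, so the break-even hit rate is the ratio of the two.

```python
llm_cost = 0.002      # $ per LLM extraction (assumed)
cache_cost = 0.00005  # $ per lookup: embedding + ANN query (assumed)


def cost_per_item(hit_rate: float) -> float:
    # Every item pays the lookup; only misses pay the LLM.
    return cache_cost + (1 - hit_rate) * llm_cost


# Break-even: cache_cost + (1 - h) * llm_cost = llm_cost  =>  h = cache_cost / llm_cost
break_even = cache_cost / llm_cost  # 2.5% hit rate under these assumptions
```

Under these numbers even a tiny hit rate pays off, but the real constraint is the false-positive rate at a given hit rate, not the break-even itself.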

Tradeoffs / gotchas

  • Wrong cache hits are invisible failures. If the similarity function has a false-positive rate of even 1%, a catalog of 10M SKUs ships 100K wrong attribute values with no flagging — no LLM call, no confidence score. Pair with a random-sample HITL audit (patterns/human-in-the-loop-quality-sampling).
  • Cache staleness. If the LLM / prompt gets a quality improvement, cached values from the old prompt are retained — quality drift, silently. Stamp cache entries with prompt-version + model-version and invalidate on change.
  • Cold-start catalogs don't benefit. A cache pays off after enough items have been extracted that similar items start colliding; a new catalog has ~0% hit rate.
  • Attribute interaction. Caching per attribute is straightforward; caching an "extraction result for all N attributes of this product" is one cache entry only if every attribute is independently cache-hittable, which is rarely true.
  • Embedding drift. If you change the embedding model, all cache keys invalidate — plan re-index and transition path.
  • Cheaper than cache: cascade + batching first. For many catalogs, concepts/llm-cascade and patterns/multi-attribute-multi-product-prompt-batching reduce cost more reliably, with less quality risk. Cache-by-similarity is the advanced move after those.
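The staleness bullet above suggests a concrete fix: stamp each cache entry with prompt and model versions, and treat mismatched entries as misses. A minimal sketch, with hypothetical version identifiers:

```python
from dataclasses import dataclass

PROMPT_VERSION = "v7"       # bump whenever the extraction prompt changes
MODEL_VERSION = "model-x"   # hypothetical model identifier


@dataclass
class CacheEntry:
    value: str
    prompt_version: str
    model_version: str


def lookup(cache: dict, key):
    entry = cache.get(key)
    if entry is None:
        return None
    # Entries from an older prompt/model are treated as misses and
    # evicted, so they get re-extracted instead of silently serving
    # stale values.
    if (entry.prompt_version, entry.model_version) != (PROMPT_VERSION, MODEL_VERSION):
        del cache[key]
        return None
    return entry.value
```

Lazy invalidation on lookup avoids a bulk purge on every prompt change; entries from the old prompt simply age out as they're touched.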

Seen in

  • sources/2025-08-01-instacart-scaling-catalog-attribute-extraction-with-multi-modal-llms — named as future work for PARSE's cost-reduction roadmap. Specifically described as "LLM approximation" following the AutoMix reference [2]: "Another idea of avoiding redundant LLM prompt processing is to ensure we only ask the LLM to process completely new products. To accomplish this, we will first store all previous attribute extraction results in a cache. Then to extract an attribute for a new product, we first verify if the attribute has been extracted for a similar product previously. If not, we query the LLM as before. But if yes, the extraction result will be retrieved from the cache and returned to save the cost." Explicitly flags duplicate-product-detection as the similarity-function direction.