
PATTERN Cited by 1 source

Embedding-based name resolution

Pattern

For a library whose symbol namespace churns (icons, emoji shortcodes, component library exports, translation keys, etc.), maintain an embedding of every live export name in a vector database. When the LLM emits a symbol that doesn't exist in the current library, resolve it to the nearest real export via embedding similarity and rewrite the reference to use that real symbol.

Canonical Vercel mechanism (verbatim)

"1. Embed every icon name in a vector database. 2. Analyze actual exports from lucide-react at runtime. 3. Pass through the correct icon when available. 4. When the icon does not exist, run an embedding search to find the closest match. 5. Rewrite the import during streaming."

(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)
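Steps 3 and 4 can be sketched as a pass-through check followed by a nearest-neighbour search. This is a minimal illustration, not Vercel's implementation: the names resolveSymbol, IndexedName, toyCorpus and toyEmbed are hypothetical, the three-dimensional vectors are invented stand-ins for real embeddings, and a linear scan stands in for a vector-database query.

```typescript
type Vec = number[];
interface IndexedName { name: string; vector: Vec }

// Cosine similarity between two vectors of equal length.
function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Step 3: pass a live export through unchanged.
// Step 4: otherwise return the indexed name closest to the query embedding.
function resolveSymbol(
  symbol: string,
  liveExports: Set<string>,
  corpus: IndexedName[],
  embed: (symbol: string) => Vec, // hypothetical embedding function
): string {
  if (liveExports.has(symbol)) return symbol;
  if (corpus.length === 0) throw new Error("no names indexed");
  const query = embed(symbol);
  let best = corpus[0];
  let bestScore = cosine(query, best.vector);
  for (const entry of corpus) {
    const score = cosine(query, entry.vector);
    if (score > bestScore) { best = entry; bestScore = score; }
  }
  return best.name;
}

// Invented toy data: real embeddings have hundreds of dimensions.
const toyCorpus: IndexedName[] = [
  { name: "Triangle", vector: [0.9, 0.1, 0.0] },
  { name: "ShoppingCart", vector: [0.0, 0.2, 0.9] },
];
const toyExports = new Set<string>(["Triangle", "ShoppingCart"]);
const toyEmbed = (symbol: string): Vec =>
  symbol === "VercelLogo" ? [0.8, 0.3, 0.1] : [0.1, 0.1, 0.8];
```

With this toy data, resolveSymbol("VercelLogo", toyExports, toyCorpus, toyEmbed) returns "Triangle", while resolveSymbol("Triangle", ...) passes through untouched.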

Worked example

Model emits:

import { VercelLogo } from 'lucide-react'

VercelLogo doesn't exist in lucide-react. Steps 2-5:

  • Runtime-check exports → VercelLogo absent.
  • Embed VercelLogo with the same embedding model used for the pre-indexed library corpus.
  • Nearest neighbour by cosine similarity → Triangle (the visual / semantic approximation a designer might use for Vercel's ▲ logo).
  • Rewrite to alias the real export:
import { Triangle as VercelLogo } from 'lucide-react'

The downstream JSX referencing <VercelLogo /> compiles unchanged because the local identifier is preserved.
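The aliasing rewrite itself might look like the sketch below. It assumes single-line named imports only (no multi-line specifier lists, no inline type modifiers), and both rewriteImport and its resolve callback are hypothetical names introduced here for illustration.

```typescript
// Rewrites the named imports of one module so that every missing name
// becomes an alias of a real export. Lines it does not recognise are
// returned unchanged.
function rewriteImport(
  line: string,
  moduleName: string,
  liveExports: Set<string>,
  resolve: (missing: string) => string, // hypothetical nearest-export lookup
): string {
  // Groups: 1 = "import {", 2 = specifier list, 3 = "} from '...'" plus tail, 4 = module name.
  const match = /^(\s*import\s*\{)([^}]*)(\}\s*from\s*['"]([^'"]+)['"].*)$/.exec(line);
  if (!match || match[4] !== moduleName) return line;

  const specifiers = match[2]
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((spec) => {
      // "Foo as Bar" imports Foo under the local name Bar.
      const parts = spec.split(/\s+as\s+/);
      const imported = parts[0];
      const local = parts.length > 1 ? parts[1] : imported;
      const real = liveExports.has(imported) ? imported : resolve(imported);
      // Keep the local identifier so downstream references still compile.
      return real === local ? real : `${real} as ${local}`;
    });

  return `${match[1]} ${specifiers.join(", ")} ${match[3]}`;
}
```

Only the specifier list is regenerated; the quote style and any trailing semicolon are carried over from the original line.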

Why embeddings over string-distance

For large symbol spaces, string-distance heuristics (Levenshtein, prefix-match) fail precisely where hallucination is most likely:

  • VercelLogo vs Triangle: the names share almost no characters, so string distance is near-maximal; embedding distance is modest (both are geometric-shape / brand-adjacent concepts).
  • ShoppingCart vs ShoppingBag: string distance is low; embedding distance is low; both approaches agree.
  • GitMerge vs GitBranch: the shared Git prefix keeps string distance fairly low; embedding distance is low; the approaches agree.
  • FigmaLogo (hallucinated) vs …: embedding picks a plausible real logo; string distance picks whichever symbol happens to overlap most in spelling.

Embeddings encode semantic similarity in a way string distance can't; the model's hallucination is typically semantically-related-but-wrong (it has a concept; it's just picked a wrong name for it), so semantic nearest-neighbour is the right primitive.
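The string-distance failure mode is easy to reproduce. The sketch below implements plain Levenshtein distance and picks the lexically nearest name from a small candidate list; that list is hypothetical (Parcel in particular is included only for its spelling overlap, not because it is a known lucide-react export).

```typescript
// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns the candidate with the smallest edit distance (first one wins ties).
function nearestByEditDistance(symbol: string, candidates: string[]): string {
  let best = candidates[0];
  for (const candidate of candidates) {
    if (levenshtein(symbol, candidate) < levenshtein(symbol, best)) best = candidate;
  }
  return best;
}
```

For the list ["Triangle", "Parcel", "ShoppingCart"], nearestByEditDistance("VercelLogo", ...) returns "Parcel" (distance 6), because "Vercel" and "Parcel" share four letters in sequence. Triangle, the semantically sensible answer, is at least 8 edits away.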

Prerequisites

  1. Stable symbol embeddings. Pre-embed every current symbol with one consistent embedding model, and re-embed on each library release (weekly for a frequently churning library such as lucide-react).
  2. Runtime library-export analysis. You need the ground truth of what exists now; this can't come from the model's parametric knowledge (which is stale by assumption).
  3. Aliasing / renaming syntax in the host language. The rewrite pattern depends on the target language supporting import { Real as Hallucinated } or equivalent. TypeScript / JavaScript does; not every ecosystem does.
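Prerequisites 1 and 2 meet in the index refresh: the live export list is the ground truth, and the index is reconciled against it. The sketch below is one way to do that incrementally, assuming an in-memory Map as the index and a synchronous, hypothetical embed function (a real embedding call would normally be asynchronous). In a JavaScript project the live names could come from Object.keys on the imported module namespace, which would also include any non-icon exports.

```typescript
type Vec = number[];

// Reconciles the index with the current export list: drops names that no
// longer exist and embeds names that are new. Existing vectors are reused,
// which is only valid while the embedding model stays the same.
function refreshIndex(
  index: Map<string, Vec>,
  liveNames: string[],
  embed: (symbol: string) => Vec, // hypothetical embedding function
): { added: string[]; removed: string[] } {
  const live = new Set<string>(liveNames);

  const removed: string[] = [];
  index.forEach((_vector, symbol) => {
    if (!live.has(symbol)) removed.push(symbol);
  });
  for (const symbol of removed) index.delete(symbol);

  const added: string[] = [];
  for (const symbol of liveNames) {
    if (!index.has(symbol)) {
      index.set(symbol, embed(symbol));
      added.push(symbol);
    }
  }
  return { added, removed };
}
```

If the embedding model changes, the incremental path no longer applies and the whole index needs rebuilding, since vectors from different models are not comparable.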

Latency profile

Per the Vercel disclosure:

  • <100 ms per substitution.
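  • No further model calls: the rewrite is an embedding lookup plus a runtime export check. The export check is a constant-time set lookup; the nearest-neighbour search is sublinear with an approximate index and still fast as a linear scan over a modest corpus.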
  • Fits inside a token-stream rewrite pipeline without stalling the stream.
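One way to fit a rewrite into a token stream is to hold back only the current partial line and forward each line as soon as it completes. The source does not describe Vercel's pipeline at this level, so the sketch below is an assumption: LineRewriter and its rewriteLine callback are hypothetical names, and imports are assumed to sit on a single line.

```typescript
// Buffers streamed chunks, applies a per-line rewrite to every completed
// line, and forwards the result. Only the trailing partial line is held back.
class LineRewriter {
  private buffer = "";
  private rewriteLine: (line: string) => string;

  constructor(rewriteLine: (line: string) => string) {
    this.rewriteLine = rewriteLine;
  }

  // Accepts one streamed chunk and returns the text that can be forwarded now.
  push(chunk: string): string {
    this.buffer += chunk;
    const lastNewline = this.buffer.lastIndexOf("\n");
    if (lastNewline === -1) return ""; // no complete line yet
    const complete = this.buffer.slice(0, lastNewline);
    this.buffer = this.buffer.slice(lastNewline + 1);
    return complete.split("\n").map((line) => this.rewriteLine(line)).join("\n") + "\n";
  }

  // Emits whatever is still buffered once the stream has ended.
  flush(): string {
    const rest = this.buffer;
    this.buffer = "";
    return rest.length > 0 ? this.rewriteLine(rest) : "";
  }
}
```

The added delay is bounded by the time it takes the model to finish the current line plus the rewrite itself, so a sub-100 ms substitution does not visibly stall output.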

Generalisations

The pattern applies to any churning symbol namespace:

  • Icon libraries (canonical instance: systems/lucide-react).
  • Emoji shortcode sets (:raised-hands: vs :raising-hands: — semantically close, one might not exist in a given set).
  • Component library exports (Material UI, Chakra UI, etc., where similar-looking components churn on versions).
  • Translation keys — LLM may emit a plausible translation-key string; resolve to the nearest real key in the translation-bundle.
  • API endpoint names — where the API surface has many similarly-shaped endpoints.

Boundary: when NOT to use

  • Non-symbol text. This is a pattern for identifiers — things that are either right or wrong with no graceful degradation. Prose text should not be embedding-rewritten.
  • Where wrong-but-plausible is worse than explicit error. Medical / legal / financial APIs where a silent substitution could be harmful. In those cases, emit an explicit error rather than a plausible substitution.
  • Namespaces that don't churn. For stable APIs, a periodically-regenerated static-analysis check is cheaper than maintaining embeddings.
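The second boundary can be made an explicit policy switch, so that sensitive call sites fail loudly instead of accepting the nearest match. This is a sketch under assumptions: the Policy type, resolveWithPolicy and the nearest callback are hypothetical names, not part of any library.

```typescript
type Policy = "substitute" | "strict";

// Resolves a symbol against the live exports. Under "strict", an unknown
// symbol raises an error that carries the nearest match as a suggestion
// rather than silently substituting it.
function resolveWithPolicy(
  symbol: string,
  liveExports: Set<string>,
  nearest: (missing: string) => string, // hypothetical nearest-neighbour lookup
  policy: Policy,
): string {
  if (liveExports.has(symbol)) return symbol;
  const suggestion = nearest(symbol);
  if (policy === "strict") {
    throw new Error(`Unknown symbol "${symbol}"; did you mean "${suggestion}"?`);
  }
  return suggestion;
}
```

An icon import would run with "substitute"; a call into a payments or dosage API would run with "strict", surfacing the suggestion to a human or back to the model instead of rewriting the code.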

Seen in
