
PATTERN · Cited by 2 sources

Selective indexing heuristics

Selective indexing heuristics is the pattern of applying rule-based filters to shrink the set of documents/frames/records that make it into a search index, rather than indexing everything in production. The filters encode domain assumptions about what constitutes search-worthy content vs noise, and their goal is to cut cost and improve result quality simultaneously — a smaller, higher-signal corpus tends both to cost less to host and query and to return better top-K results.

Intent

Two forces push back on "index everything":

  • Economics. Vector-backed indexes + ANN structures + ranker features have real per-document cost. For a corpus whose distribution has a long tail of low-quality / near-duplicate / WIP content, the tail blows the budget without contributing trustworthy results.
  • Quality. Scale alone is not quality. Duplicates crowd top-K; stale / abandoned designs (Figma "Graveyard" pages; dead project notes; throwaway files) pollute results even when correctly retrieved; copy-of-copy files saturate results with near-identical hits.

Solution: treat what is indexed as an explicit design surface. Author heuristics that filter the ingest queue at enqueue time, and keep the heuristics observable so they can evolve.
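A minimal sketch of enqueue-time filtering with observable rejection counters. The document record, heuristic name, and dimension set below are all illustrative assumptions, not Figma's schema:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Doc:
    # Illustrative record; a real ingest queue carries far more metadata.
    id: str
    width: int
    height: int
    top_level: bool

@dataclass
class IngestFilter:
    """Ordered, named heuristics applied at enqueue time.

    Each heuristic returns a rejection reason (string) or None to pass.
    Per-reason counts keep the policy observable so it can evolve.
    """
    heuristics: list
    rejections: Counter = field(default_factory=Counter)

    def admit(self, doc: Doc) -> bool:
        for name, rule in self.heuristics:
            reason = rule(doc)
            if reason is not None:
                self.rejections[f"{name}:{reason}"] += 1
                return False
        return True

# Example heuristic: keep only frames with common UI dimensions
# (the specific sizes are made-up illustrative values).
UI_DIMS = {(375, 812), (1440, 1024), (390, 844)}

def ui_shape(doc: Doc) -> Optional[str]:
    return None if (doc.width, doc.height) in UI_DIMS else "non-ui-dimensions"

ingest = IngestFilter(heuristics=[("ui_shape", ui_shape)])
```

Because every rejection is counted under a `heuristic:reason` key, a dashboard can show exactly which rule dropped how many documents — the observability that makes the heuristics safe to evolve.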

Mechanism (Figma realization)

Figma AI Search stacks four heuristics (Source: sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma):

| Heuristic | Rule | Rationale |
| --- | --- | --- |
| Top-level frame + UI-shape dimensions | Only top-level frames that look like UI designs via common UI frame dimensions | Most non-UI frames are scratchpad / annotations / offcuts |
| Non-top-level exception | Nested frames meeting "the right conditions" | Designers organise UI work in sections / nested frames — the naive "top-level only" rule misses real UI |
| Duplicate collapsing | Only one of each near-duplicate frame | Designers riff via duplicate-and-tweak; indexing every dup just crowds top-K |
| File-copy skipping | Skip unaltered copies of files entirely | Designers frequently copy files wholesale; unaltered copies add cost, no new signal |

Plus an experimental fifth:

  • Quality signals (experimental) — e.g. frame marked ready for development as an explicit indexable-vs-not signal or ranker input. Figma's article explicitly flags this as "still experimenting" rather than shipped policy.

Notably: the heuristics combine shape (dimensions), structural position (top-level vs nested), content relationships (near-duplicate), and explicit user signal (ready-for-dev tag) — four different axes the indexing policy layers together.

Why it beats "just index everything"

Figma's framing:

"We quickly learned we couldn't index and search everything — it would be too costly."

The framing matters: the question isn't "is there content to index?" — it's "is the marginal document's contribution to top-K quality worth its index/query cost?" For most non-UI frames, duplicates, and file copies, the answer is no.

Contrast with a naive crawl-everything strategy:

| | Index everything | Selective heuristics |
| --- | --- | --- |
| Cost | All corpus × index cost | Filtered corpus × index cost (much smaller) |
| Top-K quality | Degraded by duplicates / scratchpad | Cleaner; fewer near-duplicates in top results |
| Stale-content leakage | High (WIP + archives both indexed) | Low (policy + patterns/edit-quiescence-indexing) |
| Ranker burden | Ranker must learn to suppress noise | Noise pre-filtered; ranker focuses on relevance |

Key design questions

When applying this pattern:

  1. What domain-specific shape signals exist? (Frame dimensions? File-extension whitelist? Message-length threshold? Structural-role tags?) These are the cheapest filters.
  2. What relational signals exist? (Near-duplicate detection, copied-file identification, fork-of-file relationships.) More expensive to compute but often high-leverage because they prune whole classes at once.
  3. What user-authored signals exist? (Ready-for-dev flag; "archive" tag; --search: false frontmatter.) These rely on users actually setting the signal, so measure how widely a signal is set before depending on it in ranking.
  4. What's the eval signal that a new heuristic is a net win? A heuristic can quietly degrade recall on rare-but-important queries. Treat every new filter as a change the eval pipeline should catch (see patterns/visual-eval-grading-canvas, concepts/relevance-labeling).
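Relational signals (question 2) can be sketched with character-shingle Jaccard similarity for near-duplicate collapsing. The shingle size and 0.9 threshold are assumptions; systems at scale typically use MinHash or SimHash rather than this pairwise comparison:

```python
def shingles(text: str, k: int = 5) -> set:
    """Overlapping k-character substrings of the text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def collapse_near_duplicates(docs: list, threshold: float = 0.9) -> list:
    """Keep only the first representative of each near-duplicate cluster."""
    kept = []  # list of (doc, shingle_set) pairs already admitted
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, ksh) < threshold for _, ksh in kept):
            kept.append((doc, sh))
    return [doc for doc, _ in kept]
```

This is the "prunes whole classes at once" property: one admitted representative silently rejects every later member of its cluster.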

Relationship to other patterns

  • patterns/edit-quiescence-indexing — the time-based complement: filters by "has been stable long enough." Combines naturally with content-based selective heuristics — both say "not yet / not ever" to certain documents.
  • patterns/multimodal-content-understanding (Dropbox Dash) — the per-modality-routing complement: filters how content is extracted rather than whether to index at all. Orthogonal axis.
  • concepts/relevance-labeling — the eval mechanism that keeps selective-indexing heuristics honest; without eval, a filter can silently remove the one result the user wanted.

Caveats

  • Heuristics drift. UI-frame dimensions, duplicate-detection thresholds, and file-copy identification all depend on assumptions about how users structure their work. Behaviour changes over time. Revisit.
  • Long-tail queries. Selective indexing by definition excludes some content — so some rare queries will have no right answer at all. Design the product around this (e.g. "no results found — consider broadening" UI rather than a silent empty page).
  • Experimental signals ≠ production. Figma explicitly flags "quality signals" (e.g. ready-for-dev) as experimental. Don't cite aspirational mechanisms as shipped behaviour.
  • Scope creep risk. The set of heuristics tends to grow; each new one adds surface area to test, debug, and eval against. Sunset heuristics that don't demonstrably improve the pipeline.

Seen in

  • sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma — four stacked heuristics (UI-frame dimensions + non-top-level exception + duplicate collapsing + file-copy skipping) plus experimental ready-for-dev quality signal; explicit corpus-reduction framing "couldn't index everything — too costly."
  • sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma — quantifies the corpus reduction: removing draft files, within-file duplicate designs, and unmodified file copies cut the index in half. Notably, "not surfacing duplicate designs within files is a nice user experience improvement" too — corpus reduction doubles as a product improvement, not just a cost lever. OpenSearch memory was the second-biggest cost driver; halving the corpus was the first lever pulled (before vector quantization).