
PATTERN

Centralized ahead-of-time indexing

Pattern: shift code indexing off the developer's machine and off IDE startup paths, run it on a shared fleet ahead of time, and expose the resulting index to clients over the network through a query interface.

Canonical wiki instance: Meta's Glean (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean).

Problem

IDE-local code indexing doesn't scale once a codebase crosses from "project" to "monorepo":

  • IDE startup time degrades — the IDE must analyse every dependency it might answer go-to-definition or find-references queries against. C++ is the worst case because parsing it (and its transitively included headers) is expensive.
  • Each developer pays the indexing cost on their own machine.
  • The index is bounded by one machine's disk + RAM.
  • Multiple developers indexing the same codebase wastes CPU fleet-wide.

From the Meta post:

"With a larger codebase and many developers working on it, it makes sense to have a shared centralized indexing system so that we don't repeat the work of indexing on every developer's machine. And as the data produced by indexing can become large, we want to make it available over the network through a query interface rather than having to download it." (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean.)

Shape

┌─────────────────┐     ┌─────────────────────────┐
│ Indexer fleet   │────▶│ Replicated fact DBs     │
│ (per-language)  │     │ (network-queryable)     │
└─────────────────┘     └────────────┬────────────┘
                         Clients issue queries:
                    ┌────────────────┼───────────────┐
                    ▼                ▼               ▼
              ┌─────────┐      ┌────────────┐    ┌──────────┐
              │ Code    │      │ Code       │    │ IDE      │
              │ browser │      │ review     │    │ augment  │
              └─────────┘      └────────────┘    └──────────┘

Four properties the pattern forces:

  1. Indexing is parallelisable. "Indexing can be heavily parallelized and we may have many indexing jobs running concurrently."
  2. Query service is distributed. "The query service will be widely distributed to support load from many clients that are also distributed."
  3. Databases are replicated. "The databases will be replicated across the query service machines and also backed up centrally."
  4. Network-first query interface. Clients don't download the index; they ask questions of it.
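The network-first property can be sketched as a client that sends questions and receives facts, never the index itself. Everything here is invented for illustration (`IndexClient`, the JSON request shape, the `find-references` op) — the point is only that the transport carries queries, not index files:

```python
# Hypothetical sketch of a network-first index client: the index stays
# server-side; clients send small queries and get back small answers.
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    file: str
    line: int

class IndexClient:
    """Talks to a replicated, network-queryable fact DB (names invented)."""
    def __init__(self, transport):
        self._transport = transport  # e.g. an RPC stub or HTTP session

    def find_references(self, symbol: str, revision: str = "HEAD"):
        # The query names a symbol and a revision; the server resolves it
        # against its fact-DB snapshot for that revision.
        reply = self._transport.query(
            json.dumps({"op": "find-references",
                        "symbol": symbol,
                        "revision": revision}))
        return [Reference(**r) for r in json.loads(reply)]

# A fake in-process transport standing in for the network hop.
class FakeTransport:
    def query(self, payload: str) -> str:
        request = json.loads(payload)
        assert request["op"] == "find-references"
        return json.dumps([{"file": "lib/util.cpp", "line": 42}])

client = IndexClient(FakeTransport())
refs = client.find_references("util::parse")
```

Note the contrast with a download-the-index design: the client holds only the answers it asked for, so the index can grow past any one machine's disk and RAM.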

What the pattern unlocks

  • Fleet-wide code navigation beyond any one IDE's working set — "Glean allows you to, for example, find all the references to a function, not just the ones visible to the IDE. This is particularly useful for finding dead code, or finding clients of an API that you want to change."
  • Cross-language navigation. A jump across an RPC or FFI boundary lands on the target-language definition, because every language's index lives in the same service.
  • Instant-start navigation anywhere code appears — code browser, code review tool, documentation UI, not just IDE.
  • Ad-hoc queries that no IDE would ever expose — build-dependency graph analysis, dead-code detection, API-migration progress, test-coverage / test-selection, automated data removal, RAG for AI assistants.
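One of those ad-hoc queries — dead-code detection — reduces to a whole-repo join that no single IDE's working set can compute: a definition with zero references anywhere in the index is a candidate. A minimal sketch, with invented data shapes standing in for index facts:

```python
# Sketch: dead-code detection as an ad-hoc fleet-wide query.
# A definition no file in the whole index references is a dead-code
# candidate; an IDE sees only its loaded files, so it cannot answer this.
def dead_code_candidates(definitions, references):
    """definitions: iterable of symbol names defined in the repo;
    references: iterable of (from_file, symbol) pairs from the index."""
    referenced = {symbol for _, symbol in references}
    return sorted(d for d in definitions if d not in referenced)

defs = ["util::parse", "util::legacy_parse", "net::connect"]
refs = [("app/main.cpp", "util::parse"), ("app/io.cpp", "net::connect")]
# util::legacy_parse has no callers anywhere in the repo.
```

API-migration progress tracking is the same query with the inequality flipped: count remaining references to the old symbol over time.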

Constraints the pattern imposes

  • Index freshness — the index is a snapshot that lags behind HEAD, which drives the need for incremental indexing to keep the lag bounded.
  • Revision-aware serving — different clients may want different revisions (review tools need the diff revision; browsers want HEAD). Drives multi-revision serving.
  • API + format neutrality at the query layer — if you want many clients, you cannot tie the query surface to one IDE's feature set. Glean's response is a declarative query language (Angle) + a symbol-server layer (Glass) that exposes language-neutral views.
  • Indexer-fleet scheduling — prioritising fresh-commit indexing vs background reprocessing becomes its own system-design problem (not deeply covered in the 2024-12-19 post).
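The revision-aware-serving constraint can be sketched as a router that keeps several index snapshots and resolves each query against the one matching the requested revision. The design below (a `RevisionRouter` with a HEAD fallback) is an assumption for illustration, not how Glean actually routes:

```python
# Sketch of revision-aware serving (assumed design): multiple index
# snapshots coexist so review tools (diff revision) and code browsers
# (HEAD) can query the same service at the same time.
class RevisionRouter:
    def __init__(self):
        self._snapshots = {}  # revision -> fact-DB handle (plain dicts here)

    def publish(self, revision: str, db):
        self._snapshots[revision] = db

    def lookup(self, revision: str):
        # Fall back to HEAD when no snapshot exists for the exact revision;
        # this fallback is where index-freshness lag becomes client-visible.
        return self._snapshots.get(revision, self._snapshots["HEAD"])

router = RevisionRouter()
router.publish("HEAD", {"util::parse": "lib/util.cpp:42"})
router.publish("diff-123", {"util::parse": "lib/util.cpp:40"})
```

A query against `diff-123` sees the symbol at its pre-rebase location; every other revision degrades to the HEAD snapshot.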

Variants

  • Whole-repo AOT, periodic rebuild. Simplest form; the whole index is rebuilt periodically and served unchanged between rebuilds. Doesn't scale at high commit rate.
  • Whole-repo AOT + incremental updates. Glean's operating point — the base index plus incremental layers keep the perceived lag low. See concepts/incremental-indexing.
  • Centralized but on-demand. A build-time / CI-time indexer runs per commit and pushes results; code-browser reads lazily. Simpler but loses cross-revision queryability.
  • IDE augmentation (hybrid). The IDE queries the centralized index at startup for fleet-wide results, then layers in the file-local analysis from its own language server as files load. Meta's VS Code C++ extension is the canonical wiki instance: "As the IDE loads the files the developer is working on, the C++ language service seamlessly blends the Glean-provided data with that provided by the native clangd backend."
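The blending step in the hybrid variant can be sketched as a merge policy: for files the IDE has loaded, trust the fresh local language server; for everything else, keep the centralized index's fleet-wide answer. Function and data shapes here are invented, not the VS Code extension's actual API:

```python
# Sketch of the hybrid IDE-augmentation merge (invented names): overlay
# file-local language-server results on the centralized index's answer.
def blended_references(central, local, loaded_files):
    """central/local: lists of (file, line) reference hits; for files the
    IDE has loaded, prefer the local server over the possibly stale index."""
    merged = [hit for hit in central if hit[0] not in loaded_files]
    merged.extend(hit for hit in local if hit[0] in loaded_files)
    return sorted(merged)

central = [("a.cpp", 10), ("b.cpp", 20)]  # fleet-wide, slightly stale
local = [("a.cpp", 12)]   # a.cpp was edited since the last index run
loaded = {"a.cpp"}        # files the IDE currently has open
```

This gives instant startup (the centralized answer is available immediately) while edits in open files are still reflected as soon as the local server analyses them.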
