CONCEPT Cited by 1 source
Code indexing¶
Code indexing is the job of collecting structured facts from source code so that developer tools can answer questions efficiently — questions like "Where is the definition of MyClass?" or "Which functions are defined in myfile.cpp?" (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean).
What tools rely on a code index¶
From Meta's 2024-12-19 Glean post:
- Code navigation — "Go to definition" in an IDE or code browser.
- Code search — symbol search across an entire repo.
- Automatically-generated documentation — extract API structure + doc comments, render hyperlinked doc pages.
- Code analysis — dead-code detection, lint, complexity metrics, API migration tracking, test selection, test-coverage tools.
- AI coding assistants — RAG over a code index (named in the post).
IDE-local vs centralized ahead-of-time¶
Two architectural archetypes:
- IDE-local indexing. The IDE indexes the files you open as you open them ("An IDE will typically do indexing as needed, when you load a new file or project for example"). Works well for small or medium projects; breaks down at monorepo scale. "The larger your codebase, the more important it becomes to do code indexing ahead of time. For large projects it becomes impractical to have the IDE process all the code of your project at startup and, depending on what language you're using, that point may come earlier or later: C++ in particular is problematic due to the long compile times."
- Centralized AOT indexing. A shared service indexes the whole repo ahead of time and serves the results from a replicated query service; clients fetch answers over the network. Canonical wiki instance: Meta's Glean. See patterns/centralized-ahead-of-time-indexing for the pattern.
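The difference between the two archetypes can be sketched with a toy indexer. This is only an illustration under stated assumptions: `SOURCES`, `index_file`, `LazyIndex`, and `build_ahead_of_time` are hypothetical names, the "repo" is an in-memory dict, and the regex-based indexer stands in for a real language front end.

```python
import re

# Hypothetical in-memory "repo": path -> file contents.
SOURCES = {
    "a.py": "def alpha():\n    pass\n",
    "b.py": "def beta():\n    pass\n",
}


def index_file(path, text):
    """Toy indexer: map each top-level 'def' name to (path, line number)."""
    return {
        m.group(1): (path, text.count("\n", 0, m.start()) + 1)
        for m in re.finditer(r"^def (\w+)", text, re.MULTILINE)
    }


class LazyIndex:
    """IDE-local style: a file is indexed only when it is opened."""

    def __init__(self, sources):
        self.sources = sources
        self.symbols = {}

    def open_file(self, path):
        self.symbols.update(index_file(path, self.sources[path]))

    def find(self, name):
        return self.symbols.get(name)


def build_ahead_of_time(sources):
    """Centralized style: index every file up front, before any query."""
    symbols = {}
    for path, text in sources.items():
        symbols.update(index_file(path, text))
    return symbols


lazy = LazyIndex(SOURCES)
lazy.open_file("a.py")          # only a.py has been opened so far
aot = build_ahead_of_time(SOURCES)

print(lazy.find("beta"))        # None: b.py was never opened
print(aot["beta"])              # ('b.py', 1)
```

In the lazy variant the cost of indexing is paid on the developer's machine at open time, and symbols in unopened files are invisible; in the ahead-of-time variant the cost is paid once, before queries arrive.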
Why centralize at scale¶
"With a larger codebase and many developers working on it, it makes sense to have a shared centralized indexing system so that we don't repeat the work of indexing on every developer's machine. And as the data produced by indexing can become large, we want to make it available over the network through a query interface rather than having to download it." — the wiki's canonical framing of the IDE-local → shared-service transition.
Architectural consequences:
- Indexing is heavily parallelisable → runs on a fleet.
- Query service scales independently → network-attached, widely distributed.
- The index itself is data → replicated across query machines + backed up centrally.
- Multiple revisions must be queryable simultaneously → drives incremental indexing + revision stacking.
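The last consequence, revision stacking, can be sketched as a small overlay on top of an immutable base index. This is a minimal sketch and not Glean's actual storage scheme: the `StackedIndex` class is hypothetical, and it assumes facts are grouped per file so that a revision overrides or hides whole files.

```python
class StackedIndex:
    """Read-only view: an overlay of re-indexed files on top of a base index.

    Assumption for this sketch: facts are grouped per file, so a layer
    replaces or hides whole files. Real systems may track finer units.
    """

    def __init__(self, base=None, reindexed=None, deleted=()):
        self._base = base                         # another StackedIndex, or None
        self._reindexed = dict(reindexed or {})   # file -> tuple of facts
        self._deleted = frozenset(deleted)        # files removed in this layer

    def facts_for(self, file):
        if file in self._reindexed:
            return self._reindexed[file]          # this layer wins
        if file in self._deleted or self._base is None:
            return ()
        return self._base.facts_for(file)         # fall through to the base


# Full index, built once for a base revision.
base = StackedIndex(reindexed={
    "a.cpp": ("function foo",),
    "b.cpp": ("function bar",),
})

# Two later revisions, each storing only what changed relative to the base.
rev_x = StackedIndex(base, reindexed={"a.cpp": ("function foo", "function baz")})
rev_y = StackedIndex(base, deleted={"b.cpp"})

print(rev_x.facts_for("a.cpp"))   # ('function foo', 'function baz')
print(rev_x.facts_for("b.cpp"))   # ('function bar',) inherited from base
print(rev_y.facts_for("b.cpp"))   # ()
```

Because the base is never mutated, `base`, `rev_x`, and `rev_y` can all be queried at the same time, and each new revision costs only the files it touched.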
What a code index stores¶
The contents are not fixed. Meta's Glean deliberately "doesn't decide for you what data you can store": each language gets its own schema (e.g. a C++ FunctionDeclaration predicate has name, namespace, and location fields). Generic code-nav tools need at minimum:
- Symbol declarations (classes, functions, variables) with locations (file + span).
- References (every usage site of each symbol).
- Types / signatures / inheritance (for documentation and type-on-hover).
- Doc comments extracted via the language's convention.
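The minimum set above maps onto a small data structure. The following is a sketch only: `Location`, `Declaration`, and `CodeIndex` are hypothetical names, spans are simplified to line ranges, and nothing here reflects Glean's real schema or storage.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    file: str
    start_line: int   # span simplified to a line range for this sketch
    end_line: int


@dataclass(frozen=True)
class Declaration:
    name: str
    kind: str             # "class", "function", "variable", ...
    location: Location
    signature: str = ""   # type or signature text, if any
    doc: str = ""         # extracted doc comment, if any


class CodeIndex:
    def __init__(self):
        self._by_name = defaultdict(list)   # symbol name -> declarations
        self._by_file = defaultdict(list)   # file -> declarations
        self._refs = defaultdict(list)      # symbol name -> usage locations

    def add_declaration(self, decl):
        self._by_name[decl.name].append(decl)
        self._by_file[decl.location.file].append(decl)

    def add_reference(self, name, location):
        self._refs[name].append(location)

    def definitions_of(self, name):
        """Answers 'Where is the definition of MyClass?'"""
        return list(self._by_name.get(name, []))

    def functions_in(self, file):
        """Answers 'Which functions are defined in myfile.cpp?'"""
        return [d.name for d in self._by_file.get(file, []) if d.kind == "function"]

    def references_to(self, name):
        return list(self._refs.get(name, []))


index = CodeIndex()
index.add_declaration(Declaration("MyClass", "class", Location("myfile.cpp", 3, 20)))
index.add_declaration(Declaration(
    "helper", "function", Location("myfile.cpp", 22, 30),
    signature="int helper(int)", doc="Adds one.",
))
index.add_reference("MyClass", Location("main.cpp", 8, 8))

print(index.definitions_of("MyClass")[0].location.file)   # myfile.cpp
print(index.functions_in("myfile.cpp"))                    # ['helper']
```

Keeping separate lookups by name and by file is what makes both example questions cheap to answer without scanning every fact.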
Beyond that, language-specific detail unlocks analyses the generic layer can't support. For example, C++ dead-using detection "requires the data to include some C++-specific details, such as which using statement is used to resolve each symbol reference."
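To see why that extra detail matters, here is a minimal sketch of dead-using detection over such facts. The record types `UsingStatement` and `SymbolReference` and the `resolved_via` field are hypothetical; they only assume, as the quote describes, that each reference records which using statement resolved it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsingStatement:
    file: str
    line: int
    target: str           # e.g. "std::vector" or a namespace name


@dataclass(frozen=True)
class SymbolReference:
    file: str
    line: int
    symbol: str
    # The using statement that made this name resolvable, or None if the
    # reference was fully qualified. This is the C++-specific detail.
    resolved_via: Optional[UsingStatement]


def unused_usings(usings, references):
    """Return using statements that no reference relied on."""
    used = {r.resolved_via for r in references if r.resolved_via is not None}
    return [u for u in usings if u not in used]


using_vector = UsingStatement("main.cpp", 3, "std::vector")
using_string = UsingStatement("main.cpp", 4, "std::string")

refs = [
    SymbolReference("main.cpp", 10, "vector", resolved_via=using_vector),
    SymbolReference("main.cpp", 11, "std::map", resolved_via=None),
]

dead = unused_usings([using_vector, using_string], refs)
print([u.target for u in dead])   # ['std::string']
```

Without the `resolved_via` link, a generic index would know that `vector` is referenced but not which using statement, if any, the reference depended on.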
Format + language neutrality¶
Two dimensions of neutrality matter:
- Format-level neutrality — a published interchange format anyone can produce/consume. Example: LSIF in the LSP ecosystem.
- Query-level neutrality — a query language that can expose language-agnostic views over language-specific facts. Glean does this via schema-level derived predicates in Angle — see patterns/language-neutral-schema-abstraction.
Glean's bet (explicit in the post): you want language-specific data (so you can ask language-specific questions), but you want language-neutral views available on top (so cross-language tooling doesn't break). You don't have to pick.
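The idea of a neutral view layered over language-specific facts can be illustrated in Python. This is an analogy only: Glean expresses such views as derived predicates in Angle, not as Python functions, and the types `CppFunctionDeclaration`, `PythonFunctionDef`, and `FunctionEntity` and their fields are assumptions made for the sketch.

```python
from dataclasses import dataclass


# Language-specific facts: each language keeps its own shape.
@dataclass(frozen=True)
class CppFunctionDeclaration:
    name: str
    namespace: str
    location: str     # simplified to a file path for this sketch


@dataclass(frozen=True)
class PythonFunctionDef:
    name: str
    module: str
    location: str


# Language-neutral view: one shape that cross-language tools can rely on.
@dataclass(frozen=True)
class FunctionEntity:
    language: str
    qualified_name: str
    location: str


def neutral_view(fact):
    """Derive the neutral record from a language-specific fact."""
    if isinstance(fact, CppFunctionDeclaration):
        return FunctionEntity("cpp", f"{fact.namespace}::{fact.name}", fact.location)
    if isinstance(fact, PythonFunctionDef):
        return FunctionEntity("python", f"{fact.module}.{fact.name}", fact.location)
    raise TypeError(f"no neutral view for {type(fact).__name__}")


facts = [
    CppFunctionDeclaration("connect", "net", "net/socket.cpp"),
    PythonFunctionDef("connect", "client.api", "client/api.py"),
]

# Language-specific question: which C++ functions live in namespace "net"?
in_net = [f.name for f in facts
          if isinstance(f, CppFunctionDeclaration) and f.namespace == "net"]

# Language-neutral question: list every function, whatever the language.
all_functions = [neutral_view(f).qualified_name for f in facts]

print(in_net)          # ['connect']
print(all_functions)   # ['net::connect', 'client.api.connect']
```

The language-specific facts stay intact, so nothing is lost by adding the neutral view; it is derived on top rather than replacing the underlying data.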
Beyond navigation¶
Code indexing also powers:
- Diff-level analysis via diff sketches — see patterns/diff-based-static-analysis.
- Dead-code detection (e.g. Meta's Automating dead code cleanup).
- API-migration progress tracking.
- Test-coverage + test-selection.
- Automated data removal.
- RAG in AI coding assistants.
This breadth is the Glean post's main argument for generality as a design choice: most of these uses emerged after the initial code-navigation bet.
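As one illustration of how the same reference facts serve a non-navigation use, test selection can be sketched as a reverse walk over "who references whom" edges. This is a hedged sketch, not a description of Meta's tooling: the function `affected_tests` and the edge format `(user, used)` are assumptions.

```python
from collections import defaultdict, deque


def affected_tests(references, changed, tests):
    """Select tests that transitively reference any changed symbol.

    references: iterable of (user, used) pairs taken from the index.
    changed:    symbols modified by a diff.
    tests:      names of symbols that are tests.
    """
    users_of = defaultdict(set)
    for user, used in references:
        users_of[used].add(user)

    # Breadth-first walk from the changed symbols towards their users.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        symbol = queue.popleft()
        for user in users_of[symbol]:
            if user not in seen:
                seen.add(user)
                queue.append(user)

    return sorted(seen & set(tests))


references = [
    ("test_parse", "parse"),
    ("parse", "tokenize"),
    ("test_render", "render"),
]

selected = affected_tests(references, {"tokenize"}, {"test_parse", "test_render"})
print(selected)   # ['test_parse']
```

The walk needs nothing beyond declarations and references, which is why analyses like this could be added after the index already existed for navigation.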
Seen in¶
- sources/2025-01-01-meta-indexing-code-at-scale-with-glean — the canonical wiki statement of the IDE-local-to-centralized shift, the list of code-indexing consumers, and the generality-over-use-case-fit argument.
Related¶
- systems/glean · systems/angle-query-language · systems/glass-symbol-server · systems/lsif
- concepts/symbol-id · concepts/incremental-indexing · concepts/stacked-immutable-databases · concepts/diff-sketch
- concepts/monorepo · patterns/centralized-ahead-of-time-indexing · patterns/language-neutral-schema-abstraction