CONCEPT Cited by 1 source
Incremental indexing¶
Incremental indexing is the discipline of processing just the changes to a codebase when updating a code index — target complexity O(changes) instead of O(repository). In practice, the realistic floor for most languages is O(fanout): the set of files transitively affected by a change, not just the changed files themselves.
"As the codebase grows, and the rate of change of the codebase increases (a monorepo suffers from both of these problems) we find that we can't provide up-to-date information about the latest code because indexing the entire repository can take a long time. The index is perpetually out of date, perhaps by many hours. The solution to this scaling problem is to process just the changes." (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean.)
Why full reindexing fails at scale¶
Two feedback loops kick in at monorepo scale:
- Size — more code to analyse → longer per-run indexing time.
- Rate of change — more commits per hour → shorter available window between commits.
Above a threshold, full-reindex throughput falls below commit rate and the index is "perpetually out of date." The reindexing frontier moves slower than the code does.
Target: O(changes)¶
The desirable complexity: indexing cost proportional to the size of the change, not the size of the repo. A 5-line commit should cost ~5 lines of indexing work, regardless of whether the repo has 10⁶ or 10⁹ lines.
Realistic floor: O(fanout)¶
For most programming languages, O(changes) is unreachable because a change can have semantic reach beyond the changed files:
"in C++ if a header file is modified, we have to reprocess every source file that depends on it (directly or indirectly). We call this the fanout. So in practice the best we can do is O(fanout)." (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean.)
Language-specific fanout definitions:
| Language | Canonical fanout-driver |
|---|---|
| C / C++ | Transitive #include-ers of a changed header |
| Java | Classes referencing a changed public type |
| TypeScript | Modules importing the changed module |
| Rust | Crates/modules with use of the changed item |
| Python | Modules importing the changed module |
The fanout can be large (header widely included, public API changed, language-root changed) or small (private method changed), so incremental-indexing cost is bounded by fanout but not proportional to commit size alone.
Computing the fanout¶
Glean computes fanout with the same query engine that serves the index:
"Interestingly the fanout can often be obtained using Glean queries:
for example for C++, the fanout is calculated by finding all the files
that #include one of the changed files, and then repeating that
query until there are no more files to find." A fixpoint query over a
transitive-reference predicate. See
Angle for the query language itself.
Representation challenge: multiple revisions¶
Incremental indexing doesn't just care about what to reprocess — it also needs to represent changes without blowing up storage:
"We don't want to destructively modify the original data, because we would like to be able to provide data at multiple revisions of the repository, and to do that without storing multiple full-sized copies of the data. So we would like to store the changes in such a way that we can view the whole index at both revisions simultaneously."
Glean's solution: stacked immutable databases — layered, non-destructive additions / hides on top of a base, queried as a single logical database.
Cost-model framing¶
Three cost axes to track for any incremental-indexing design:
- Per-change cost — indexing time as a function of change shape + fanout. Ideal: bounded by fanout, not repo size.
- Query cost — does answering a query get more expensive as layers accumulate? Stacked databases pay query-time for cheap write-time.
- Storage cost — revision deltas must stay sublinear in number of revisions; full-snapshot-per-revision is non-viable at Meta scale.
Seen in¶
- sources/2025-01-01-meta-indexing-code-at-scale-with-glean — the
canonical wiki reference for the O(changes) → O(fanout)
framing, C++
#includefanout as concrete instance, and multi-revision representation constraint.
Related¶
- systems/glean — canonical wiki incremental-code-indexing instance.
- concepts/stacked-immutable-databases — Glean's representation substrate.
- concepts/code-indexing · concepts/monorepo
- concepts/fanout-and-cycle — the "fanout" term also appears in the gossip-convergence context; Glean's use is the transitive-dependency variant.