PATTERN Cited by 1 source
Diff-based static analysis
Pattern: run the code indexer on each diff (pull request) to produce a machine-readable summary of the change (a "diff sketch"), then fan that summary out to multiple downstream consumers: static-analysis checks, lint rules, commit-level notifications, semantic commit search, and code-review-time navigation.
Canonical wiki instance: Meta's Glean → Phabricator pipeline (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean).
Problem
Code review and pre-commit analysis want semantic context, not just textual context:
- "This diff removes the last call site of function X" — needs repo-wide reference data.
- "This diff introduces a new call to a deprecated API" — needs semantic API knowledge, not regex.
- "Jump to the definition of this symbol in the diff view" — needs index-backed navigation.
- "Connect this production stack trace to recent commits that touched these functions" — needs cross-commit semantic search.
Running a full repo-level analysis for every diff is prohibitively expensive, while pure textual diffs don't carry enough semantic signal.
Shape
Diff ──▶ Incremental indexer ──▶ Diff sketch ──┬─▶ Static analysis
                                               ├─▶ Semantic lint
                                               ├─▶ Rich notifications
                                               ├─▶ Commit-level search
                                               └─▶ Review-time nav (Phabricator)
Three-stage pipeline:
- Index the diff. The indexer runs on the changeset, not the whole repo, producing a per-diff delta on top of the base index. Cost is O(fanout) per diff rather than O(repo), where fanout is the set of files that must be reprocessed because they changed or depend on something that changed.
- Extract the diff sketch. Summarise what the diff did to the code graph: classes / methods / fields introduced or removed, calls added or removed, inheritance edges changed. All entities are keyed by stable symbol IDs so downstream consumers can join against the base index.
- Route to consumers. Each consumer subscribes to the sketch and makes its own decisions.
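The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Glean's schema or API: `DiffSketch`, `build_sketch`, `route`, the diff ID `D123`, and the symbol names are all hypothetical, and the "index" is reduced to plain sets of symbol IDs and call edges for the files the diff touches.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout: this is not Glean's schema or API.
Edge = tuple[str, str]  # (caller symbol ID, callee symbol ID)


@dataclass(frozen=True)
class DiffSketch:
    diff_id: str
    symbols_added: frozenset[str]
    symbols_removed: frozenset[str]
    calls_added: frozenset[Edge]
    calls_removed: frozenset[Edge]


def build_sketch(diff_id: str,
                 base_symbols: set[str], new_symbols: set[str],
                 base_calls: set[Edge], new_calls: set[Edge]) -> DiffSketch:
    """Stage 2: summarise the change as set differences over the code graph."""
    return DiffSketch(
        diff_id=diff_id,
        symbols_added=frozenset(new_symbols - base_symbols),
        symbols_removed=frozenset(base_symbols - new_symbols),
        calls_added=frozenset(new_calls - base_calls),
        calls_removed=frozenset(base_calls - new_calls),
    )


def route(sketch: DiffSketch,
          consumers: dict[str, Callable[[DiffSketch], list[str]]]) -> dict[str, list[str]]:
    """Stage 3: hand the same sketch to every consumer and collect its findings."""
    return {name: consumer(sketch) for name, consumer in consumers.items()}


# Stage 1 output (assumed): facts for the touched files, before and after the diff.
base_symbols = {"app.render", "app.legacy_render"}
new_symbols = {"app.render", "app.fast_render"}
base_calls = {("app.render", "app.legacy_render")}
new_calls = {("app.render", "app.fast_render")}

sketch = build_sketch("D123", base_symbols, new_symbols, base_calls, new_calls)
findings = route(sketch, {
    "lint": lambda s: sorted(s.symbols_removed),
    "notify": lambda s: sorted(callee for _, callee in s.calls_added),
})
```

Because every entry is a symbol ID rather than a text range, each consumer can join its findings back against the base index without re-reading the diff.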
Consumers named in the Meta post
From the 2024-12-19 post:
- Static analysis. "Diff sketches are used to drive a simple static analysis that can identify potential issues that might require further review."
- Non-trivial lint rules. Rules that depend on semantic context — e.g. "last caller of function X removed" — that a text-level linter can't express.
- Rich notifications. Teams subscribe to semantic events ("notify me when anyone modifies function Y") rather than path patterns.
- Semantic commit search. "Connecting a production stack trace to recent commits that modified the affected function(s), to help root-cause performance issues or new failures." This is the headline application.
- Review-time code navigation. Accurate go-to-definition, type-on-hover, and documentation rendered on the diff inside Meta's code-review tool (Phabricator). "This is a powerful lift to the code review process, making it easier for reviewers to understand the changes and provide valuable review feedback." Meta's Glean-powered review-time nav covers C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, and Haskell.
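Two of these consumers are simple enough to sketch. The example below is an illustration only: the function names, the subscription map, and the symbol IDs are hypothetical, and the base index is assumed to expose repo-wide call edges as `(caller, callee)` pairs. The lint rule flags callees whose caller count drops to zero once the diff is applied; the notification helper looks up teams subscribed to modified symbols.

```python
from collections import Counter


def last_caller_removed(base_call_edges, calls_removed, calls_added=()):
    """Return callees that have no callers left once the diff is applied.

    base_call_edges: repo-wide (caller, callee) pairs from the base index.
    calls_removed / calls_added: call edges taken from the diff sketch.
    """
    remaining = Counter(callee for _, callee in base_call_edges)
    for _, callee in calls_removed:
        remaining[callee] -= 1
    for _, callee in calls_added:
        remaining[callee] += 1
    touched = {callee for _, callee in calls_removed}
    return sorted(c for c in touched if remaining[c] <= 0)


def teams_to_notify(subscriptions, modified_symbols):
    """subscriptions maps a symbol ID to the set of teams watching it."""
    teams = set()
    for symbol in modified_symbols:
        teams |= subscriptions.get(symbol, set())
    return sorted(teams)


base_edges = {
    ("a.main", "util.parse"),
    ("b.job", "util.parse"),
    ("a.main", "util.old_fmt"),
}
removed = {("a.main", "util.parse"), ("a.main", "util.old_fmt")}

orphans = last_caller_removed(base_edges, removed)
teams = teams_to_notify({"util.parse": {"infra"}}, ["util.parse", "a.main"])
```

Here `util.parse` still has a caller in `b.job`, so only `util.old_fmt` is reported. A text-level linter could not make that distinction because it depends on call sites outside the diff.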
Why "sketch" instead of raw index diff
Two reasons the consumer pipeline benefits from a pre-aggregated sketch rather than raw layered index output:
- Uniform consumer interface. Each consumer gets a structured list of semantic changes; it doesn't have to diff two index revisions itself.
- Consumer-scoped projection. Different consumers need different slices (review-nav needs span-level detail; lint needs entity-level adds/removes). A sketch is a natural projection point.
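As a rough illustration of consumer-scoped projection, the same list of changes can be projected into an entity-level view for lint and a span-level view for review navigation. The `Change` record and both view functions are hypothetical, and line spans stand in for whatever location data a real sketch would carry.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str               # "added" or "removed"
    symbol: str             # stable symbol ID
    file: str
    span: tuple[int, int]   # (start_line, end_line) shown in the diff view


def lint_view(changes):
    """Entity-level projection: which symbols were added or removed."""
    view = {"added": set(), "removed": set()}
    for change in changes:
        view[change.kind].add(change.symbol)
    return view


def nav_view(changes):
    """Span-level projection: per file, (span, symbol) pairs sorted by position."""
    by_file = defaultdict(list)
    for change in changes:
        by_file[change.file].append((change.span, change.symbol))
    return {path: sorted(entries) for path, entries in by_file.items()}


changes = [
    Change("added", "app.fast_render", "app/render.py", (10, 24)),
    Change("removed", "app.legacy_render", "app/render.py", (3, 8)),
]
lint = lint_view(changes)
nav = nav_view(changes)
```

Neither consumer has to diff two index revisions; each one reads only the slice of the sketch it needs.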
Trade-offs
- Sketch schema governance. Cross-consumer agreement on what the sketch contains is its own design problem. Adding a new kind of semantic change affects every consumer.
- Commit-rate back-pressure. A high monorepo commit rate plus per-diff indexing means the indexer fleet must keep up. Fanout bounds the cost of each diff (see concepts/incremental-indexing), but aggregate indexing load still grows with the commit rate.
- False positives in commit-search. Linking stack traces to recent commits that touched affected functions is an approximation; symptom and cause may be unrelated despite the overlap. The Meta post names the use case but doesn't quantify precision / recall.
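The approximation in commit search is easy to see in a small sketch. The function below is hypothetical, and ranking by overlap count is an assumption made for illustration; the Meta post does not describe how matches are ranked. It returns candidate commits whose sketches touched any symbol on the stack, which says nothing about whether those commits caused the failure.

```python
def candidate_commits(stack_symbols, commit_sketches):
    """Rank recent commits by how many stack-trace symbols they modified.

    commit_sketches: (commit_id, set of modified symbol IDs) pairs, newest first.
    Returns (commit_id, overlapping symbols) pairs. These are candidates only:
    overlap with the stack does not establish cause.
    """
    stack = set(stack_symbols)
    hits = []
    for commit_id, modified in commit_sketches:
        overlap = stack & modified
        if overlap:
            hits.append((commit_id, sorted(overlap)))
    # Stable sort: commits with equal overlap keep their newest-first order.
    hits.sort(key=lambda hit: len(hit[1]), reverse=True)
    return hits


stack = ["svc.handle", "db.query", "db.connect"]
recent = [
    ("c3", {"db.query"}),
    ("c2", {"ui.render"}),
    ("c1", {"db.query", "db.connect"}),
]
ranked = candidate_commits(stack, recent)
```

In this example `c1` ranks first because it touched two stack symbols, even though `c3` is newer. Either could be unrelated to the symptom, which is why the results are best treated as leads for a human to check.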
Relation to other patterns
- patterns/centralized-ahead-of-time-indexing — the base index; diff-based analysis layers on top.
- patterns/language-neutral-schema-abstraction — lets the same sketch shape carry changes across multiple languages without per-language consumer code.
- Diff sketches are distinct from Kotlinator's build-error-driven fix loop: Kotlinator runs a build, parses compiler errors, and emits targeted fixes, whereas this pattern is driven purely by static analysis over an index, with no compile step.
Seen in
- sources/2025-01-01-meta-indexing-code-at-scale-with-glean — the canonical wiki reference: names the sketch, names the five consumers, names Phabricator as the end-user surface.