
PATTERN Cited by 1 source

Diff-based static analysis

Pattern: run the code indexer on each diff (pull request) to produce a machine-readable summary of the change (a "diff sketch"), then fan that summary out to multiple downstream consumers: static-analysis checks, lint rules, commit-level notifications, semantic commit search, and code-review-time navigation.

Canonical wiki instance: Meta's Glean + Phabricator pipeline (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean).

Problem

Code review and pre-commit analysis want semantic context, not just textual context:

  • "This diff removes the last call site of function X" — needs repo-wide reference data.
  • "This diff introduces a new call to a deprecated API" — needs semantic API knowledge, not regex.
  • "Jump to the definition of this symbol in the diff view" — needs index-backed navigation.
  • "Connect this production stack trace to recent commits that touched these functions" — needs cross-commit semantic search.

Running a full repo-level analysis on every diff is prohibitively expensive, while purely textual diffs carry too little semantic signal to answer questions like the ones above.

Shape

Diff ──▶ Incremental indexer ──▶ Diff sketch ──┬─▶ Static analysis
                                               ├─▶ Semantic lint
                                               ├─▶ Rich notifications
                                               ├─▶ Commit-level search
                                               └─▶ Review-time nav (Phabricator)

Three-stage pipeline:

  1. Index the diff. The indexer runs on the changeset, not the whole repo, producing a per-diff delta on top of the base index. Cost is O(fanout) per diff (proportional to the files that must be reprocessed because of the change), not O(repo).
  2. Extract the diff sketch. Summarise what the diff did to the code graph: classes / methods / fields introduced or removed, calls added or removed, inheritance edges changed. All entities are keyed by stable symbol IDs so downstream consumers can join against the base index.
  3. Route to consumers. Each consumer subscribes to the sketch and makes its own decisions.
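
The three stages above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Glean's API: the (kind, symbol ID) fact encoding, the DiffSketch class, and the functions index_changed_files, extract_sketch, and route are all hypothetical names invented for the example, and the "index" is just a dictionary from file path to facts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical fact encoding: (kind, stable symbol ID).
Fact = Tuple[str, str]


@dataclass(frozen=True)
class DiffSketch:
    diff_id: str
    added: FrozenSet[Fact]
    removed: FrozenSet[Fact]


def index_changed_files(index: Dict[str, Set[Fact]], paths: List[str]) -> Set[Fact]:
    """Stage 1 stand-in: collect facts for the changed files only, not the whole repo."""
    facts: Set[Fact] = set()
    for path in paths:
        facts |= index.get(path, set())
    return facts


def extract_sketch(diff_id: str, before: Set[Fact], after: Set[Fact]) -> DiffSketch:
    """Stage 2: summarise the change as facts added and facts removed."""
    return DiffSketch(diff_id, frozenset(after - before), frozenset(before - after))


def route(
    sketch: DiffSketch,
    consumers: Dict[str, Callable[[DiffSketch], List[str]]],
) -> Dict[str, List[str]]:
    """Stage 3: hand the same sketch to every consumer; each decides for itself."""
    return {name: consumer(sketch) for name, consumer in consumers.items()}


# Toy data: the diff replaces a call to a.helper with a call to a new a.fast_helper.
base = {"a.py": {("function", "a.helper"), ("call", "a.main->a.helper")}}
patched = {
    "a.py": {
        ("function", "a.helper"),
        ("function", "a.fast_helper"),
        ("call", "a.main->a.fast_helper"),
    }
}

before = index_changed_files(base, ["a.py"])
after = index_changed_files(patched, ["a.py"])
sketch = extract_sketch("D1", before, after)

consumers = {
    "lint": lambda s: sorted(f"call removed: {sym}" for kind, sym in s.removed if kind == "call"),
    "notify": lambda s: sorted(f"new function: {sym}" for kind, sym in s.added if kind == "function"),
}
results = route(sketch, consumers)
print(results)
```

Because the sketch is immutable and keyed by symbol ID, each consumer can be added or removed without touching the indexing stages.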

Consumers named in the Meta post

From the 2024-12-19 post:

  • Static analysis. "Diff sketches are used to drive a simple static analysis that can identify potential issues that might require further review."
  • Non-trivial lint rules. Rules that depend on semantic context — e.g. "last caller of function X removed" — that a text-level linter can't express.
  • Rich notifications. Teams subscribe to semantic events ("notify me when anyone modifies function Y") rather than path patterns.
  • Semantic commit search. The headline application: "Connecting a production stack trace to recent commits that modified the affected function(s), to help root-cause performance issues or new failures."
  • Review-time code navigation. Accurate go-to-definition, type-on-hover, and documentation rendered on the diff inside Meta's code-review tool (Phabricator). "This is a powerful lift to the code review process, making it easier for reviewers to understand the changes and provide valuable review feedback." Meta's Glean-powered review-time nav covers C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, and Haskell.
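
As an illustration of the "last caller of function X removed" rule, the check below joins the call edges a diff adds and removes against the call edges in the base index. It is a minimal sketch under assumptions: the (caller, callee) edge encoding and the function last_caller_removed are hypothetical, and the Meta post does not describe how its lint rules are implemented.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical encoding of a call edge: (caller symbol ID, callee symbol ID).
Call = Tuple[str, str]


def last_caller_removed(base_calls: Set[Call], added: Set[Call], removed: Set[Call]) -> List[str]:
    """Return callees that had callers in the base index but have none after the diff."""
    after_calls = (base_calls - removed) | added
    remaining = Counter(callee for _, callee in after_calls)
    had_callers = {callee for _, callee in base_calls}
    candidates = {callee for _, callee in removed} & had_callers
    # Counter returns 0 for callees that no longer appear in any call edge.
    return sorted(callee for callee in candidates if remaining[callee] == 0)


base_calls = {
    ("app.main", "util.old_parse"),
    ("app.main", "util.log"),
    ("app.cli", "util.log"),
}
removed = {("app.main", "util.old_parse"), ("app.main", "util.log")}
added = {("app.main", "util.parse")}

orphaned = last_caller_removed(base_calls, added, removed)
print(orphaned)  # util.log still has a caller in app.cli, so only util.old_parse is reported
```

A text-level linter cannot express this rule because the answer depends on call sites in files the diff never touches.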

Why "sketch" instead of raw index diff

Two reasons the consumer pipeline benefits from a pre-aggregated sketch rather than raw layered index output:

  • Uniform consumer interface. Each consumer gets a structured list of semantic changes; it doesn't have to diff two index revisions itself.
  • Consumer-scoped projection. Different consumers need different slices (review-nav needs span-level detail; lint needs entity-level adds/removes). A sketch is a natural projection point.
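
The projection idea can be made concrete with a small example. The Change record and the two projection functions below are hypothetical, chosen only to show one sketch serving two consumers at different levels of detail; the real sketch schema is not described in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Change:
    action: str             # "added" or "removed"
    kind: str               # "function", "call", ...
    symbol: str             # stable symbol ID
    path: str               # file the change appears in
    span: Tuple[int, int]   # (start line, end line) in the diff view


def project_for_lint(changes: List[Change]) -> List[Tuple[str, str, str]]:
    """Entity-level view: which symbols were added or removed, without locations."""
    return sorted({(c.action, c.kind, c.symbol) for c in changes})


def project_for_nav(changes: List[Change], path: str) -> List[Tuple[Tuple[int, int], str]]:
    """Span-level view: where in one file each added symbol reference sits."""
    return sorted((c.span, c.symbol) for c in changes if c.path == path and c.action == "added")


changes = [
    Change("added", "call", "util.parse", "app/main.py", (10, 10)),
    Change("added", "call", "util.parse", "app/cli.py", (4, 4)),
    Change("removed", "function", "util.old_parse", "util.py", (20, 31)),
]

lint_view = project_for_lint(changes)
nav_view = project_for_nav(changes, "app/main.py")
print(lint_view)
print(nav_view)
```

The lint view collapses the two util.parse call sites into one entity-level entry, while the navigation view keeps the exact span for the file being rendered.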

Trade-offs

  • Sketch schema governance. Cross-consumer agreement on what the sketch contains is its own design problem. Adding a new kind of semantic change affects every consumer.
  • Commit-rate back-pressure. A high monorepo commit rate plus per-diff indexing means the indexer fleet must keep up. Fanout bounds the cost of each diff (see concepts/incremental-indexing), but aggregate load still grows with the commit rate, and every consumer sits downstream of the indexer.
  • False positives in commit search. Linking stack traces to recent commits that touched affected functions is an approximation; symptom and cause may be unrelated despite the overlap. The Meta post names the use case but doesn't quantify precision / recall.
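
The commit-search approximation can be illustrated with a simple overlap ranking. This is a sketch under assumptions, not Meta's method: rank_commits is a hypothetical function, the per-commit symbol sets are assumed to come from stored diff sketches, and overlap count is just one plausible scoring choice.

```python
from typing import Dict, List, Set, Tuple


def rank_commits(stack_symbols: List[str], touched: Dict[str, Set[str]]) -> List[Tuple[str, List[str]]]:
    """Rank commits by how many stack-trace symbols each one modified.

    Overlap is only a heuristic: a commit can touch a function in the trace
    without causing the failure, which is where false positives come from.
    """
    frames = set(stack_symbols)
    hits = []
    for commit, symbols in touched.items():
        overlap = sorted(frames & symbols)
        if overlap:
            hits.append((commit, overlap))
    # Most overlapping symbols first; commit ID breaks ties deterministically.
    hits.sort(key=lambda hit: (-len(hit[1]), hit[0]))
    return hits


stack = ["db.query", "cache.get", "api.handler"]
touched = {
    "c1": {"cache.get", "cache.put"},
    "c2": {"db.query", "api.handler"},
    "c3": {"ui.render"},
}

ranked = rank_commits(stack, touched)
print(ranked)
```

Every commit returned is a candidate for review, not a diagnosis; the ranking narrows the search but does not establish cause.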

Relation to other patterns

Seen in
