

Glean

Glean is Meta's open-source code-indexing system: a centralized service that collects, derives, and queries structured facts about source code. Glean was open-sourced in August 2021 and is the substrate for Meta's code browsing, code search, auto-generated docs, code review, IDE acceleration, dead-code detection, API-migration tracking, test selection, automated data removal, and RAG in AI coding assistants (Source: sources/2025-01-01-meta-indexing-code-at-scale-with-glean).

Architecture

Four layers:

  1. Indexers — per-language collectors that walk source and emit facts conforming to that language's schema. Indexing is "heavily parallelized and we may have many indexing jobs running concurrently."
  2. Fact database — facts are stored in RocksDB, "providing good scalability and efficient retrieval." Databases are "replicated across the query service machines and also backed up centrally."
  3. Query service: "widely distributed to support load from many clients that are also distributed." Exposes Angle queries over the network.
  4. Schemas — one per language (plus arbitrary non-language schemas). Each schema defines predicates (≈ SQL tables) whose instances are facts (≈ SQL rows). Schemas also compose: the schema language supports "deriving information automatically, either on-the-fly at query time or ahead of time" — Glean's mechanism for defining language-neutral views over language-specific facts.
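The predicate/fact analogy and the idea of a derived, language-neutral view can be sketched in a few lines of Python. This is a toy in-memory model, not Glean's storage or schema language: the predicate names (`cxx.FunctionDeclaration`, `python.FunctionDeclaration`), the `FactStore` class, and the `derive_declarations` function are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class FactStore:
    """Toy stand-in for a fact database: predicate name -> set of facts."""

    def __init__(self):
        self._facts = defaultdict(set)

    def add(self, predicate, fact):
        # A predicate plays the role of a table, a fact the role of a row.
        # Using a set means a repeated fact is stored once (a choice of this sketch).
        self._facts[predicate].add(fact)

    def facts(self, predicate):
        return sorted(self._facts[predicate])


def derive_declarations(store):
    """Derive a language-neutral view at query time from per-language predicates."""
    view = []
    for language in ("cxx", "python"):
        for name, path in store.facts(f"{language}.FunctionDeclaration"):
            view.append((language, name, path))
    return sorted(view)


store = FactStore()
store.add("cxx.FunctionDeclaration", ("parse", "parser.cpp"))
store.add("python.FunctionDeclaration", ("parse", "parser.py"))
store.add("cxx.FunctionDeclaration", ("parse", "parser.cpp"))  # same fact again

declarations = derive_declarations(store)
```

Here the derivation runs on the fly; the "ahead of time" option mentioned above would correspond to storing the derived rows as facts of their own.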

Core design bets

  • Centralize indexing, network-query the result. The IDE-local model breaks down at monorepo scale; Glean is the canonical wiki instance of the centralized AOT indexing pattern.
  • Don't decide the data model for users. "Glean doesn't decide for you what data you can store" — each language owns its schema; arbitrary non-language data is supported. Trade-off: a lowest-common-denominator model would have been faster to build but would not have enabled the dead-code / build-graph / data-removal / RAG use cases Glean accreted after launch.
  • Declarative logic-based query language. Angle is general enough to express schema-level derivation, cross-language views, and transitive closures (e.g. C++ #include fanout is a Glean query).
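  • Incremental indexing. The target cost is O(changes), proportional to the size of the change rather than the size of the repository; the realistic floor is O(fanout), because everything that depends on a changed file has to be reprocessed. Implemented via stacked immutable databases.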
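To make the last two bets concrete, here is a minimal Python sketch of fanout as a transitive closure over an include graph, and of a new immutable layer stacked on a base database. The source does not describe how Glean's stacking works internally, so the per-file shadowing rule, the `Layer` class, and the sample file names are assumptions made for illustration only.

```python
from types import MappingProxyType


def fanout(changed, includers):
    """Changed files plus every file that transitively includes one of them."""
    seen, stack = set(changed), list(changed)
    while stack:
        current = stack.pop()
        for dependent in includers.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


class Layer:
    """Read-only layer of facts keyed by file, optionally stacked on a base layer."""

    def __init__(self, facts_by_file, base=None):
        frozen = {path: frozenset(facts) for path, facts in facts_by_file.items()}
        self._facts_by_file = MappingProxyType(frozen)
        self._base = base

    def facts(self, path):
        # Assumption of this sketch: the topmost layer that mentions a file wins.
        layer = self
        while layer is not None:
            if path in layer._facts_by_file:
                return layer._facts_by_file[path]
            layer = layer._base
        return frozenset()


# includers maps a header to the files that directly include it.
includers = {"a.h": ["b.h", "x.cpp"], "b.h": ["y.cpp"]}
base = Layer({
    "a.h": {"decl f"}, "b.h": {"decl g"},
    "x.cpp": {"ref f"}, "y.cpp": {"ref g"}, "z.cpp": {"decl main"},
})

to_reindex = fanout({"a.h"}, includers)
# Stand-in for real indexer output: one placeholder fact per reindexed file.
top = Layer({path: {f"reindexed {path}"} for path in to_reindex}, base=base)
```

Only the four files in the fanout are reindexed; `z.cpp` is still answered from the base layer, which is never modified.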

How Glean differs from alternatives

Named alternative: LSIF (Language Server Index Format), the LSP-ecosystem format IDEs use to cache navigation data. Glean "wasn't tied either to particular programming languages or to any particular use case" — an explicit generality contrast vs LSIF's LSP-centric feature set.

Consumers at Meta

  • Glass — Meta's symbol server built on top of Glean. Uniform code-navigation API; used by the internal code browser (embedded Monaco) and by Phabricator code review.
  • C++ IDE augmentation. Meta's VS Code C++ extension serves go-to-definition / find-references / hovercards from Glean at IDE startup, before clangd finishes analysing the working set, then blends Glean and clangd as files load.
  • Documentation generation. API structure + doc comments extracted into Glean → rendered client-side; every symbol gets a stable symbol ID so doc URLs survive code motion.
  • Diff sketches. Glean indexes diffs to produce a diff sketch; downstream static analysis, lint rules, commit-level semantic search, and review-time go-to-definition all consume sketches. See patterns/diff-based-static-analysis.
  • Ad-hoc + post-launch uses. Build-dependency graph analysis, dead-code detection, API-migration progress, code-complexity metrics, test coverage + test selection, automated data removal, RAG in AI coding assistants.
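The C++ IDE augmentation above blends a precomputed index with a live analyser. The source does not say how the blend is decided, so the following Python sketch shows one plausible policy only: trust the local analyser for files it has finished analysing and keep index results for everything else. The function name, the `(file, line)` tuple shape, and the sample data are hypothetical.

```python
def blend_references(index_refs, local_refs, analysed_files):
    """Merge find-references results from a central index and a local analyser.

    Assumed policy: for files the local analyser has finished, use its results;
    for all other files, fall back to the central index.
    """
    merged = [ref for ref in index_refs if ref[0] not in analysed_files]
    merged += [ref for ref in local_refs if ref[0] in analysed_files]
    return sorted(set(merged))


index_refs = [("a.cpp", 10), ("b.cpp", 5)]   # from the central index (may be stale)
local_refs = [("a.cpp", 12)]                 # from the local analyser (fresh)

at_startup = blend_references(index_refs, local_refs, analysed_files=set())
after_load = blend_references(index_refs, local_refs, analysed_files={"a.cpp"})
```

At startup every result comes from the index; once `a.cpp` has been analysed locally, its stale entry is replaced while `b.cpp` is still served from the index.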

Published performance

From the 2024-12-19 post (illustrative, not load-tested):

| Query shape | Latency |
|---|---|
| FunctionDeclaration by name + namespace | "about a millisecond" |
| Inheritance-chain + overriding-method lookup | "first results in a few milliseconds", streamed incrementally |

No fleet, throughput, or index-size numbers are disclosed.

A taste of the query language (Angle)

Define predicates as type records; query by prefix of fields.

predicate FunctionDeclaration {
  name : string,
  namespace : string,
  location : Source
}

A query specifying name and namespace is prefix-indexed and fast because the schema declares that field order. Angle also supports more complex queries — e.g. "classes that inherit from exception and override a method called what" — with incremental streaming of results from the query server.
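Why field order matters for a prefix query can be illustrated with a sorted list and binary search in Python. This is a generic sketch of prefix lookup over ordered keys, not Glean's or RocksDB's actual implementation; the `PrefixIndex` class and the sample rows are hypothetical.

```python
import bisect


class PrefixIndex:
    """Rows are (name, namespace, location) tuples kept in sorted order."""

    def __init__(self, rows):
        self._rows = sorted(rows)

    def lookup(self, *prefix):
        # Binary search to the first row >= prefix, then scan while the prefix matches.
        start = bisect.bisect_left(self._rows, prefix)
        matches = []
        for row in self._rows[start:]:
            if row[:len(prefix)] != prefix:
                break
            matches.append(row)
        return matches


index = PrefixIndex([
    ("render", "app", "ui.cpp:7"),
    ("parse", "lib", "lib.cpp:3"),
    ("parse", "app", "parser.cpp:10"),
])

by_name_and_namespace = index.lookup("parse", "app")
by_name = index.lookup("parse")
```

A lookup by namespace alone gets no help from this ordering, because namespace is not a leading field; in this sketch it would need a full scan.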

See systems/angle-query-language for the full treatment.

Open source

Seen in
