META 2025-01-01 Tier 1

Meta — Indexing code at scale with Glean

Summary

Meta Engineering post (2024-12-19, syndicated to the raw corpus with a 2025-01-01 published date; 132 HN points) on Glean, Meta's open-source code-indexing system. Glean was open-sourced in August 2021 and powers a centralized, network-queryable index of facts about Meta's source code. That index is consumed by code browsing, code search, auto-generated docs, code review, IDE acceleration, dead-code detection, API-migration tracking, test selection, automated data removal, and RAG in AI coding assistants.

The architecturally load-bearing content is in four areas:

  1. Centralized ahead-of-time indexing vs IDE-local indexing. At monorepo scale the IDE cannot index every dependency at startup (C++ compile times make this especially bad), so Glean shifts indexing off the developer machine onto a shared, highly-parallelized fleet, with a network query service fronting a replicated database. This is the centralized AOT indexing pattern.
  2. Generality as a schema + query decision. Glean doesn't pick a data model for you. Each language gets its own schema; Glean stores arbitrary non-language data too. Storage is RocksDB. The query language is Angle — a declarative logic-based language (Angle = anagram of Glean, "to fish") that supports deriving information automatically either at query time or ahead of time. Angle's view-layer mechanism is the language-neutral schema abstraction pattern: keep the detailed language-specific data and project a common cross-language view from it, so you don't have to pick one or the other.
  3. Incremental indexing via stacked immutable databases. As Meta's monorepo grows, a full re-index takes long enough that the index is perpetually out of date. Glean instead indexes just the changes, at a cost of O(fanout) rather than O(repository), via a "stack of immutable databases" where each layer can non-destructively add or hide information from the layers below. A stack behaves like a single database from the client's perspective but can encode revision deltas cheaply. This is the stacked immutable databases mechanism; computing the set of files to reprocess for a change is itself a Glean query (the fanout closure).
  4. Diff sketches for analysing code changes. Glean indexing runs on diffs to extract a machine-readable diff sketch: the list of class/method/field/call changes in a changeset. Diff sketches power lint rules, rich notifications, semantic search over commits (e.g. tying a production stack trace to recent commits that modified the affected functions), and go-to-definition + hovercards on the code being reviewed in the code-review tool (Phabricator). This is the diff-based static analysis pattern generalised across multiple downstream consumers.
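The layering in area 3 can be illustrated with a minimal Python sketch. It is not Glean's storage format: the Layer class, its fields, and the fact strings are hypothetical, and facts are modelled as plain strings. It only shows how a layer can add or hide facts while the layers below stay untouched.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Layer:
    """One immutable layer (hypothetical model, not Glean's on-disk format)."""
    added: FrozenSet[str] = frozenset()    # facts this layer contributes
    hidden: FrozenSet[str] = frozenset()   # facts from below that this layer masks
    below: Optional["Layer"] = None        # next layer down, or None for the base

    def facts(self) -> FrozenSet[str]:
        """Resolve the whole stack into the single view a client would query."""
        inherited = self.below.facts() if self.below else frozenset()
        return (inherited - self.hidden) | self.added


# Base layer: a full index of some revision (made-up fact strings).
base = Layer(added=frozenset({"decl:a.h:Foo", "decl:a.cpp:bar", "decl:b.cpp:baz"}))

# Delta layer: a.cpp changed, so its old fact is hidden and a new one is added.
rev1 = Layer(
    added=frozenset({"decl:a.cpp:bar2"}),
    hidden=frozenset({"decl:a.cpp:bar"}),
    below=base,
)

print(sorted(rev1.facts()))  # view at the newer revision
print(sorted(base.facts()))  # the base view is unchanged and still servable
```

Because no layer is mutated, the base and the stacked view can both be served at once, which is the property the post relies on for representing revisions cheaply.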

Two system-level primitives are also introduced:

  • Glass: the symbol server that abstracts Glean's language-specific schemas behind a uniform code-navigation API. Code browsers (Meta's is built on Monaco) call one endpoint, documentSymbols(repo, path, revision), to render outline + references; other endpoints drive Find References and Call Hierarchy. Glass assigns every symbol a language-specific symbol ID (e.g. REPOSITORY/cpp/folly/Singleton) that stays stable as the definition moves.
  • IDE augmentation for C++. Because Glean has already indexed the whole repo, Meta's VS Code C++ extension exposes go-to-def / find-references / hovercards at IDE startup — before clangd finishes analysing the working set — then seamlessly blends Glean data with clangd as files load. C++ was the launch target because C++ developers "typically have the worst IDE experience due to the long compile times."
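A rough sketch of the Glass idea, assuming two made-up in-memory fact layouts in place of Glean's real schemas. The document_symbols function below is a hypothetical stand-in, not the Glass API. The C++ symbol ID follows the example quoted in the post; the Python ID format and the revision string are invented for illustration, since the post says the format varies per language.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Symbol:
    symbol_id: str  # stable identifier a browser could link to
    line: int       # where the definition sits in the file


# Hypothetical language-specific layouts, keyed by (repo, path, revision).
CPP_DECLS = {
    ("REPOSITORY", "folly/Singleton.h", "rev1"): [(["folly", "Singleton"], 120)],
}
PY_DEFS = {
    ("REPOSITORY", "tools/gen.py", "rev1"): [
        {"module": "tools.gen", "name": "main", "lineno": 8},
    ],
}


def document_symbols(repo: str, path: str, revision: str) -> List[Symbol]:
    """Project differently shaped per-language facts into one uniform record."""
    key = (repo, path, revision)
    out: List[Symbol] = []
    for qualified_name, line in CPP_DECLS.get(key, []):
        out.append(Symbol(f"{repo}/cpp/{'/'.join(qualified_name)}", line))
    for fact in PY_DEFS.get(key, []):
        # Assumed ID format for Python; not taken from the post.
        out.append(Symbol(f"{repo}/py/{fact['module']}.{fact['name']}", fact["lineno"]))
    return out


print(document_symbols("REPOSITORY", "folly/Singleton.h", "rev1"))
print(document_symbols("REPOSITORY", "tools/gen.py", "rev1"))
```

The caller never sees which language-specific table answered the request, which is the point of putting a symbol server in front of the index.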

Key takeaways

  1. Code indexing as a shared, centralized service. "For large projects it becomes impractical to have the IDE process all the code of your project at startup… With a larger codebase and many developers working on it, it makes sense to have a shared centralized indexing system so that we don't repeat the work of indexing on every developer's machine. And as the data produced by indexing can become large, we want to make it available over the network through a query interface rather than having to download it." (Source: Meta Engineering, 2024-12-19.) Canonical statement of the centralized AOT indexing pattern.
  2. Distributed by construction. Indexing is "heavily parallelized"; the query service is "widely distributed to support load from many clients that are also distributed"; databases are "replicated across the query service machines and also backed up centrally." Scale-out on both sides of the producer/consumer split.
  3. Generality over use-case fit. Glean deliberately "doesn't decide for you what data you can store" and "[Angle's] query language is very general." Storage is RocksDB; arbitrary non-language data is supported. The generalisation paid off — "Glean… extend[ed] to a number of use cases beyond what we originally envisaged."
  4. Angle is declarative + logic-based, with derivation. "Defining a schema for Glean is just like writing a set of type definitions." A predicate ≈ a SQL table; its instances are facts ≈ SQL rows. Queries are prefix-indexed: specifying the leading fields makes retrieval efficient. Published latencies: "Glean can return results for this query in about a millisecond" (name+namespace lookup); for an inheritance-chain query, "first results in a few milliseconds" with incremental streaming of the remainder.
  5. Language-neutral view via schema-level abstraction. "We don't have to compromise between having detailed language-specific data or a lowest-common-denominator language-neutral view; we can have both." Mechanism: Angle lets you define a view layer in the schema itself, analogous to SQL views, projecting cross-language queries (e.g. "all declarations in this file") over language-specific fact tables.
  6. Incremental indexing target: O(changes), realistic floor: O(fanout). "In terms of computer science big-O notation, we want the cost of indexing to be O(changes) rather than O(repository)… Even if we figure out a way to represent the changes, in practice it isn't possible to achieve O(changes) for many programming languages. For example, in C++ if a header file is modified, we have to reprocess every source file that depends on it (directly or indirectly). We call this the fanout." The fanout closure is computed by querying Glean for the transitive #include-ers (C++ case) until fixpoint.
  7. Stacked immutable databases as the incrementality substrate. "Glean solves the first problem with an ingenious method of stacking immutable databases on top of each other. A stack of databases behaves just like a single database from the client's perspective, but each layer in the stack can non-destructively add information to, or hide information from, the layers below." Enables serving multiple revisions simultaneously without duplicating full snapshots.
  8. Glass = one API, every language. Code-browser integration reduces to documentSymbols(repo, path, revision) — list of definitions + references with source/target spans. Find References, Call Hierarchy, outlines, doc-hovercards all route through Glass. Every symbol gets a stable symbol ID so URLs to documentation survive code motion.
  9. IDE augmentation beats IDE replacement on large repos. For C++, Glean provides repo-wide go-to-def / find-references / hovercards at IDE startup; clangd's per-file analysis layers on top. "As the IDE loads the files the developer is working on, the C++ language service seamlessly blends the Glean-provided data with that provided by the native clangd backend."
  10. Diff sketches unify review-side tooling. A machine-readable summary of every diff ("a new class, remove a method, add a field… introduce a new call to a function") drives: static analysis on the change, non-trivial lint rules, rich notifications, commit-level semantic search ("connecting a production stack trace to recent commits that modified the affected function(s)"), and accurate go-to-def in code review across C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, Haskell.
  11. Beyond navigation. The post enumerates the Glean-powered use cases that accreted post-launch: analysing build-dependency graphs, detecting and removing dead code ("unused #include or using statements"), tracking API-migration progress, code-complexity metrics, test coverage + test selection, automated data removal, and RAG in AI coding assistants. Each validates the "general-purpose facts about code" design bet.
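Takeaway 4's predicate/fact model and its prefix-indexed retrieval can be approximated in a few lines of Python. This is an analogy only: the Predicate class, the (name, namespace, file) field order, and the sample rows are assumptions, and a sorted list with binary search stands in for an ordered key-value store. It is not Angle syntax and not how Glean lays out data in RocksDB.

```python
import bisect
from typing import Iterable, List, Tuple

# Hypothetical field layout for one predicate: (name, namespace, file).
Fact = Tuple[str, str, str]


class Predicate:
    """A predicate is roughly a table; its facts are roughly rows, kept sorted."""

    def __init__(self, facts: Iterable[Fact]) -> None:
        self._facts: List[Fact] = sorted(facts)

    def lookup(self, *prefix: str) -> List[Fact]:
        """Return facts whose leading fields equal `prefix`.

        Binary search jumps to the first candidate, so only matching facts
        are scanned instead of the whole table.
        """
        start = bisect.bisect_left(self._facts, prefix)
        matches: List[Fact] = []
        for fact in self._facts[start:]:
            if fact[: len(prefix)] != prefix:
                break
            matches.append(fact)
        return matches


decls = Predicate([
    ("Singleton", "folly", "folly/Singleton.h"),
    ("Future", "folly", "folly/futures/Future.h"),
    ("Singleton", "other", "other/Singleton.h"),
])

print(decls.lookup("Singleton", "folly"))  # name + namespace: one narrow seek
print(decls.lookup("Singleton"))           # name only: a slightly wider range
```

A query that fixes only a trailing field (for example the file) gets no help from this ordering and would have to scan, which is why specifying the leading fields matters.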

Systems extracted

  • systems/glean — the system itself: collectors + RocksDB-backed fact database + Angle query service + schemas, open-sourced 2021.
  • systems/angle-query-language — Glean's declarative logic-based schema + query language.
  • systems/glass-symbol-server — the symbol server that sits in front of Glean for code-navigation use cases; owns symbol IDs, abstracts language-specific fact layouts into a uniform API.
  • systems/rocksdb — the underlying key-value store Glean uses for fact storage.
  • systems/lsif — the LSP-ecosystem predecessor / alternative format Glean explicitly contrasts with.
  • systems/monaco-editor — the VS Code-lineage web editor Meta's code browser embeds as its Glean/Glass client.
  • systems/phabricator — Meta's code-review tool; the end-user surface for diff-sketch-driven review-time code navigation. (Phabricator is not created as a new wiki page here — it is named as the integration point only.)

Concepts extracted

  • concepts/code-indexing — the general technical primitive of extracting structured facts from source code for downstream tools (navigation, search, docs, lint).
  • concepts/symbol-id — a language-specific stable string identifier for a named code symbol; links (docs, references) keep working across symbol-definition movement.
  • concepts/incremental-indexing — processing only changes, targeting O(fanout); canonical instance: Glean.
  • concepts/stacked-immutable-databases — Glean's mechanism for representing multiple revisions without duplicating full snapshots; layers compose non-destructively.
  • concepts/diff-sketch — a machine-readable summary of what a diff changed (declarations added/removed, methods, fields, calls).
  • concepts/monorepo — the scaling context that motivates Glean's entire design.
  • concepts/fanout-and-cycle — the existing wiki page gets a "fanout in the incremental-indexing sense" cross-reference; in Glean, fanout closure is computed by recursive query over #include-ers until fixpoint.
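The fanout closure described above can be sketched as a fixpoint loop. In Glean each expansion step would be a query against the index; here a plain dictionary of hypothetical file names stands in for that query, so treat this as an illustration of the iteration rather than Glean's implementation.

```python
from typing import Dict, Set

# Hypothetical reverse-include map: header -> files that #include it directly.
INCLUDERS: Dict[str, Set[str]] = {
    "a.h": {"b.h", "x.cpp"},
    "b.h": {"y.cpp"},
    "c.h": {"z.cpp"},
}


def fanout_closure(changed: Set[str], includers: Dict[str, Set[str]]) -> Set[str]:
    """Expand the changed set with direct and indirect includers until stable."""
    result = set(changed)
    frontier = set(changed)
    while frontier:  # an empty frontier means the fixpoint is reached
        next_frontier: Set[str] = set()
        for path in frontier:
            for dependent in includers.get(path, set()):
                if dependent not in result:
                    result.add(dependent)
                    next_frontier.add(dependent)
        frontier = next_frontier
    return result


print(sorted(fanout_closure({"a.h"}, INCLUDERS)))
```

Changing a.h pulls in b.h, x.cpp, and (through b.h) y.cpp, while c.h and z.cpp stay out of the reprocessing set. The work is proportional to the fanout of the change, not to the size of the repository.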

Patterns extracted

  • patterns/centralized-ahead-of-time-indexing — shift indexing off the developer machine, share the index over the network. Glean is the canonical wiki instance.
  • patterns/language-neutral-schema-abstraction: keep the detailed language-specific facts and define a common-view layer in the schema language itself (SQL-views-style). Lets cross-language queries and language-specific queries coexist.
  • patterns/diff-based-static-analysis — index the diff, produce a machine-readable change summary, and fan it out to lint / search / notifications / review-time nav. Glean's diff sketches are the canonical wiki instance.
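The diff-based static analysis pattern can be sketched as a set difference over the declarations indexed before and after a change, followed by one consumer. The (kind, name) fact shape, the sample declarations, and the lint rule are all hypothetical; the post does not publish the real diff-sketch format.

```python
from typing import List, Set, Tuple

Decl = Tuple[str, str]  # (kind, qualified name), an assumed fact shape


def diff_sketch(before: Set[Decl], after: Set[Decl]) -> List[str]:
    """Summarise a change as machine-readable 'remove'/'add' entries."""
    removed = [f"remove {kind} {name}" for kind, name in sorted(before - after)]
    added = [f"add {kind} {name}" for kind, name in sorted(after - before)]
    return removed + added


def lint_removed_methods(sketch: List[str]) -> List[str]:
    """Example consumer: flag removed methods so callers can be checked."""
    return [entry for entry in sketch if entry.startswith("remove method ")]


BEFORE = {("class", "Cache"), ("method", "Cache.get"), ("method", "Cache.evict")}
AFTER = {
    ("class", "Cache"),
    ("method", "Cache.get"),
    ("field", "Cache.ttl"),
    ("call", "Cache.get -> now"),
}

sketch = diff_sketch(BEFORE, AFTER)
print(sketch)
print(lint_removed_methods(sketch))
```

The same sketch list could feed notifications or commit search; only the consumer function changes, which is what lets one indexing pass serve several review-side tools.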

Operational numbers

Metric | Value | Source
Name + namespace lookup latency | "about a millisecond" | post
Inheritance-chain query first-results latency | "a few milliseconds" | post
Open-source date | "In August 2021 we open-sourced" | post
Code-review code-nav languages | C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, Haskell | post

Caveats

  • No deployment-scale numbers. The post describes the Glean architecture as "highly distributed", names replication + backups, but discloses no fleet sizes, index sizes (bytes), query-per-second rates, ingest throughput, or indexing-fleet parallelism numbers. Scale claims are qualitative.
  • Latency datapoints are illustrative, not load-tested. The ~1 ms and few-ms numbers are for specific sample queries; the post does not claim these are p50/p99 under production load.
  • Incremental indexing deep-dive deferred. The "stacking immutable databases" description is one paragraph plus a diagram; "full details are beyond the scope of this post" with a pointer to the Incremental indexing with Glean blog (not ingested in this round).
  • Published date ambiguity. Meta's blog URL dates the post 2024-12-19; the RSS/raw corpus shows published: 2025-01-01. This page uses the raw-file date to preserve ingest-pipeline consistency; the date in the frontmatter URL remains the authoritative timestamp from Meta's end.
  • Phabricator + Monaco named only as integration surfaces. No architectural detail on either is disclosed in this post; they are named consumers of Glean/Glass APIs. Phabricator is explicitly cited as the code-review tool; Monaco is named as the code-browser editor component. No standalone wiki pages are created for either system in this ingest.
  • Symbol-ID format is per-language. The example REPOSITORY/cpp/folly/Singleton is for C++; "the exact format for a symbol ID varies per language." Wiki page generalises without fixing a format.

Source
