Meta — Leveraging AI for efficient incident response¶
Summary¶
A Meta Engineering post describing the AI-assisted root-cause analysis (RCA) system Meta uses during investigations of reliability issues in its web monorepo. The system combines a heuristics-based retriever — which narrows thousands of recent code changes to a few hundred candidates using code/directory ownership and the runtime code graph of impacted systems — with a Llama 2 (7B)-based ranker that reduces the candidate set to a top-five list through iterative ranking-via-election. Backtesting against historical investigations shows 42% top-5 accuracy at identifying the true root cause at investigation-creation time from information available only at that moment. The ranker was produced by a continued-pretraining (CPT) → supervised fine-tuning (SFT) pipeline on internal wikis, Q&As, code, and a dedicated RCA SFT dataset of ~5,000 examples. Meta explicitly pairs the system with closed feedback loops + explainability + confidence thresholding to avoid misleading engineers.
Key takeaways¶
- 42% top-5 accuracy at investigation-creation time on the web monorepo. This is a backtesting number against historical investigations with known root causes, using only information available when the investigation was first created (title + observed impact — "information density is low at this point"). The accuracy metric is "root cause is in the top five suggested code changes," not top-1. (Source text)
- Two-stage architecture: heuristic retriever → LLM ranker. Stage 1 uses "code and directory ownership or exploring the runtime code graph of impacted systems" to narrow "thousands of changes to a few hundred" without significant accuracy reduction. Stage 2 is a Llama-2-based ranker. This is the canonical retrieve-then-rank-LLM architecture applied to production RCA. (Source text)
- Ranking through election — the prompt-structure innovation. Context windows were limited in mid-2024; "we structure prompts to contain a maximum of 20 changes at a time, asking the LLM to identify the top five changes. The output across the LLM requests are aggregated and the process is repeated until we have only five candidates left." A tournament-style reduction that enables ranking across populations larger than any single prompt can hold. (Source text; new canonical wiki pattern — concepts/ranking-via-election)
- Training pipeline: Llama 2 (7B) → continued pre-training (CPT) → supervised fine-tuning (SFT) → RCA-specific SFT. Meta started with Llama 2 (7B), ran CPT on limited and approved internal wikis, Q&As, and code to expose the model to Meta artifacts, then ran a mixed-SFT phase combining Llama 2's original SFT data with internal context + a dedicated RCA SFT dataset. The RCA SFT set has ~5,000 instruction-tuning examples — each with 2–20 candidate changes + the known root cause + the information available at investigation start. This extends the concepts/continued-pretraining pattern (already canonicalised via eBay's 2025 e-Llama) into the incident-response / smaller-base-model / proprietary-corpus axis. (Source text)
- Logprobs as a ranking signal. Beyond natural-language ranking output, Meta uses the model's token-level log-probabilities over the same fine-tuning prompt format to produce a logprob-ranked list of candidate changes. A second SFT round on examples where the expected output is "a list of potential code changes likely responsible for the issue ordered by their logprobs-ranked relevance, with the expected root cause at the start" teaches the model to produce ranked lists natively. Canonical instance of using LLM logprobs as a ranking mechanism on an internal task. (Source text)
- Closed feedback loops + explainability + confidence thresholding are load-bearing safety primitives. "We ensure that all employee-facing features prioritize closed feedback loops and explainability of results. This strategy ensures that responders can independently reproduce the results generated by our systems to validate their results. We also rely on confidence measurement methodologies to detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision." Explicit precision-over-recall posture for a misleading-output-is-expensive workload. (Source text; canonical wiki statement)
- Monorepo context is load-bearing. The system targets Meta's web monorepo — a single unified repository with "the accumulating number of changes involved across many teams" as the cited scalability challenge. The large candidate set (thousands of changes) + the heuristic retriever's ability to narrow via directory ownership both depend on the monorepo structure. (Source text; reinforces concepts/monorepo with an RCA-tooling implication)
- Named as the second instance in a Meta AI-investigation-tooling lineage. "This is why Meta is investing in advancing our suite of investigation tooling with tools like Hawkeye, which we use internally for debugging end-to-end machine learning workflows." systems/hawkeye-meta is the prior tool (ML debugging); this RCA system is the second (web-monorepo incident response). Future work: "autonomously execute full workflows and validate their results" + "detect potential incidents prior to code push." (Source text)
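The ranking-via-election reduction in the takeaways above can be sketched as a tournament loop. This is a minimal sketch, not Meta's implementation: `llm_top5` stands in for a call to the fine-tuned ranker and is stubbed here with a deterministic sort so the control flow is runnable.

```python
def llm_top5(batch):
    """Placeholder for the LLM ranker: given up to 20 candidate changes,
    return the five it judges most likely to be the root cause.
    Stubbed with a lexicographic sort purely for illustration."""
    return sorted(batch)[:5]

def rank_via_election(candidates, batch_size=20):
    """Tournament-style reduction: rank batches of up to `batch_size`
    candidates, keep the top five from each batch, aggregate the
    survivors, and repeat until only five candidates remain."""
    pool = list(candidates)
    while len(pool) > 5:
        survivors = []
        for i in range(0, len(pool), batch_size):
            survivors.extend(llm_top5(pool[i:i + batch_size]))
        if len(survivors) == len(pool):  # no further reduction possible
            break
        pool = survivors
    return pool[:5]

# A few hundred candidate change IDs, as stage-1 retrieval would produce.
changes = [f"D{n}" for n in range(300)]
top5 = rank_via_election(changes)
# with the toy stub this converges to the five lexicographically smallest IDs
```

The reduction is sound whenever the per-batch selection is consistent: a change the ranker would place in the global top five also places in the top five of any batch containing it, so it survives every round.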
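The logprobs-as-ranking-signal takeaway can likewise be illustrated with a generic scoring loop. `token_logprobs` is a stand-in for whatever interface exposes per-token log-probabilities under the fine-tuning prompt format — Meta does not disclose its API — and is stubbed with a toy word-overlap scorer so the example runs.

```python
import math

def token_logprobs(prompt, completion):
    """Stand-in for a model call returning the log-probability of each
    completion token given the prompt. Toy stub: tokens also present in
    the prompt are scored as likely, all others as unlikely."""
    prompt_words = set(prompt.lower().split())
    return [math.log(0.9) if w.lower() in prompt_words else math.log(0.1)
            for w in completion.split()]

def rank_by_logprob(investigation, candidates):
    """Score each candidate change by the length-normalised sum of its
    token log-probabilities under the same prompt format used during
    fine-tuning, then sort descending (higher = more likely root cause)."""
    scored = []
    for change in candidates:
        lps = token_logprobs(f"Investigation: {investigation}\nRoot cause:", change)
        scored.append((sum(lps) / len(lps), change))
    return [change for _, change in sorted(scored, reverse=True)]

ranked = rank_by_logprob(
    "checkout page timeout after cart service deploy",
    ["cart service deploy", "css color tweak", "timeout config change"],
)
# → ["cart service deploy", "timeout config change", "css color tweak"]
```

The second SFT round described in the takeaway then distils this ordering back into the model, so it emits ranked lists natively instead of requiring an external scoring pass per candidate.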
Systems / hardware extracted¶
- systems/meta-rca-system — Meta's AI-assisted root-cause analysis system; two-stage heuristic-retriever + Llama-2-based ranker; 42% top-5 accuracy at investigation-creation on the web monorepo. New wiki page.
- systems/hawkeye-meta — Meta's prior internal AI-debugging tool (December 2023) for end-to-end ML workflows; named as the predecessor in the investigation-tooling lineage. New wiki page (stub).
- systems/llama-2 — Meta's 2023 open-weight foundation model family; the 7B variant is the base of the Meta RCA ranker. New wiki page.
Concepts extracted¶
- concepts/llm-based-ranker — the architectural shape where an LLM scores/ranks items from a candidate set rather than generating free-form output. Meta's RCA ranker is the canonical instance. New.
- concepts/heuristic-retrieval — the stage-1 primitive: non-ML, domain-rule-encoded reduction of a large search space to a few hundred candidates, without significant accuracy loss, cheap enough to run on every investigation. New.
- concepts/ranking-via-election — the tournament-style rank-reduction primitive: pass N=20 items to the LLM, keep the top-5, recurse until only 5 remain. Enables ranking over populations larger than context window. New.
- concepts/supervised-fine-tuning — the SFT stage of Meta's training pipeline. Sibling of continued pretraining + instruction tuning + RLHF. New canonical wiki concept page.
Existing concepts reinforced:
- concepts/automated-root-cause-analysis — the Presto-analyzer framing (existing wiki page, Meta 2023) is rule-encoded heuristics + multi-source aggregation + optional auto-remediation. The 2024 RCA system is the LLM-powered sibling: same closed-feedback-loop posture, same multi-source aggregation, different stage-2 reasoning substrate (Llama 2 ranker vs hand-coded rules).
- concepts/continued-pretraining — already canonicalised via eBay 2025 (1T tokens, Llama 3.1, 480 H100s, e-commerce); Meta adds the RCA / 7B / approved-internal-wikis variant with a much smaller base and narrower corpus, showing the CPT pattern applies below 70B and on a ~GB rather than TB scale.
- concepts/monorepo — Meta's web monorepo is the specific substrate that creates both the scale problem (thousands of changes/day) and the structural affordance (directory ownership for heuristic retrieval).
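A minimal sketch of the stage-1 heuristic-retrieval primitive, assuming a hypothetical directory-ownership signal; the paths and change records are invented for illustration, and the real retriever also uses code ownership and walks the runtime code graph of impacted systems, neither of which is modelled here.

```python
def retrieve_candidates(recent_changes, impacted_dirs):
    """Stage-1 heuristic retrieval: keep only changes touching a file
    under a directory associated with an impacted system. Non-ML,
    rule-encoded, cheap enough to run on every investigation."""
    def touches_impacted(change):
        return any(path.startswith(d)
                   for path in change["files"]
                   for d in impacted_dirs)
    return [c for c in recent_changes if touches_impacted(c)]

changes = [
    {"id": "D1", "files": ["www/checkout/cart.php"]},
    {"id": "D2", "files": ["www/ads/render.php"]},
    {"id": "D3", "files": ["www/checkout/pay.php", "www/lib/util.php"]},
]
candidates = retrieve_candidates(changes, impacted_dirs=["www/checkout/"])
# D1 and D3 touch the impacted directory; D2 is filtered out
```

This is where the monorepo structure is load-bearing: a single repository with uniform directory ownership makes the prefix-match filter meaningful across every team's changes.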
Patterns extracted¶
- patterns/retrieve-then-rank-llm — canonical wiki pattern: heuristic retriever narrows search space → LLM ranks within the narrowed set → top-K returned. Meta's variant: directory-ownership + runtime-code-graph retrieval, Llama-2 ranking-via-election, top-5 returned. New.
- patterns/closed-feedback-loop-ai-features — architectural discipline for employee-facing AI features: results are explainable + independently reproducible + a feedback channel exists. Meta names it as the primary safety primitive for the RCA system. New.
- patterns/confidence-thresholded-ai-output — confidence-measurement-driven refusal to recommend low-confidence answers. Meta's explicit "sacrifice reach for precision" posture. New.
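The confidence-thresholded-output pattern reduces to a simple gate. The threshold value and the source of the `confidence` score are placeholders here, since Meta discloses neither its confidence-measurement methodology nor the cutoff.

```python
def recommend(ranked_changes, confidence, threshold=0.7):
    """Gate the top-5 suggestion on a confidence score: below the
    threshold, recommend nothing rather than risk misleading the
    responder — sacrificing reach in favor of precision."""
    if confidence < threshold:
        return None  # suppressed: no answer beats a misleading one
    return ranked_changes[:5]

suggestion = recommend(["D1", "D2", "D3", "D4", "D5", "D6"], confidence=0.9)
# → ["D1", "D2", "D3", "D4", "D5"]; with confidence=0.3 it returns None
```

The pattern's cost is the undisclosed recall drop: every suppressed answer is an investigation where the system stays silent even if its top-5 was correct.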
Operational / architectural numbers¶
| Datum | Value |
|---|---|
| Root-cause top-5 accuracy at investigation-creation time | 42% |
| Base model | Llama 2, 7B parameters |
| Candidates per ranker prompt | ≤20 |
| Ranker output per prompt | Top 5 |
| Final ranked list size | 5 changes |
| Retriever output size | "A few hundred" changes |
| Retriever input size | "Thousands" of recent changes |
| RCA SFT dataset size | ~5,000 instruction-tuning examples |
| RCA SFT examples candidate count | 2–20 changes each |
| Training pipeline stages | 2 (CPT → mixed-SFT with RCA-SFT appended) |
| Scope | Web monorepo investigations |
Not disclosed: latency per investigation; infrastructure cost; GPU footprint; top-1 or top-3 accuracy; how the 42% number decomposes across investigation types; explicit confidence threshold; the retriever's own accuracy floor; the eval set size; precision vs recall numbers for the "sacrifice reach for precision" policy.
Caveats¶
- Backtesting, not production. The 42% number is measured against historical investigations with known root causes. Production behaviour (including recall drop from confidence thresholding, responder-override rates, time-to-mitigation change) is not disclosed in this post.
- Single-repo result. The evaluation scope is the Meta web monorepo. Whether the approach generalises across repos with different change-volume distributions, ownership granularity, or code-graph density is not claimed.
- Top-5 metric is a loose bar. Top-5 is the metric that fits the post's ranking-via-election 20→5 reduction, but "engineer has to inspect 5 candidates" is still meaningful toil. Top-1 would be the tighter claim; Meta does not disclose it.
- "42%" is ambiguous about filtering. The retriever pre-narrows to "a few hundred" — the top-5 accuracy is conditional on the true root cause surviving retrieval. Retriever recall is not disclosed. If the retriever misses, the ranker cannot recover.
- Training corpus disclosures are abstract. "Limited and approved internal wikis, Q&As, and code" — no token count, corpus size, or mixing ratio is disclosed. The RCA-SFT dataset size (5,000 examples) is disclosed; CPT + mixed-SFT budgets are not.
- Post-hoc methodology disclosures are thin. Meta names "confidence measurement methodologies" and "different ranking algorithms and prompting scenarios" without detailing them. The only prompt-structure disclosure is the 20-in/5-out ranking-via-election; the other explored alternatives are not named.
- Llama 2 was the available base in early 2024. A 2026 reader should note that Meta has since iterated through Llama 3 and Llama 3.1; whether the same architecture re-trained on a larger base would change the accuracy number is not addressed.
- Closed-feedback-loop claim is policy, not mechanism. Meta names the safety posture but does not disclose the feedback-collection mechanism, the rate at which responders disagree, or the recalibration cadence.
Source¶
- Original: https://engineering.fb.com/2024/06/24/data-infrastructure/leveraging-ai-for-efficient-incident-response/
- Raw markdown:
raw/meta/2024-08-23-leveraging-ai-for-efficient-incident-response-dbfb5ed6.md
Related¶
- companies/meta
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the pre-LLM analyzer pattern (rule-encoded heuristics + multi-source aggregation + auto-remediation); sibling of this 2024 LLM-powered RCA system.
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — the training-substrate post (24K-GPU H100 clusters on RoCE + InfiniBand); the infrastructure on which Llama 3 class models are trained (Llama 2 pre-dates this substrate but Meta's trajectory runs through it).
- sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — the production-engineering post about maintaining the AI capacity substrate that runs training + serving workloads.