

Meta AI-assisted RCA system

Meta's internal AI-assisted root-cause analysis (RCA) system for reliability investigations on the web monorepo. Combines a heuristic retriever (narrows thousands of recent code changes to a few hundred via code/directory ownership + runtime code graph traversal) with a Llama 2 (7B)-based ranker that reduces the remaining candidates to a top-five list via ranking-via-election (20 candidates per prompt → 5 selected → recurse).

Architecture

Two stages:

  1. Heuristic retriever. Non-ML, domain-rule-encoded. Inputs: investigation title + observed impact + runtime signals. Outputs: "a few hundred" code changes from an input of "thousands." Uses code/directory ownership and runtime code-graph exploration of impacted systems. Meta's claim: "reducing the search space from thousands of changes to a few hundred without significant reduction in accuracy."
  2. LLM-based ranker. Llama 2 (7B) fine-tuned on Meta-specific artifacts + an RCA SFT dataset. Ranks via election: each prompt holds ≤20 changes and asks for the top five; the selections are aggregated and the process repeats until only five candidates remain. Additionally produces a logprob-ranked list using the fine-tuning prompt format, where the expected output is "a list of potential code changes likely responsible for the issue ordered by their logprobs-ranked relevance."
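The election procedure in stage 2 can be sketched as follows. This is a minimal illustration, not Meta's implementation: `llm_top5` is a hypothetical stand-in for the fine-tuned ranker prompt that takes a batch of ≤20 candidate changes and returns the five most likely root causes.

```python
import random

def rank_election(candidates, llm_top5, batch_size=20):
    """Recursively narrow a candidate pool to a final top-5 via
    ranking-via-election: batch, select 5 per batch, repeat."""
    pool = list(candidates)
    while len(pool) > 5:
        random.shuffle(pool)  # assumption: batch composition is not fixed
        survivors = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            # batches small enough to survive whole are passed through
            survivors.extend(llm_top5(batch) if len(batch) > 5 else batch)
        pool = survivors
    return pool
```

A few hundred retriever outputs converge in a handful of passes (e.g. 300 → 75 → 20 → 5 with full batches of 20).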

Training pipeline

  • Base: Llama 2 (7B).
  • Continued pre-training (CPT): on "limited and approved internal wikis, Q&As, and code" to expose the model to Meta artifacts.
  • Mixed supervised fine-tuning (SFT): Llama 2's original SFT data + internal context + dedicated RCA SFT dataset of ~5,000 instruction-tuning examples, each with 2-20 candidate changes + the known root cause + information available at investigation start.
  • Logprob-ranking SFT: a second SFT round teaches the model to emit ranked lists natively, enabling logprob-based scoring in addition to natural-language top-5 output.
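The logprob-based scoring that the second SFT round enables can be sketched like this. A sketch under stated assumptions: `token_logprobs` is a hypothetical hook returning the per-token log-probabilities the fine-tuned model assigns to emitting a given change as the root cause; length-normalising is one common choice to avoid penalising longer change descriptions.

```python
def logprob_rank(candidates, token_logprobs):
    """Rank candidate changes by the model's mean token log-probability
    of emitting each change as the answer (higher = more likely)."""
    def score(change):
        lps = token_logprobs(change)
        return sum(lps) / len(lps)  # mean logprob per token
    return sorted(candidates, key=score, reverse=True)
```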

Outcome disclosed

  • 42% top-5 accuracy at investigation-creation time on backtested historical web-monorepo investigations — measured with only information available when the investigation was opened.

Not disclosed: top-1 / top-3 accuracy, retriever recall, latency, GPU/token cost per investigation, production precision vs recall, responder-override rate, confidence-threshold cutoffs.
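The disclosed metric reduces to a simple backtest computation; a minimal sketch, assuming one known root-cause change per historical investigation:

```python
def top5_accuracy(predicted_top5s, true_root_causes):
    """Fraction of backtested investigations whose known root-cause
    change appears anywhere in the system's top-five list."""
    hits = sum(truth in top5
               for top5, truth in zip(predicted_top5s, true_root_causes))
    return hits / len(true_root_causes)
```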

Safety primitives

Meta's explicit design discipline for employee-facing AI features:

  • Closed feedback loops — responders can independently reproduce and validate the system's output.
  • Explainability — results are traceable to their inputs (which changes, which ownership rules, which code-graph paths).
  • Confidence thresholding — "detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision."
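The thresholding behaviour can be sketched as a gate in front of the ranked output. The threshold value and confidence source are assumptions for illustration, not disclosed parameters:

```python
def recommend(ranked, confidences, threshold=0.8):
    """Suppress the recommendation entirely when the top candidate's
    confidence is below the cutoff, trading reach for precision."""
    if not ranked or confidences[ranked[0]] < threshold:
        return None  # low confidence: show nothing rather than mislead
    return ranked[:5]
```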

Position in Meta's investigation-tooling lineage

  • Predecessor: systems/hawkeye-meta (December 2023) — ML-workflow debugging.
  • This system (June 2024) — web-monorepo incident response.
  • Future work named: "autonomously execute full workflows and validate their results" + "detect potential incidents prior to code push."

Relationship to Meta's pre-LLM analyzer pattern

The Presto-oncall analyzers (automated RCA, per sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale) use multi-source aggregation + rule-encoded heuristics + optional auto-remediation. This 2024 system retains the multi-source retrieval shape (code ownership + runtime code graph + investigation metadata) and the closed-feedback-loop posture, but replaces hand-coded rules in the reasoning stage with a fine-tuned LLM ranker.
