Meta AI-assisted RCA system¶
Meta's internal AI-assisted root-cause analysis (RCA) system for reliability investigations on the web monorepo. Combines a heuristic retriever (narrows thousands of recent code changes to a few hundred via code/directory ownership + runtime code graph traversal) with a Llama 2 (7B)-based ranker that reduces the remaining candidates to a top-five list via ranking-via-election (20 candidates per prompt → 5 selected → recurse).
Architecture¶
Two stages:
- Heuristic retriever. Non-ML, domain-rule-encoded. Inputs: investigation title + observed impact + runtime signals. Outputs: "a few hundred" code changes from an input of "thousands." Uses code/directory ownership and runtime code-graph exploration of impacted systems. Meta's claim: "reducing the search space from thousands of changes to a few hundred without significant reduction in accuracy."
- LLM-based ranker. Llama 2 (7B) fine-tuned on Meta-specific artefacts + an RCA SFT dataset. Ranks via election: each prompt holds ≤20 changes and asks for a top-5; the selections are aggregated and the process repeats until only five candidates remain. Additionally produces a logprob-ranked list using the fine-tuning prompt format, where the expected output is "a list of potential code changes likely responsible for the issue ordered by their logprobs-ranked relevance."
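The election stage can be sketched as a tournament loop. This is a minimal illustration of the shape described above, not Meta's implementation; `ask_llm_top5` is an assumed stand-in for a call to the fine-tuned ranker.

```python
from typing import Callable, List

def rank_via_election(
    candidates: List[str],
    ask_llm_top5: Callable[[List[str]], List[str]],
    prompt_size: int = 20,
) -> List[str]:
    """Partition candidates into prompts of <= prompt_size, keep each
    prompt's top 5, and repeat until at most five candidates remain."""
    pool = list(candidates)
    while len(pool) > 5:
        survivors: List[str] = []
        for i in range(0, len(pool), prompt_size):
            chunk = pool[i : i + prompt_size]
            # The ranker is asked for its top five within this prompt.
            survivors.extend(ask_llm_top5(chunk)[:5])
        if len(survivors) >= len(pool):  # guard against a non-shrinking round
            break
        pool = survivors
    return pool[:5]
```

With "a few hundred" retriever outputs, each round shrinks the pool by roughly 4× (5 survivors per 20-candidate prompt), so only a handful of rounds are needed.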
Training pipeline¶
- Base: Llama 2 (7B).
- Continued pre-training (CPT): on "limited and approved internal wikis, Q&As, and code" to expose the model to Meta artifacts.
- Mixed supervised fine-tuning (SFT): Llama 2's original SFT data + internal context + dedicated RCA SFT dataset of ~5,000 instruction-tuning examples, each with 2-20 candidate changes + the known root cause + information available at investigation start.
- Logprob-ranking SFT: a second SFT round teaches the model to emit ranked lists natively, enabling logprob-based scoring in addition to natural-language top-5 output.
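The logprob-based scoring enabled by that second SFT round can be sketched as follows. This assumes per-candidate token log-probabilities have already been extracted from the model under the fine-tuning prompt format; the mapping and the length normalisation are illustrative assumptions, not disclosed details.

```python
from typing import Dict, List

def logprob_rank(
    candidates: List[str],
    token_logprobs: Dict[str, List[float]],
) -> List[str]:
    """Order candidates by the mean log-probability the model assigns to
    each candidate's tokens in the ranked-list output format.
    `token_logprobs` is an assumed pre-computed mapping; in practice the
    scores would come from querying the fine-tuned model."""
    def score(candidate: str) -> float:
        lps = token_logprobs[candidate]
        return sum(lps) / len(lps)  # length-normalised logprob
    return sorted(candidates, key=score, reverse=True)
```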
Outcome disclosed¶
- 42% top-5 accuracy at investigation-creation time on backtested historical web-monorepo investigations — measured with only information available when the investigation was opened.
- Not disclosed: top-1 / top-3 accuracy, retriever recall, latency, GPU/token cost per investigation, production precision vs recall, responder-override rate, confidence-threshold cutoffs.
Safety primitives¶
Meta's explicit design discipline for employee-facing AI features:
- Closed feedback loops — responders can independently reproduce and validate the system's output.
- Explainability — results are traceable to their inputs (which changes, which ownership rules, which code-graph paths).
- Confidence thresholding — "detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision."
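The thresholding policy amounts to a simple gate on the ranker's output. A minimal sketch, assuming a 0–1 confidence score per candidate and an arbitrary cutoff (Meta does not disclose its threshold):

```python
from typing import List, Optional, Tuple

def gate_by_confidence(
    ranked: List[Tuple[str, float]],  # (candidate, confidence in [0, 1])
    threshold: float = 0.7,           # assumed cutoff, not a disclosed value
) -> Optional[List[str]]:
    """Return the top-5 list only when the best candidate clears the
    threshold; otherwise suppress the recommendation entirely,
    sacrificing reach in favor of precision."""
    if not ranked or ranked[0][1] < threshold:
        return None  # low confidence: show nothing rather than a bad answer
    return [candidate for candidate, _ in ranked[:5]]
```

Returning `None` (no recommendation) rather than a hedged answer is the point of the policy: the feature simply stays silent below the cutoff.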
Position in Meta's investigation-tooling lineage¶
- Predecessor: systems/hawkeye-meta (December 2023) — ML-workflow debugging.
- This system (June 2024) — web-monorepo incident response.
- Future work named: "autonomously execute full workflows and validate their results" + "detect potential incidents prior to code push."
Relationship to Meta's pre-LLM analyzer pattern¶
The Presto-oncall analyzers (automated RCA, per sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale) use multi-source aggregation + rule-encoded heuristics + optional auto-remediation. This 2024 system retains the multi-source retrieval shape (code ownership + runtime code graph + investigation metadata) and the closed-feedback-loop posture, but replaces hand-coded rules in the reasoning stage with a fine-tuned LLM ranker.
Seen in¶
- sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response — canonical introduction.
Related¶
- systems/llama-2 — the base model.
- systems/hawkeye-meta — the ML-workflow debugging predecessor.
- concepts/llm-based-ranker — the architectural shape of stage 2.
- concepts/heuristic-retrieval — the architectural shape of stage 1.
- concepts/ranking-via-election — the tournament-style prompt structure.
- concepts/automated-root-cause-analysis — the capability class.
- concepts/continued-pretraining — the base-model-adaptation step.
- concepts/supervised-fine-tuning — the task-teaching step.
- patterns/retrieve-then-rank-llm — the end-to-end pattern.
- patterns/closed-feedback-loop-ai-features — the safety discipline.
- patterns/confidence-thresholded-ai-output — the precision-over-reach policy.
- companies/meta