

Meta AI-assisted RCA system

Meta's internal AI-assisted root-cause analysis (RCA) system for reliability investigations on the web monorepo. Combines a heuristic retriever (narrows thousands of recent code changes to a few hundred via code/directory ownership + runtime code graph traversal) with a Llama 2 (7B)-based ranker that reduces the remaining candidates to a top-five list via ranking-via-election (20 candidates per prompt → 5 selected → recurse).

Architecture

Two stages:

  1. Heuristic retriever. Non-ML, domain-rule-encoded. Inputs: investigation title + observed impact + runtime signals. Outputs: "a few hundred" code changes from an input of "thousands." Uses code/directory ownership and runtime code-graph exploration of impacted systems. Meta's claim: "reducing the search space from thousands of changes to a few hundred without significant reduction in accuracy."
  2. LLM-based ranker. Llama 2 (7B) fine-tuned on Meta-specific artifacts + an RCA SFT dataset. Ranks via election: each prompt holds ≤20 changes and asks for the top five; the selections are aggregated and the process repeats until only five candidates remain. Additionally produces a logprob-ranked list using the fine-tuning prompt format, where the expected output is "a list of potential code changes likely responsible for the issue ordered by their logprobs-ranked relevance."
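The election procedure in stage 2 can be sketched as follows. This is a minimal illustration, not Meta's implementation: `llm_top5` is a hypothetical stand-in for the fine-tuned ranker prompt that takes a batch of ≤20 candidate changes and returns the five most likely root causes.

```python
import random

def rank_election(candidates, llm_top5, batch_size=20):
    """Recursively narrow a candidate pool to a final top-5 via
    ranking-via-election: batch, select 5 per batch, repeat."""
    pool = list(candidates)
    while len(pool) > 5:
        random.shuffle(pool)  # assumption: batch composition is not fixed
        survivors = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            # batches small enough to survive whole are passed through
            survivors.extend(llm_top5(batch) if len(batch) > 5 else batch)
        pool = survivors
    return pool
```

A few hundred retriever outputs converge in a handful of passes (e.g. 300 → 75 → 20 → 5 with full batches of 20).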

Training pipeline

  • Base: Llama 2 (7B).
  • Continued pre-training (CPT): on "limited and approved internal wikis, Q&As, and code" to expose the model to Meta artifacts.
  • Mixed supervised fine-tuning (SFT): Llama 2's original SFT data + internal context + dedicated RCA SFT dataset of ~5,000 instruction-tuning examples, each with 2-20 candidate changes + the known root cause + information available at investigation start.
  • Logprob-ranking SFT: a second SFT round teaches the model to emit ranked lists natively, enabling logprob-based scoring in addition to natural-language top-5 output.
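The logprob-based scoring that the second SFT round enables can be sketched like this. A sketch under stated assumptions: `token_logprobs` is a hypothetical hook returning the per-token log-probabilities the fine-tuned model assigns to emitting a given change as the root cause; length-normalising is one common choice to avoid penalising longer change descriptions.

```python
def logprob_rank(candidates, token_logprobs):
    """Rank candidate changes by the model's mean token log-probability
    of emitting each change as the answer (higher = more likely)."""
    def score(change):
        lps = token_logprobs(change)
        return sum(lps) / len(lps)  # mean logprob per token
    return sorted(candidates, key=score, reverse=True)
```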

Outcome disclosed

  • 42% top-5 accuracy at investigation-creation time on backtested historical web-monorepo investigations — measured with only information available when the investigation was opened.

Not disclosed: top-1 / top-3 accuracy, retriever recall, latency, GPU/token cost per investigation, production precision vs recall, responder-override rate, confidence-threshold cutoffs.
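The disclosed metric reduces to a simple backtest computation; a minimal sketch, assuming one known root-cause change per historical investigation:

```python
def top5_accuracy(predicted_top5s, true_root_causes):
    """Fraction of backtested investigations whose known root-cause
    change appears anywhere in the system's top-five list."""
    hits = sum(truth in top5
               for top5, truth in zip(predicted_top5s, true_root_causes))
    return hits / len(true_root_causes)
```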

Safety primitives

Meta's explicit design discipline for employee-facing AI features:

  • Closed feedback loops — responders can independently reproduce and validate the system's output.
  • Explainability — results are traceable to their inputs (which changes, which ownership rules, which code-graph paths).
  • Confidence thresholding — "detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision."
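The thresholding behaviour can be sketched as a gate in front of the ranked output. The threshold value and confidence source are assumptions for illustration, not disclosed parameters:

```python
def recommend(ranked, confidences, threshold=0.8):
    """Suppress the recommendation entirely when the top candidate's
    confidence is below the cutoff, trading reach for precision."""
    if not ranked or confidences[ranked[0]] < threshold:
        return None  # low confidence: show nothing rather than mislead
    return ranked[:5]
```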

Position in Meta's investigation-tooling lineage

  • Predecessor: systems/hawkeye-meta (December 2023) — ML-workflow debugging.
  • This system (June 2024) — web-monorepo incident response.
  • Future work named: "autonomously execute full workflows and validate their results" + "detect potential incidents prior to code push."

Relationship to Meta's pre-LLM analyzer pattern

The Presto-oncall analyzers (automated RCA, per sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale) use multi-source aggregation + rule-encoded heuristics + optional auto-remediation. This 2024 system retains the multi-source retrieval shape (code ownership + runtime code graph + investigation metadata) and the closed-feedback-loop posture, but replaces hand-coded rules in the reasoning stage with a fine-tuned LLM ranker.
