CONCEPT Cited by 2 sources
Automated root-cause analysis¶
Definition¶
Automated root-cause analysis (RCA) is the practice of encoding enough operational knowledge in tooling that, when an alert fires, a probable root cause is computed and handed to the oncall (rather than left for the oncall to infer from dashboards); in the limit, the remediation itself is also automated and the oncall is never paged.
The Meta analyzer pattern¶
sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale names this explicitly:
"When monitoring systems fire alerts about customer-facing SLA breaches, the analyzers are triggered. They source information from a wide range of monitoring systems (Operational Data Store or ODS), events published to Scuba, and even host-level logs. Custom logic in the analyzer then ties all this information together to infer probable root cause."
Three features distinguish the Meta analyzer from a generic alert:
- Multi-source aggregation. The analyzer pulls from ODS (metrics), Scuba (events), and host logs — three very different data shapes.
- Rule-encoded heuristics. Custom logic ties the signals together — the analyzer is not a generic anomaly detector but a Presto-domain-specific reasoner.
- Optional auto-remediation. For some failure classes "we have completely automated both the debugging and the remediation so that the oncall doesn't even need to get involved."
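The three features above can be sketched as a minimal analyzer skeleton. This is a toy illustration, not Meta's implementation: the signal shapes, thresholds, and rules are all hypothetical stand-ins for the ODS/Scuba/host-log inputs and the Presto-specific logic the source describes.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Analyzer output: a probable root cause plus the evidence behind it."""
    cause: str
    evidence: list = field(default_factory=list)
    auto_remediable: bool = False


def analyze(metrics, events, host_logs):
    """Hypothetical analyzer: aggregate three differently-shaped sources
    (metrics dict, event dicts, raw log lines), then apply domain rules."""
    # Rule 1 (events + logs): one host dominates query failures.
    failures_by_host = {}
    for ev in events:
        if ev.get("type") == "query_failure":
            failures_by_host[ev["host"]] = failures_by_host.get(ev["host"], 0) + 1
    total = sum(failures_by_host.values())
    for host, n in failures_by_host.items():
        if total and n / total > 0.5:  # hypothetical attribution threshold
            evidence = [f"{host} accounts for {n}/{total} failures"]
            # Corroborate with host-level logs before blaming the host.
            if any(host in line and "OOM" in line for line in host_logs):
                evidence.append(f"{host} logs show OOM")
            return Finding(f"bad host: {host}", evidence, auto_remediable=True)
    # Rule 2 (metrics): queries are queueing too long.
    if metrics.get("p95_queue_seconds", 0) > 300:  # hypothetical SLA bound
        return Finding("queueing: cluster saturated or routing imbalance",
                       ["p95 queue time above 300s"])
    return None
```

The shape is the point: several sources feed one function, hand-written rules tie them together, and the output carries an `auto_remediable` flag so some findings can skip the oncall entirely.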
Two Meta examples from the source¶
- Bad-host detection + auto-drain. An attribution analyzer identifies a host as the source of too many query failures → auto-drain (see patterns/bad-host-auto-drain).
- Queueing-issue debugging. When queries queue too long, the analyzer pulls routing-decision inputs (cluster queue state, datacenter topology, table data locality) and outputs probable root cause. "This is another instance where analyzers come to the fore by pulling information from multiple sources and presenting conclusions."
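The first example closes the loop from finding to fix. A minimal sketch of that hand-off, assuming an analyzer finding shaped like the one above and a hypothetical `drain` callable (in Meta's case, removing the host from the routing pool):

```python
def auto_remediate(finding, drain):
    """Hypothetical remediation hook: when the analyzer attributes failures
    to a single bad host, drain it without paging the oncall."""
    if finding.get("auto_remediable") and finding["cause"].startswith("bad host: "):
        host = finding["cause"].split("bad host: ", 1)[1]
        drain(host)  # e.g. pull the host out of the gateway's routing pool
        return host
    return None  # not auto-remediable: page the oncall with the finding
```

The guard matters: only failure classes where both debugging and remediation are fully understood get the automated path; everything else still reaches a human, but with the analyzer's conclusions attached.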
Why it matters at scale¶
At hyperscale, manual investigation cannot keep pace with incident volume. Meta's scaling advice is explicit:
"Manual investigations in the face of customer-impacting production issues are not scalable. It's imperative to have automated debugging in place so that the root cause can be quickly determined."
The analyzer is the bridge between a customer-facing SLA breach alert and either a human fix or a machine fix.
Seen in¶
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the "analyzers" pattern for Meta's Presto oncall.
- sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response — Meta's LLM-powered sibling of the analyzer pattern, applied to incident investigations in the web monorepo. It preserves multi-source aggregation (code + directory ownership + runtime code graph + investigation metadata) and the closed-feedback-loop, precision-over-reach posture, but replaces hand-coded reasoning rules with a fine-tuned Llama 2 (7B) ranker using ranking-via-election. Backtested on historical investigations, it reaches 42% top-5 accuracy at investigation-creation time. Canonical wiki instance of retrieve-then-rank-llm applied to RCA.
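The two-stage shape of the LLM-powered variant can be sketched as follows. Everything here is a stand-in: the heuristic scorer is a toy file-overlap metric, and `vote` is a placeholder for the fine-tuned Llama 2 ranker that Meta queries in small batches (ranking-via-election).

```python
def retrieve(changes, incident_files, k=50):
    """Stage 1, heuristic retrieval (hypothetical scorer): narrow recent
    code changes to k candidates by overlap with implicated files."""
    scored = [(len(incident_files & set(c["files"])), c) for c in changes]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def rank_via_election(candidates, vote, rounds=3, top=5):
    """Stage 2, ranking-via-election (LLM stand-in): ask the ranker to
    vote repeatedly, tally the votes, and return the top candidates."""
    tally = {c["id"]: 0 for c in candidates}
    for _ in range(rounds):
        for winner_id in vote(candidates):  # vote() = placeholder for the LLM ranker
            tally[winner_id] += 1
    return sorted(candidates, key=lambda c: tally[c["id"]], reverse=True)[:top]
```

The design choice the sketch preserves: a cheap, high-recall first stage bounds the candidate set, and the expensive model only ever ranks a short list, with repeated elections smoothing out single-call noise.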
Related¶
- concepts/customer-facing-sla — the trigger for analyzers.
- concepts/queueing-theory — queueing is a canonical analyzer target.
- concepts/llm-based-ranker — the stage-2 role in the LLM-powered variant.
- concepts/heuristic-retrieval — the stage-1 role in the LLM-powered variant.
- concepts/ranking-via-election — the prompt-structure primitive.
- patterns/oncall-analyzer — the pre-LLM pattern.
- patterns/retrieve-then-rank-llm — the LLM-powered pattern.
- patterns/closed-feedback-loop-ai-features — the safety discipline Meta pairs with both variants.
- systems/meta-presto-gateway — the system whose routing decisions analyzers often need to reconstruct.
- systems/meta-rca-system — the LLM-powered RCA system.