CONCEPT Cited by 3 sources
Automated root-cause analysis¶
Definition¶
Automated root-cause analysis (RCA) is the practice of encoding enough operational knowledge in tooling that, when an alert fires, a probable root cause is computed and handed to the oncall, rather than left for a human to infer from dashboards. In the limit, the remediation itself is automated and the oncall is never paged.
The Meta analyzer pattern¶
sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale names this explicitly:
"When monitoring systems fire alerts about customer-facing SLA breaches, the analyzers are triggered. They source information from a wide range of monitoring systems (Operational Data Store or ODS), events published to Scuba, and even host-level logs. Custom logic in the analyzer then ties all this information together to infer probable root cause."
Three features distinguish the Meta analyzer from a generic alert:
- Multi-source aggregation. The analyzer pulls from ODS (metrics), Scuba (events), and host logs — three very different data shapes.
- Rule-encoded heuristics. Custom logic ties the signals together — the analyzer is not a generic anomaly detector but a Presto-domain-specific reasoner.
- Optional auto-remediation. For some failure classes "we have completely automated both the debugging and the remediation so that the oncall doesn't even need to get involved."
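The three features above can be sketched as a small rule-driven analyzer. This is an illustrative sketch only, not Meta's implementation: the `Signals` and `Diagnosis` types, the rule functions, the `queue_depth` threshold, and the "drained" placeholder action are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Signals:
    """Evidence gathered for one alert. All field names are hypothetical."""
    metrics: dict = field(default_factory=dict)    # stand-in for ODS metrics
    events: list = field(default_factory=list)     # stand-in for Scuba events
    host_logs: dict = field(default_factory=dict)  # host name -> log lines


@dataclass
class Diagnosis:
    cause: str
    evidence: str
    # None means "hand the conclusion to the oncall".
    remediation: Optional[Callable[[], str]] = None


def rule_queue_saturation(s: Signals) -> Optional[Diagnosis]:
    """Metric-only rule; the threshold of 100 is an arbitrary example value."""
    depth = s.metrics.get("queue_depth", 0.0)
    if depth > 100:
        return Diagnosis("queue saturation", f"queue_depth={depth:g}")
    return None


def rule_host_oom(s: Signals) -> Optional[Diagnosis]:
    """Ties events to host logs: a host with failure events and an OOM log line."""
    failing = {
        e["host"] for e in s.events
        if e.get("type") == "query_failed" and "host" in e
    }
    for host in sorted(failing):
        if any("OutOfMemory" in line for line in s.host_logs.get(host, [])):
            return Diagnosis(
                "host out of memory",
                f"OOM in logs of {host}",
                remediation=lambda h=host: f"drained {h}",  # placeholder action
            )
    return None


def handle_alert(signals: Signals, rules) -> list:
    """Run every rule; auto-remediate where possible, otherwise page the oncall."""
    actions = []
    for rule in rules:
        d = rule(signals)
        if d is None:
            continue
        if d.remediation is not None:
            actions.append(f"auto: {d.remediation()}")
        else:
            actions.append(f"page oncall: {d.cause} ({d.evidence})")
    return actions
```

Keeping each heuristic as a separate function means domain knowledge can be added one rule at a time, and only the rules that carry a `remediation` bypass the human.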
Two Meta examples from the source¶
- Bad-host detection + auto-drain. An attribution analyzer identifies a host as the source of too many query failures, and that host is then drained automatically (see patterns/bad-host-auto-drain).
- Queueing-issue debugging. When queries queue too long, the analyzer pulls routing-decision inputs (cluster queue state, datacenter topology, table data locality) and outputs a probable root cause. "This is another instance where analyzers come to the fore by pulling information from multiple sources and presenting conclusions."
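A minimal sketch of the first example, bad-host attribution followed by a drain, might look like the following. The source does not describe Meta's attribution logic, so the failure-share rule, the `min_failures` and `share_threshold` defaults, and the `max_fraction` safety cap are all assumptions made for illustration.

```python
from collections import Counter


def find_bad_hosts(failed_query_hosts, min_failures=5, share_threshold=0.5):
    """Return hosts that account for at least `share_threshold` of failures.

    `failed_query_hosts` holds one host name per failed query. Both
    thresholds are hypothetical example values.
    """
    counts = Counter(failed_query_hosts)
    total = sum(counts.values())
    if total < min_failures:
        return []  # too little evidence to blame any host
    return sorted(h for h, n in counts.items() if n / total >= share_threshold)


def auto_drain(active_hosts, bad_hosts, max_fraction=0.25):
    """Remove bad hosts from the active set, capped at `max_fraction` of it.

    The cap is an assumed guard so that a faulty attribution cannot drain
    most of a cluster. Returns (remaining_hosts, drained_hosts).
    """
    budget = int(len(active_hosts) * max_fraction)
    to_drain = [h for h in bad_hosts if h in active_hosts][:budget]
    return set(active_hosts) - set(to_drain), to_drain
```

For instance, if eight of ten failed queries ran on `h1`, `find_bad_hosts` returns `["h1"]`, and `auto_drain` removes it from a four-host cluster because one host fits within the 25% budget.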
Why it matters at scale¶
Manual investigation does not scale to hyperscale operations. Meta's advice is explicit:
"Manual investigations in the face of customer-impacting production issues are not scalable. It's imperative to have automated debugging in place so that the root cause can be quickly determined."
The analyzer is the bridge between a customer-facing SLA breach alert and either a human fix or a machine fix.
Seen in¶
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the "analyzers" pattern for Meta's Presto oncall.
- sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response — Meta's LLM-powered sibling of the analyzer pattern, applied to web monorepo incident investigations. Preserves multi-source aggregation (code + directory ownership + runtime code graph + investigation metadata) and the closed-feedback-loop + precision-over-reach posture; replaces hand-coded reasoning rules with a fine-tuned Llama 2 (7B) ranker via ranking-via-election. 42% top-5 accuracy at investigation-creation time on backtested historical investigations. Canonical wiki instance of retrieve-then-rank-LLM applied to RCA.
- sources/2026-04-28-expedia-expedias-service-telemetry-analyzer — Expedia STAR applied to service-level outage / degradation investigation using observability metrics from Datadog (traffic, latency, saturation, Kubernetes signals, JVM signals). A deliberately non-agentic LLM realisation: a fixed four-step prompt chain (collect → per-metric analyse → aggregate RCA → return insights) with no tool use, no MCP, no RAG, no memory — the design aim is "avoid the additional and currently less understood failure modes of an agent." Complements Meta's RCA variants along three axes: (1) service-metric-first rather than code-first; (2) prompt-chain rather than retrieve-then-rank or heuristic-rules; (3) SME-gated qualitative evaluation rather than 42%-top-5 offline backtest. Canonical wiki instance of patterns/multi-step-rca-workflow and patterns/static-prompt-chain-over-agent-loop.
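The fixed four-step chain described for Expedia STAR (collect, per-metric analyse, aggregate RCA, return insights) can be sketched as a plain function with no agent loop. This is a hedged illustration rather than Expedia's code: `run_chain`, the `collect` and `llm` callables, the prompt wording, and the result keys are hypothetical, and the model is injected as an ordinary function so that no real LLM or Datadog API is assumed.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases: any function with these shapes can be plugged in.
Collector = Callable[[str], Dict[str, List[float]]]
LLM = Callable[[str], str]


def run_chain(service: str, collect: Collector, llm: LLM) -> dict:
    """Static prompt chain: the same steps run in the same order every time."""
    # Step 1: collect metrics for the service.
    metrics = collect(service)

    # Step 2: one prompt per metric, each analysed in isolation.
    findings = {
        name: llm(f"Analyse metric '{name}' for service '{service}': {series}")
        for name, series in metrics.items()
    }

    # Step 3: a single aggregation prompt over all per-metric findings.
    summary = "\n".join(f"{name}: {text}" for name, text in findings.items())
    root_cause = llm(
        "Given these per-metric findings, state the most probable root cause:\n"
        + summary
    )

    # Step 4: return the insights; no tool calls, retrieval, or memory involved.
    return {"service": service, "findings": findings, "root_cause": root_cause}
```

Because the control flow is fixed, the number of model calls is always the number of metrics plus one, which makes cost and failure modes easier to reason about than in an agent that chooses its own next step.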
Related¶
- concepts/customer-facing-sla — the trigger for analyzers.
- concepts/queueing-theory — queueing is a canonical analyzer target.
- concepts/llm-based-ranker — the stage-2 role in the LLM-powered variant.
- concepts/heuristic-retrieval — the stage-1 role in the LLM-powered variant.
- concepts/ranking-via-election — the prompt-structure primitive.
- concepts/prompt-chaining — the orchestration primitive in the static-chain (Expedia STAR) variant.
- concepts/time-to-know-vs-time-to-recover — the RCA KPIs.
- patterns/oncall-analyzer — the pre-LLM pattern.
- patterns/retrieve-then-rank-llm — the Meta LLM-powered pattern.
- patterns/multi-step-rca-workflow — the static-chain LLM-powered pattern.
- patterns/static-prompt-chain-over-agent-loop — the generalised "stay below agent altitude" architectural posture.
- patterns/closed-feedback-loop-ai-features — the safety discipline Meta pairs with both variants.
- systems/meta-presto-gateway — the system whose routing decisions analyzers often need to reconstruct.
- systems/meta-rca-system — Meta's LLM-powered RCA system.
- systems/expedia-star — Expedia's static-chain RCA system.