
CONCEPT Cited by 2 sources

Automated root-cause analysis

Definition

Automated root-cause analysis (RCA) is the practice of encoding enough operational knowledge in tooling that, when an alert fires, the tooling computes a probable root cause and hands it to the oncall, instead of leaving the oncall to work it out from dashboards. In the limit, the remediation itself is automated and the oncall is never paged.

The Meta analyzer pattern

sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale names this explicitly:

"When monitoring systems fire alerts about customer-facing SLA breaches, the analyzers are triggered. They source information from a wide range of monitoring systems (Operational Data Store or ODS), events published to Scuba, and even host-level logs. Custom logic in the analyzer then ties all this information together to infer probable root cause."

Three features distinguish the Meta analyzer from a generic alert:

  1. Multi-source aggregation. The analyzer pulls from ODS (metrics), Scuba (events), and host logs — three very different data shapes.
  2. Rule-encoded heuristics. Custom logic ties the signals together — the analyzer is not a generic anomaly detector but a Presto-domain-specific reasoner.
  3. Optional auto-remediation. For some failure classes "we have completely automated both the debugging and the remediation so that the oncall doesn't even need to get involved."
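
The three features above can be sketched roughly as follows. This is a minimal illustration, not Meta's implementation: the source does not describe the analyzer's code or the ODS, Scuba, or host-log APIs, so the `Evidence` fields, the two rules, the thresholds, and the fetcher callables are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Evidence:
    """Signals gathered for one alert (every field here is hypothetical)."""
    metrics: dict = field(default_factory=dict)    # stand-in for ODS counters
    events: list = field(default_factory=list)     # stand-in for Scuba events
    log_lines: list = field(default_factory=list)  # stand-in for host-level logs


@dataclass
class Diagnosis:
    cause: str
    # None means "no automated fix known; hand the cause to the oncall".
    remediation: Optional[Callable[[], str]] = None


def gather(fetch_metrics, fetch_events, fetch_logs) -> Evidence:
    # Feature 1: multi-source aggregation into one record.
    return Evidence(fetch_metrics(), fetch_events(), fetch_logs())


def rule_worker_oom(ev: Evidence) -> Optional[Diagnosis]:
    # Feature 2: a domain-specific heuristic that combines logs and metrics.
    # The restart threshold is an assumption chosen for illustration.
    oom = [line for line in ev.log_lines if "OutOfMemoryError" in line]
    if oom and ev.metrics.get("worker_restarts", 0) > 3:
        return Diagnosis("workers restarting after out-of-memory errors")
    return None


def rule_stuck_coordinator(ev: Evidence) -> Optional[Diagnosis]:
    # Feature 3: some rules carry an automated fix alongside the diagnosis.
    stuck = [e for e in ev.events if e.get("type") == "coordinator_unresponsive"]
    if stuck:
        host = stuck[0]["host"]
        return Diagnosis(f"coordinator {host} unresponsive",
                         remediation=lambda: f"restarted {host}")
    return None


RULES = [rule_worker_oom, rule_stuck_coordinator]


def analyze(ev: Evidence) -> Diagnosis:
    # The first matching rule wins; anything unmatched falls back to a human.
    for rule in RULES:
        diagnosis = rule(ev)
        if diagnosis is not None:
            return diagnosis
    return Diagnosis("unknown; manual investigation needed")


def handle_alert(ev: Evidence) -> str:
    diagnosis = analyze(ev)
    if diagnosis.remediation is not None:
        return f"auto-remediated: {diagnosis.remediation()} ({diagnosis.cause})"
    return f"paged oncall with probable cause: {diagnosis.cause}"
```

Keeping each heuristic as a separate function is one way to let operational knowledge accumulate rule by rule; a real analyzer would likely also need rule priorities and confidence levels, which this sketch omits.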

Two Meta examples from the source

  • Bad-host detection + auto-drain. An attribution analyzer identifies a host as the source of too many query failures, and that host is then drained automatically (see patterns/bad-host-auto-drain).
  • Queueing-issue debugging. When queries queue too long, the analyzer pulls routing-decision inputs (cluster queue state, datacenter topology, table data locality) and outputs probable root cause. "This is another instance where analyzers come to the fore by pulling information from multiple sources and presenting conclusions."
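
The bad-host example can be illustrated with a small attribution sketch. The source does not say how failures are attributed to hosts or what "too many" means, so the `(query_id, host)` record shape, both thresholds, and the `drain` callable below are assumptions made purely for illustration.

```python
from collections import Counter


def find_bad_hosts(failures, min_failures=20, min_share=0.5):
    """Return hosts that account for a disproportionate share of failures.

    failures: iterable of (query_id, host) pairs, one per failed query
    (hypothetical record shape). A host is flagged only if it has at least
    `min_failures` failures AND at least `min_share` of all failures, so a
    cluster-wide outage does not flag every host.
    """
    per_host = Counter(host for _, host in failures)
    total = sum(per_host.values())
    if total == 0:
        return []
    return sorted(
        host for host, count in per_host.items()
        if count >= min_failures and count / total >= min_share
    )


def auto_drain(hosts, drain):
    # `drain` is a caller-supplied function (hypothetical) that removes
    # one host from rotation; it is called once per flagged host.
    for host in hosts:
        drain(host)
```

For example, with 30 failures on `host-a` and 5 on `host-b`, only `host-a` is flagged under the default thresholds and passed to `drain`.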

Why it matters at scale

Manual investigation does not keep pace with hyperscale operations. Meta's explicit scaling advice:

"Manual investigations in the face of customer-impacting production issues are not scalable. It's imperative to have automated debugging in place so that the root cause can be quickly determined."

The analyzer is the bridge between a customer-facing SLA breach alert and either a human fix or a machine fix.

Seen in
