Oncall analyzer¶
Purpose-built tool that pulls cross-system signals on alert, applies custom logic to narrow the root cause, and hands the oncall a probable diagnosis plus ranked mitigation options; for some well-understood failure classes it remediates end to end with no human involved. It sits between the alert fabric (SLA breach → page) and the oncall human, replacing the "stare-at-dashboards" phase of incident response.
Why analyzers, not dashboards¶
At small scale an oncall can eyeball dashboards and compare metrics by hand. At large scale with complex routing (queueing state, hardware distribution, data locality for every query) a human cannot hold enough context in working memory to diagnose quickly. Dashboards still help, but the first-step root-cause analysis (RCA) is better done by a tool that already knows which systems to query and which rules to apply.
Machinery required¶
- Alert trigger: analyzer runs when monitoring fires on a customer-facing-SLA breach, not on a schedule.
- Cross-system data pull: the analyzer queries monitoring systems (e.g. ODS-style metric stores, Scuba-style log/event analytics), host-level logs, and cluster-state sources in one pass.
- Custom rule logic: domain-specific heuristics narrow the root cause. For query engines this might include: "is one cluster backed up? is a specific datacenter's locality broken? is a specific query shape flooding a tier?"
- Output: probable root cause plus ranked mitigation options for the oncall to act on, or, for specific well-understood failure classes, automated remediation with no oncall touch. (A minimal sketch of the whole pipeline follows this list.)
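A minimal sketch of that pipeline in Python, with stub signal pulls standing in for the real ODS/Scuba/host-log clients; every name, threshold, and confidence value here is illustrative, not an API from the source:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Diagnosis:
    root_cause: str
    confidence: float                        # 0.0-1.0, set by the rule that fired
    mitigations: list[str]                   # ranked, most promising first
    remediate: Optional[Callable[[], None]] = None  # only for well-understood classes

# A rule inspects the pulled signals and either names a root cause or passes.
Rule = Callable[[dict], Optional[Diagnosis]]

def rule_cluster_backlog(signals: dict) -> Optional[Diagnosis]:
    # "Is one cluster backed up?" -- threshold is illustrative.
    cluster, depth = max(signals["queue_depth_by_cluster"].items(),
                         key=lambda kv: kv[1])
    if depth > 10_000:
        return Diagnosis(
            root_cause=f"cluster {cluster} queue backed up ({depth} queries waiting)",
            confidence=0.8,
            mitigations=[f"shift routing away from {cluster}",
                         f"shed low-priority load on {cluster}"],
        )
    return None

def analyze(alert: dict,
            pull_signals: Callable[[dict], dict],
            rules: list[Rule]) -> Diagnosis:
    # Triggered by the SLA-breach page, not on a schedule.
    # One pass over the cross-system sources: metric store, log/event
    # analytics, host logs, cluster state.
    signals = pull_signals(alert)
    for rule in rules:
        diagnosis = rule(signals)
        if diagnosis is None:
            continue
        # Well-understood failure class with high confidence:
        # remediate with no oncall touch.
        if diagnosis.remediate is not None and diagnosis.confidence >= 0.95:
            diagnosis.remediate()
        return diagnosis
    # No rule matched: fall back to paging the human with raw signals attached.
    return Diagnosis("unknown; escalating raw signals to oncall",
                     0.0, ["manual investigation"])
```

First match wins here for brevity; an analyzer with many overlapping heuristics would more plausibly evaluate every rule and rank the resulting candidates.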
Evolution toward autonomy¶
The rule-encoded analyzers of 2023 are evolving toward LLM-based analyzers by 2024: see systems/meta-rca-system for Meta's Llama-2-based retrieve-then-rank RCA system for web-monorepo investigations, a sibling of the Presto analyzer pattern at a different altitude.
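The LLM variant replaces hand-written rules with retrieval plus model ranking. A rough sketch of retrieve-then-rank under stated assumptions: `embed` and `llm_rank_score` are hypothetical stand-ins for whatever embedding and Llama-2 scoring calls the real system uses.

```python
from typing import Callable

def retrieve_then_rank(
    incident_summary: str,
    candidate_changes: list[str],                # e.g. recent diffs in the affected area
    embed: Callable[[str], list[float]],         # hypothetical embedding call
    llm_rank_score: Callable[[str, str], float], # hypothetical LLM relevance score
) -> list[str]:
    # Stage 1: cheap retrieval narrows thousands of candidates to a
    # shortlist by embedding similarity to the incident summary.
    query_vec = embed(incident_summary)

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    shortlist = sorted(candidate_changes,
                       key=lambda c: cosine(query_vec, embed(c)),
                       reverse=True)[:50]
    # Stage 2: the expensive model re-ranks only the shortlist.
    return sorted(shortlist,
                  key=lambda c: llm_rank_score(incident_summary, c),
                  reverse=True)
```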
Seen in¶
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — Meta's Presto fleet. Oncall analyzers pull from ODS, Scuba, and host logs, apply custom logic, and present the oncall with a probable root cause for queueing / failure incidents. The escalation the source names: "we have completely automated both the debugging and the remediation so that the oncall doesn't even need to get involved" for some failure classes. Queueing incidents are the canonical example: routing decisions involve current queueing state across clusters, hardware distribution, and data locality, which the analyzer can cross-reference faster than a human can (see the sketch below).
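For the queueing case specifically, one hedged guess at what such a cross-reference rule could look like, reusing `Diagnosis`, `Optional`, and `analyze` from the pipeline sketch above; the signal keys and thresholds are invented for illustration:

```python
def rule_locality_broken(signals: dict) -> Optional[Diagnosis]:
    # "Is a specific datacenter's locality broken?" Cross-reference
    # per-cluster queue depth with the fraction of reads hitting local data.
    for cluster, depth in signals["queue_depth_by_cluster"].items():
        local = signals["local_read_fraction_by_cluster"][cluster]
        if depth > 5_000 and local < 0.5:  # illustrative thresholds
            return Diagnosis(
                root_cause=(f"{cluster} is queueing while reading mostly remote "
                            "data; likely broken data locality, not raw capacity"),
                confidence=0.7,
                mitigations=[f"re-route locality-sensitive queries away from {cluster}",
                             "check cache/placement state for the cluster's datacenter"],
            )
    return None
```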