Multi-step RCA workflow¶
Problem. On-call engineers investigating a service degradation or outage need to (a) pull the relevant telemetry, (b) interpret each signal, (c) combine per-signal findings into a probable root-cause hypothesis, and (d) produce an actionable write-up within the incident's TTR (time-to-recover) budget. The work is cognitive, repetitive across incidents, and bounded by the engineer's recall of past incidents plus the org's documented runbooks.
Solution. Encode the investigation as a fixed multi-step workflow that mixes deterministic data-fetch steps with LLM-based reasoning steps. Each step has a narrow scope (one metric, one signal class, one correlation); the final step aggregates per-step outputs into a single root-cause reasoning pass. Orchestration is code — the LLM is a reasoning component inside a deterministic pipeline.
Canonicalised by Expedia STAR (2026-04-28):
"Overall, STAR provides multi-step workflows, which are visualized below. In specific: 1. It collects telemetry data. 2. It analyzes these metrics and the associated metadata using AI models and domain-specific prompts and rules. 3. It aggregates all analyses and conducts a final root cause analysis. 4. It returns insights and recommendations." (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer)
The four-step shape¶
```
(1) collect telemetry ──▶ (2) per-signal analysis ──▶ (3) aggregated RCA ──▶ (4) format + return

deterministic:            LLM: N calls in parallel,   LLM: 1 call that       deterministic:
Datadog API               one role + prompt per       consumes (2)'s         insights +
(metrics, metadata,       signal class (latency,      outputs as             recommendations
Kubernetes, JVM)          saturation, Kubernetes,     generated knowledge
                          JVM, errors, traffic)       (see concepts/
                                                      generated-knowledge-
                                                      prompting)
```
Step 1 and step 4 are deterministic code; step 2 is an LLM fanout (N metric classes × one prompt each); step 3 is a single LLM reasoning pass.
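A minimal sketch of the orchestration shape, assuming a generic chat client and a metrics helper (`llm_complete` and `fetch_telemetry` are hypothetical stand-ins, not STAR's actual code; the prompt wording is illustrative):

```python
# Minimal sketch of the four-step shape. fetch_telemetry and llm_complete are
# hypothetical stand-ins for a Datadog-style client and a chat-completion API.
from concurrent.futures import ThreadPoolExecutor

def fetch_telemetry(service: str, signal: str, window: str) -> str:
    """Step-1 stand-in: query the metrics store for one signal class."""
    return f"<{signal} telemetry for {service} over {window}>"

def llm_complete(prompt: str) -> str:
    """Stand-in for any chat-completion client."""
    return f"<analysis of: {prompt[:40]}...>"

# Step-2 registry: one narrow role + prompt per signal class (illustrative wording).
ANALYZER_PROMPTS = {
    "traffic":    "You are a traffic analyst. Interpret inbound/outbound rates and errors:\n{data}",
    "latency":    "You are a latency analyst. Interpret per-protocol latencies:\n{data}",
    "saturation": "You are a saturation analyst. Interpret container CPU and memory:\n{data}",
    "kubernetes": "You are a Kubernetes analyst. Interpret restarts and probe failures:\n{data}",
    "jvm":        "You are a JVM heap analyst. Interpret heap usage and GC:\n{data}",
}

def investigate(service: str, window: str) -> str:
    # Step 1 -- deterministic fetch: one telemetry bundle per signal class.
    telemetry = {s: fetch_telemetry(service, s, window) for s in ANALYZER_PROMPTS}

    # Step 2 -- LLM fanout: N parallel calls, each scoped to one signal class.
    with ThreadPoolExecutor() as pool:
        analyses = dict(zip(ANALYZER_PROMPTS, pool.map(
            lambda s: llm_complete(ANALYZER_PROMPTS[s].format(data=telemetry[s])),
            ANALYZER_PROMPTS,
        )))

    # Step 3 -- one reasoning pass over the model's own intermediate outputs
    # (generated-knowledge prompting), not over raw telemetry.
    rca = llm_complete(
        "Given these per-signal analyses, state the most probable root cause "
        "and recommended actions:\n"
        + "\n".join(f"[{s}] {a}" for s, a in analyses.items())
    )

    # Step 4 -- deterministic formatting of insights + recommendations.
    return f"RCA for {service} ({window}):\n{rca}"
```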
Why this shape¶
- Per-signal roles beat one-shot prompts. Running a "JVM heap analyst" prompt on heap metrics and a "Kubernetes probe analyst" prompt on probe data separately gives better per-signal interpretation than a single "RCA engineer" prompt trying to reason over all metric classes at once; see concepts/role-prompting and the prompt sketch after this list.
- Aggregation step composes the intermediate outputs. The per-signal analyses are the "generated knowledge" in concepts/generated-knowledge-prompting — the final step reasons over the model's own intermediate work, not raw telemetry.
- Fanout scales independently. Adding a new signal class (logs, traces, business metrics) is a new step-2 entry, not a prompt rewrite.
- Deterministic ordering bounds failure modes. Not an agent loop; see patterns/static-prompt-chain-over-agent-loop.
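To make the first two bullets concrete, here is what one step-2 role prompt might look like. The source confirms only that STAR uses "AI models and domain-specific prompts and rules"; the scope and rules below are an illustrative guess, not STAR's actual prompt:

```python
# Hypothetical step-2 role prompt for the JVM analyzer. The narrow scope and
# encoded runbook rules are the point; the exact wording is not STAR's.
JVM_ANALYZER_PROMPT = """\
You are a JVM heap analyst for a production service.
You will see heap usage and GC metrics for one service over one time window.
Rules:
- Flag sustained heap growth that survives full GCs (possible leak).
- Flag GC pause spikes that coincide with latency degradation.
- If the signals look healthy, say so explicitly; do not speculate beyond heap/GC.
Output one finding per line, each with a severity and its supporting evidence.

Heap/GC telemetry:
{data}
"""
```

The aggregation step then reasons over these per-analyzer findings rather than raw telemetry, which keeps the step-3 prompt short and stable as signal classes are added.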
Canonical Expedia STAR signal coverage¶
| Step-2 analyzer | Signals fed in |
|---|---|
| Traffic analyzer | inbound + outbound rates, errors |
| Latency analyzer | HTTP + gRPC + GraphQL per-protocol latencies |
| Saturation analyzer | container CPU + memory |
| Kubernetes analyzer | container restarts, probe failures |
| JVM analyzer | heap usage, GC |
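Read as configuration, the table is a step-1 routing map: which queries feed each step-2 analyzer. A sketch with illustrative Datadog-style metric names (STAR's actual query set is not published):

```python
# The signal-coverage table as a step-1 routing config. Metric names are
# illustrative Datadog-style guesses, not STAR's actual queries.
SIGNALS = {
    "traffic":    ["inbound.request.rate", "outbound.request.rate", "error.rate"],
    "latency":    ["http.latency", "grpc.latency", "graphql.latency"],
    "saturation": ["container.cpu.usage", "container.memory.usage"],
    "kubernetes": ["container.restarts", "probe.failures"],
    "jvm":        ["jvm.heap.used", "jvm.gc.pause"],
}

# Adding a signal class (logs, traces, business metrics) is one new entry here
# plus one new analyzer prompt -- the aggregation step is untouched.
```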
STAR's infrastructure-first scope is deliberate: given "our heterogeneous tech stack and the higher degree of standardization at the infrastructure layer", infra-level signals generalise across services in a way service-specific business metrics don't.
Use cases beyond incident response¶
STAR reuses the same workflow shape for:
- Post-incident RCA — the same chain, run over a historical time window, produces an initial draft for a post-incident review.
- Troubleshooting runbooks — pre-documented runbooks (container restarts in Kubernetes, JVM heap spikes) are re-encoded as step-2 analyzer prompts.
- Performance optimization — a variant with longer time windows and tighter saturation analyzers.
- Failure-injection evaluation — complements concepts/chaos-engineering platforms with an automatic experiment-result evaluator.
One workflow shape, five applications. That reuse is what makes the fixed-chain approach economical — an agent loop that adapts per use case would need five sets of tool-selection + planning guardrails.
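In code terms, reuse means calling the same chain with different inputs. A sketch using the hypothetical `investigate` from above (service names and windows are illustrative):

```python
# One workflow shape, different invocations. Only the inputs vary;
# the chain itself is unchanged.
live_incident = investigate("checkout-svc", window="last 1h")     # incident investigation
postmortem    = investigate("checkout-svc", window="2026-04-27")  # post-incident RCA draft
perf_review   = investigate("checkout-svc", window="last 30d")    # performance variant, longer window
```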
Comparison — adjacent patterns¶
- patterns/oncall-analyzer (Meta Presto) — sibling at a per-alert altitude. Meta's analyzers tie heuristic signals together for a specific alert class; STAR is LLM-based and works over arbitrary service windows.
- patterns/director-expert-critic-investigation-loop (Slack Spear) — higher-altitude sibling. Slack's multi-agent loop for deeper investigation across hundreds of inference calls; STAR is the static-chain alternative when that complexity isn't yet justified.
- patterns/hub-worker-dashboard-agent-service — adjacent architectural altitude for agent-based dashboards.
Related¶
- concepts/automated-root-cause-analysis — the parent discipline this pattern realises with LLMs.
- concepts/prompt-chaining — the primitive the workflow is assembled from.
- concepts/role-prompting — per-step persona framing.
- concepts/generated-knowledge-prompting — the step-3 composition technique.
- concepts/time-to-know-vs-time-to-recover — the KPIs the workflow compresses.
- concepts/observability — the upstream capability the workflow consumes.
- patterns/static-prompt-chain-over-agent-loop — the generalised architectural posture.
- patterns/oncall-analyzer — non-LLM sibling (Meta).
- patterns/director-expert-critic-investigation-loop — higher-altitude LLM sibling (Slack Spear).
- systems/expedia-star — canonical wiki consumer.
Seen in¶
- Expedia STAR (2026-04-28) — canonical wiki instance. The four-step workflow is the headline architecture of STAR; five use cases (incident investigation, post-incident RCA, troubleshooting runbooks, performance optimization, failure-injection evaluation) all run through it.