
PATTERN

Multi-step RCA workflow

Problem. On-call engineers investigating a service degradation or outage need to (a) pull the relevant telemetry, (b) interpret each signal, (c) combine per-signal findings into a probable root-cause hypothesis, and (d) produce something actionable within the incident's time-to-resolution (TTR) budget. The work is cognitive, repetitive across incidents, and bounded by the engineer's recall of past incidents and the org's documented runbooks.

Solution. Encode the investigation as a fixed multi-step workflow that mixes deterministic data-fetch steps with LLM-based reasoning steps. Each step has a narrow scope (one metric, one signal class, one correlation); the final step aggregates per-step outputs into a single root-cause reasoning pass. Orchestration is code — the LLM is a reasoning component inside a deterministic pipeline.

Canonicalised by Expedia STAR (2026-04-28):

"Overall, STAR provides multi-step workflows, which are visualized below. In specific: 1. It collects telemetry data. 2. It analyzes these metrics and the associated metadata using AI models and domain-specific prompts and rules. 3. It aggregates all analyses and conducts a final root cause analysis. 4. It returns insights and recommendations." (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer)

The four-step shape

(1) collect telemetry ──▶ (2) per-signal analysis ──▶ (3) aggregated RCA ──▶ (4) format + return
    deterministic:           LLM: N calls in parallel,   LLM: 1 call that    deterministic:
    Datadog API              one role + prompt per       consumes (2)'s      insights +
    (metrics, metadata,      signal class (latency,      outputs as          recommendations
    Kubernetes, JVM)         saturation, Kubernetes,     generated knowledge
                             JVM, errors, traffic)       (see concepts/
                                                         generated-knowledge-
                                                         prompting)

Step 1 and step 4 are deterministic code; step 2 is an LLM fanout (N metric classes × one prompt each); step 3 is a single LLM reasoning pass.
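The step boundaries can be sketched as plain orchestration code. Everything below is a hypothetical sketch, not STAR's actual implementation: `fetch_telemetry`, `call_llm`, and the analyzer prompts are invented stand-ins. The point is the shape — control flow lives in code, and the LLM appears only inside steps 2 and 3.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the Datadog fetch and the LLM client.
def fetch_telemetry(service: str, window: str) -> dict:
    return {"latency": "...", "saturation": "...", "kubernetes": "...", "jvm": "..."}

def call_llm(prompt: str) -> str:
    return f"analysis of: {prompt[:40]}"

# One narrow role prompt per signal class (illustrative wording).
ANALYZER_PROMPTS = {
    "latency": "You are a latency analyst. Interpret these per-protocol latencies:\n{data}",
    "saturation": "You are a saturation analyst. Interpret container CPU/memory:\n{data}",
    "kubernetes": "You are a Kubernetes probe analyst. Interpret restarts and probes:\n{data}",
    "jvm": "You are a JVM heap analyst. Interpret heap usage and GC:\n{data}",
}

def run_rca(service: str, window: str) -> dict:
    # Step 1 — deterministic: pull telemetry for each signal class.
    telemetry = fetch_telemetry(service, window)

    # Step 2 — LLM fanout: one call per signal class, run in parallel.
    with ThreadPoolExecutor() as pool:
        analyses = dict(zip(
            ANALYZER_PROMPTS,
            pool.map(lambda k: call_llm(ANALYZER_PROMPTS[k].format(data=telemetry[k])),
                     ANALYZER_PROMPTS),
        ))

    # Step 3 — single LLM pass over step 2's outputs (the generated
    # knowledge), not over the raw telemetry.
    knowledge = "\n".join(f"[{k}] {v}" for k, v in analyses.items())
    root_cause = call_llm(
        f"Given these per-signal findings, state the probable root cause:\n{knowledge}")

    # Step 4 — deterministic: shape the output for the on-call engineer.
    return {"service": service, "per_signal": analyses, "root_cause": root_cause}
```

Because the orchestration is ordinary code, the step boundaries, retries, and timeouts are all testable without a model in the loop.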

Why this shape

  • Per-signal roles beat one-shot prompts. Running a "JVM heap analyst" prompt on heap metrics and a "Kubernetes probe analyst" prompt on probe data separately gives better per-signal interpretation than a single "RCA engineer" prompt trying to reason over all metric classes at once. See concepts/role-prompting.
  • Aggregation step composes the intermediate outputs. The per-signal analyses are the "generated knowledge" in concepts/generated-knowledge-prompting — the final step reasons over the model's own intermediate work, not raw telemetry.
  • Fanout scales independently. Adding a new signal class (logs, traces, business metrics) is a new step-2 entry, not a prompt rewrite.
  • Deterministic ordering bounds failure modes. Not an agent loop; see patterns/static-prompt-chain-over-agent-loop.

Canonical Expedia STAR signal coverage

Step-2 analyzer       Signals fed in
Traffic analyzer      inbound + outbound rates, errors
Latency analyzer      HTTP + gRPC + GraphQL per-protocol latencies
Saturation analyzer   container CPU + memory
Kubernetes analyzer   container restarts, probe failures
JVM analyzer          heap usage, GC
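Read as configuration, the table suggests why the step-2 fanout scales by addition: each analyzer is a (name, signal list) entry, and covering a new signal class means appending an entry rather than rewriting an existing prompt. The registry below is an illustrative encoding, not STAR's actual config.

```python
# Illustrative step-2 registry mirroring the published signal coverage.
STAR_ANALYZERS: dict[str, list[str]] = {
    "traffic":    ["inbound rate", "outbound rate", "errors"],
    "latency":    ["http latency", "grpc latency", "graphql latency"],
    "saturation": ["container cpu", "container memory"],
    "kubernetes": ["container restarts", "probe failures"],
    "jvm":        ["heap usage", "gc"],
}

def add_analyzer(registry: dict[str, list[str]], name: str, signals: list[str]) -> None:
    # Extending the fanout is an append, not a prompt rewrite.
    registry[name] = list(signals)

# Hypothetical extension: a logs analyzer joins the fanout.
add_analyzer(STAR_ANALYZERS, "logs", ["error log rate", "new log patterns"])
```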

STAR's infrastructure-first scope is deliberate — "our heterogeneous tech stack and the higher degree of standardization at the infrastructure layer" makes infra-level signals generalise across services in a way service-specific business metrics don't.

Use cases beyond incident-response

STAR reuses the same workflow shape for:

  • Post-incident RCA — the same chain, run over a historical time window, produces an initial draft for a post-incident review.
  • Troubleshooting runbooks — pre-documented runbooks (container restarts in Kubernetes, JVM heap spikes) are re-encoded as step-2 analyzer prompts.
  • Performance optimization — a variant with longer time windows and tighter saturation analyzers.
  • Failure-injection evaluation — complements concepts/chaos-engineering platforms with an automatic experiment-result evaluator.

One workflow shape, five applications. That reuse is what makes the fixed-chain approach economical — an agent loop that adapts per use case would need five sets of tool-selection + planning guardrails.

Comparison — adjacent patterns

Seen in

  • Expedia STAR (2026-04-28) — canonical wiki instance. The four-step workflow is the headline architecture of STAR; five use cases (incident investigation, post-incident RCA, troubleshooting runbooks, performance optimization, failure-injection evaluation) all run through it.