CONCEPT Cited by 3 sources
Agentic troubleshooting loop¶
Definition¶
An agentic troubleshooting loop is an investigation pattern in which an LLM is the planner and a narrow-surface tool assistant is the hands, iterating through a cycle of:

1. retrieve historical telemetry relevant to the user's query (vector search, log search, metric query),
2. LLM proposes the next diagnostic action (typically a platform-native read command — `kubectl describe`, `aws ecs describe-task`, `SELECT … FROM events`),
3. tool assistant executes the action, returning raw output,
4. LLM re-reads the output in context,
5. LLM decides: continue the investigation (emit another action) or conclude (synthesize a resolution),

repeating (2)–(5) until the LLM's stopping criterion fires.
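The five steps can be sketched as a single control loop. This is a minimal illustration, not the blueprint's implementation: `llm`, `retrieve`, and `run_tool` are hypothetical callables standing in for a real model API, a retrieval tier, and a tool assistant.

```python
MAX_ITERATIONS = 8  # safety net in case the LLM's stopping criterion never fires


def troubleshoot(query, retrieve, llm, run_tool):
    # Step 1: seed the conversation with retrieved historical telemetry.
    history = [
        f"User query: {query}",
        f"Retrieved telemetry:\n{retrieve(query)}",
    ]
    for _ in range(MAX_ITERATIONS):
        # Step 2: the planner LLM proposes the next diagnostic action,
        # returned here as a dict like {"action": "run", "command": "..."}
        # or {"action": "conclude", "resolution": "..."} (assumed shape).
        step = llm(history)
        if step["action"] == "conclude":
            return step["resolution"]          # Step 5: stopping criterion fired
        output = run_tool(step["command"])     # Step 3: execute, get raw output
        history.append(f"$ {step['command']}\n{output}")  # Step 4: re-read in context
    return "Iteration cap reached; escalate to a human."
```

The iteration cap makes the loop's termination a hard guarantee rather than a property of the model's judgment, which matters given the non-termination failure mode discussed below.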
The canonical wiki reference is the AWS EKS conversational-observability blueprint (sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications), which pairs stored telemetry (embedded in OpenSearch Serverless or S3 Vectors) with live kubectl output from an in-cluster troubleshooting assistant, and loops until the LLM judges it has enough context for resolution.
Why it's not just "RAG"¶
Pure RAG is single-shot: retrieve → augment prompt → generate answer once. An agentic loop differs on three axes:
- Multi-turn action-observation — each LLM turn can call a tool, read the result, and then call another tool. The investigation shape is closer to a REPL than to a Q&A.
- Real-time state, not just stored context — the tool assistant queries live system state (running pods, current ELB health, DB connection counts) that was never in the embedding corpus. The combination of stored telemetry + live state is the point.
- LLM-driven termination — the loop stops when the LLM decides it has enough context, not when a deterministic rule fires. This is both the pattern's power (adapts to investigation shape) and its failure mode (can loop uselessly, exceed token budgets, or stop too early).
See concepts/observability for the broader "agent-assisted debugging layer" above the metrics/logs/traces triad.
Structural components¶
- A planner LLM with access to (a) the user's query, (b) a tool-description schema listing what it can do, (c) retrieved telemetry context, (d) growing conversation history.
- A retrieval tier — vector search over embedded telemetry, log search, metric query — returning relevant historical signal.
- A tool assistant with a constrained action surface — typically implementing patterns/allowlisted-read-only-agent-actions so the loop cannot mutate production state.
- A context budget manager — the concepts/agent-context-window fills up with tool outputs; long investigations need summarization, sliding windows, or tool-output truncation.
- A termination contract — explicit prompt asking the LLM "do you have enough information to conclude?" after each tool execution; iteration cap as a safety net.
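The constrained action surface from the component list can be made concrete with a prefix allowlist. This is a hedged sketch of one way to implement patterns/allowlisted-read-only-agent-actions, not the blueprint's mechanism; the command prefixes shown are illustrative examples of read-only operations.

```python
import shlex

# Illustrative allowlist of read-only, platform-native command prefixes.
# Anything not matching a prefix is rejected back to the planner.
READ_ONLY_PREFIXES = [
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("aws", "ecs", "describe-tasks"),
]


def is_allowed(command: str) -> bool:
    """True only if the command starts with an allowlisted read-only prefix."""
    argv = tuple(shlex.split(command))
    return any(argv[: len(prefix)] == prefix for prefix in READ_ONLY_PREFIXES)
```

Rejections should be surfaced to the LLM as ordinary observations ("command rejected by allowlist") so the loop can recover, which is the graceful-handling requirement named under the hallucinated-commands failure mode.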
Shapes in production¶
RAG-orchestrated variant. Chatbot app drives the loop: runs retrieval, calls LLM, parses LLM's suggested commands, routes them to the assistant, reads output, re-prompts the LLM. Simple, all orchestration code is in the chatbot. AWS reference-architecture blueprint uses this shape.
MCP / agentic variant. LLM-centric agents (using MCP or similar) discover tools dynamically, call them in any order, and the orchestration lives in the agent framework. Strands Agents SDK + EKS MCP Server is the AWS example; often paired with patterns/specialized-agent-decomposition (e.g. Agent Orchestrator + Memory Agent + K8s Specialist). Less orchestration code, more protocol ceremony.
AWS-managed variant. AWS DevOps Agent adds a two-path resource-discovery step (concepts/telemetry-based-resource-discovery), baseline learning, and confidence-scored root-cause ranking. The iterative loop is still underneath, but the framing shifts from "the LLM decides when to stop" to "the agent produces ranked RCA candidates with evidence".
Named failure modes¶
- Loop non-termination — LLM keeps proposing one more command; iteration cap required.
- Context exhaustion — raw kubectl outputs can be dozens of KB each; the conversation blows through the window. Mitigations: summarize prior turns, truncate large outputs, use a smaller focused retrieval query.
- Wrong tool choice — larger tool inventories lower selection accuracy (concepts/tool-selection-accuracy); specialized agent decomposition helps.
- Hallucinated commands — LLM proposes a command the allowlist rejects; the loop has to handle rejection gracefully without spiraling.
- Stale-vs-live signal conflict — stored telemetry says the pod is healthy, live kubectl says `CrashLoopBackOff`; reconciling these requires the LLM to prefer live state, which is prompt discipline, not structural enforcement.
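The context-exhaustion mitigation (truncate large tool outputs before they enter the conversation) can be sketched as a head-plus-tail cut. The size limit here is an arbitrary illustration, not a number from the source; real budgets would be set in tokens against the model's window.

```python
MAX_OUTPUT_CHARS = 4_000  # illustrative per-output budget, not from the source


def truncate_output(raw: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Keep the head and tail of an oversized tool output, marking the cut.

    Head and tail are kept because kubectl-style outputs often carry the
    identifying fields at the top and the recent events/errors at the bottom.
    """
    if len(raw) <= limit:
        return raw
    head = raw[: limit // 2]
    tail = raw[-(limit // 2):]
    omitted = len(raw) - limit
    return f"{head}\n…[{omitted} chars truncated]…\n{tail}"
```

A summarization pass over prior turns is the complementary mitigation for long investigations, trading fidelity of old observations for room to keep iterating.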
Relationship to other concepts¶
- concepts/observability — this is the "agent-assisted debugging layer" instantiated as an investigation loop.
- concepts/telemetry-based-resource-discovery — a more structured discovery methodology that plugs into the first retrieval step of the loop when the investigation is topology-dependent.
- concepts/agent-context-window — the hard scarce resource inside the loop; every tool output eats it.
- patterns/allowlisted-read-only-agent-actions — the canonical safety discipline around the loop's action surface.
- patterns/telemetry-to-rag-pipeline — how the stored-telemetry substrate the loop retrieves from is built.
- patterns/specialized-agent-decomposition — how the loop's internal state is partitioned across multiple agents when tool inventory grows.
Seen in¶
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications — the canonical wiki reference. Two deployments of the loop (RAG-orchestrated + Strands + MCP). Stopping criterion: "Based on the output, the chatbot asks LLM to decide whether to continue investigation (by asking the agent to run more commands), or whether it has enough context to produce an answer."
- sources/2025-04-10-flyio-30-minutes-with-mcp-and-flyctl — compact CLI-MCP-driven instantiation. Fly.io's 90-LoC flymcp exposes just two tools (`fly logs`, `fly status`) over MCP stdio. Pointed at unpkg, Claude reconstructs the 10-Machine regional topology, flags 2 machines in critical status, correlates `oom_killed: true` events, pulls logs on follow-up, and produces a per-second incident timeline (OOM kill → SIGKILL → reboot → health-check fail → listener up → health-check pass, ~43s end-to-end; Bun process at ~3.7 GB of 4 GB allocated). Demonstrates the loop works with a minimal tool surface (patterns/tool-surface-minimization) when the underlying CLI has good `--json` output. Ptacek: "annoyingly useful … faster than I find problems in apps." Sits downstream of patterns/wrap-cli-as-mcp-server and surfaces the concepts/local-mcp-server-risk posture as a structural concern about the loop's substrate, not the loop itself.
- sources/2025-05-07-flyio-provisioning-machines-using-mcps — loop extended into provisioning hygiene. Ruby's 2025-05-07 post adds a mutation-side instance: "I asked for a list of volumes for an existing app, and Claude noted that I had a few volumes that weren't attached to any machines. So I asked it to delete the oldest unattached volume, and it did so." The agent surfaces a resource-hygiene finding the human didn't specifically ask about; the human reacts with a mutation. Same planner-executor shape as the 2025-04-10 observability loop, but with write authority. Pairs with concepts/natural-language-infrastructure-provisioning + patterns/cli-safety-as-agent-guardrail.