
PATTERN Cited by 2 sources

Destructive operation confirmation as agent guardrail

A CLI tags destructive commands in its machine-readable catalog and rejects any invocation of a destructive command that lacks an explicit confirmation flag. The agent cannot run a destructive command by accident: it has to emit the confirmation flag, a plan-time choice it must actively make and one a human reviewer can see in the agent's trace.

When to use this pattern

  • A CLI exposes commands whose effects are hard to reverse (delete, drop, force-push, reset, destroy, silence-alert, delete-slo, drop-synthetic-check).
  • Agents drive the CLI.
  • The cost of an accidental destructive action is high enough that a zero-cost tripwire is worth the minor friction.

The verbatim canonical statement

From the gcx launch post:

"It will find commands that result in destructive operations, which require explicit confirmation to reduce agent mistakes."

Paired with the machine-readable catalog: the CLI knows which commands are destructive, and the agent has to read that knowledge from the catalog and act accordingly.
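To make the catalog side concrete, here is a minimal Python sketch of how an agent-side planner might consume a machine-readable catalog. It assumes a JSON catalog with a per-command `destructive` boolean and a hypothetical `--confirm` flag; neither the schema nor the flag name comes from gcx. The planner does not append the flag automatically: it adds it only once the destructive step has been approved, which keeps the choice explicit.

```python
import json
from typing import Optional

# Hypothetical catalog excerpt: the `destructive` field mirrors the tag
# described above, but the command names and JSON layout are illustrative
# and are not gcx's actual catalog schema.
CATALOG_JSON = """
{
  "commands": [
    {"name": "slo list",      "destructive": false},
    {"name": "slo delete",    "destructive": true},
    {"name": "alert silence", "destructive": true}
  ]
}
"""

CONFIRM_FLAG = "--confirm"  # hypothetical flag name


def load_catalog(raw: str) -> dict:
    """Map each command name to its destructive tag (absent tag -> False)."""
    data = json.loads(raw)
    return {c["name"]: bool(c.get("destructive", False)) for c in data["commands"]}


def plan_invocation(catalog: dict, command: str, args: list, approved: bool) -> Optional[list]:
    """Return the argv the agent plans to run, or None when the command is
    destructive and has not been approved, so the plan stops instead of
    silently adding the flag. Unknown commands raise KeyError."""
    argv = command.split() + list(args)
    if catalog[command]:
        if not approved:
            return None
        argv.append(CONFIRM_FLAG)  # the explicit, visible plan-time choice
    return argv
```

For example, `plan_invocation(load_catalog(CATALOG_JSON), "slo delete", ["checkout-latency"], approved=True)` returns `["slo", "delete", "checkout-latency", "--confirm"]`, while the same call with `approved=False` returns `None`.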

Mechanism

  1. Catalog-time tagging. Every mutating command whose effects can't be trivially undone carries a destructive: true (or equivalent) field in the machine-readable catalog.
  2. Run-time gate. On invocation, if the command is tagged destructive and the confirmation flag is absent, the CLI exits non-zero with a documented error shape (concepts/exit-code-semantics) indicating that confirmation is required.
  3. Plan-level agent response. The agent, informed by the catalog's destructive: true, emits the confirmation flag up front, so the destructive action becomes a plan-time choice rather than a silent runtime consequence.
  4. Human-visible trace. Because the confirmation flag is on the command line, it appears in the agent's transcript, and reviewers can audit every destructive action taken.
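Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not gcx's implementation: the `--confirm` flag, the exit code `3`, and the JSON error fields are all hypothetical placeholders for the "documented error shape" the pattern calls for.

```python
import json
import sys

# Placeholders: the pattern calls for a non-zero exit with a documented
# "confirmation required" error shape, but the flag name and exit code
# below are hypothetical, not values taken from gcx.
CONFIRM_FLAG = "--confirm"
EXIT_CONFIRMATION_REQUIRED = 3

# Catalog-time tagging: one metadata field per command.
CATALOG = {
    "slo list": {"destructive": False},
    "slo delete": {"destructive": True},
}


def gate(command: str, argv: list) -> int:
    """Run-time gate: return 0 if the invocation may proceed; otherwise
    write a machine-readable error to stderr and return the exit code."""
    if CATALOG[command]["destructive"] and CONFIRM_FLAG not in argv:
        error = {
            "error": "confirmation_required",
            "command": command,
            "hint": f"re-run with {CONFIRM_FLAG}",
        }
        print(json.dumps(error), file=sys.stderr)
        return EXIT_CONFIRMATION_REQUIRED
    return 0
```

Because the check consults only the tag and the argv, it needs no per-command logic: a CLI would call something like `gate` once in its dispatcher, before any command handler runs, and pass a non-zero result straight to `sys.exit`.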

Why it's zero-cost

The guardrail is purely catalog + flag:

  • No runtime permission model.
  • No per-command custom logic.
  • No prompt-engineering in the LLM.

The catalog is already the tool's authoritative surface description; tagging destructive commands is one metadata field per entry. Enforcement is a single check at invocation time.

Composition with the larger CLI-safety picture

Layer | Where it sits
--- | ---
Agent-level prompt: "be careful" | Unreliable: prompts aren't control (concepts/structured-output-reliability)
Read-only-agent-tool (patterns/allowlisted-read-only-agent-actions) | Hard: narrow the tool to non-mutating commands only
CLI-safety-as-agent-guardrail (patterns/cli-safety-as-agent-guardrail) | Narrow: Fly.io's mutation-MCP split
Destructive-op confirmation (this pattern) | Zero-cost tripwire on the full CLI: you can still run mutating commands, but only with an explicit flag
Human approval loop | Highest friction: reserve for irreversible infra ops

The patterns stack: an org can use destructive-op confirmation as the baseline layer on every CLI, add mutation-MCP separation for agent-driven flows, and reserve human approval for the highest-stakes operations.

Relationship to Fly.io's flyctl canonical

The Fly.io canonical provides mutation-MCP separation — mutating commands live in a separate MCP server that the agent must explicitly opt into. That's a stronger-invariant shape at the MCP-surface altitude.

The Grafana gcx approach is the direct-CLI altitude complement: the commands all live on one binary, but the catalog + flag requirement provides the same net effect as mutation-MCP separation for agents that drive the CLI directly rather than through an MCP server. The two patterns are complementary, not alternatives.

Tradeoffs

  • Annotation overhead. Every new destructive command has to be tagged; the cost scales with command count but is trivial per command.
  • False negatives (missed tags). A mutating command that doesn't carry the tag can be run without confirmation. This is a catalog-completeness burden on the tool maintainers.
  • False positives (over-tagging). Tagging every mutating command — including ones that are trivially reversible like renames — adds friction without proportional safety benefit. The "destructive" boundary is a design choice, not a mechanical one.
  • Agent policy gap. The agent still decides whether to emit the confirmation flag. An agent configured to auto-confirm everything neutralises the guardrail; the zero-cost shape is necessary but not sufficient.
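The false-negative risk can be narrowed, though not eliminated, with a maintainer-side check run in CI. The sketch below assumes the catalog is available as a dict of command name to metadata and uses a hypothetical verb heuristic drawn from the examples under "When to use this pattern". It will miss destructive commands with innocuous names and says nothing about over-tagging, so it supplements human review of the catalog rather than replacing it.

```python
import re

# Heuristic verb list drawn from the examples in "When to use this pattern";
# it is an assumption for illustration, not a rule the pattern prescribes.
DESTRUCTIVE_VERBS = ("delete", "drop", "destroy", "reset", "force-push", "silence")


def untagged_suspects(catalog: dict) -> list:
    """Return catalog commands whose names contain a destructive verb as a
    whole word but that are not tagged destructive (sorted by name)."""
    suspects = []
    for name, meta in sorted(catalog.items()):
        tagged = bool(meta.get("destructive", False))
        # Word boundaries keep "preset" from matching "reset".
        looks_destructive = any(
            re.search(rf"\b{re.escape(verb)}\b", name) for verb in DESTRUCTIVE_VERBS
        )
        if looks_destructive and not tagged:
            suspects.append(name)
    return suspects
```

A CI job could fail whenever `untagged_suspects` returns a non-empty list, forcing a maintainer to either add the tag or consciously whitelist the command.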

Seen in
