
PATTERN Cited by 1 source

CLI safety as agent guardrail

Pattern

When wrapping a CLI as an MCP server (patterns/wrap-cli-as-mcp-server) and exposing mutating operations to an LLM agent, rely on the CLI's existing human-operator refusal invariants as the authorization boundary — instead of building an agent-specific policy layer at the MCP tier. The CLI already knows how to refuse unsafe operations ("can't destroy a mounted volume," "can't delete a non-empty bucket," "can't modify a resource that's locked," "can't downscale below minimum replicas during an outage"). Those refusals were authored to protect a human from fat-fingering in a terminal; they protect the agent user identically.

Canonical wiki statement

Sam Ruby, Fly.io, 2025-05-07:

"Since this support is built on flyctl, I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command."

(Source: sources/2025-05-07-flyio-provisioning-machines-using-mcps.)

Load-bearing: "Knowing that gave me the confidence to try the command." The CLI's pre-existing invariant is what let Ruby expose fly volumes destroy via MCP without bolting on an MCP-tier confirmation layer.

Why it's a valuable shape

Three properties of CLIs make this pattern cheap:

  1. Refusals are already implemented. Every production CLI has years of accumulated "don't let the operator shoot themselves in the foot" checks — typically invariants that Support / SRE escalations taught the CLI team to enforce. These invariants are exactly the ones an agent user also needs.
  2. Refusals are authored by humans who understood the domain. An MCP-tier guardrail author would need to re-derive them — "can a volume be destroyed while mounted? I think not, let me check" — duplicating work.
  3. Refusals are enforced at the right layer. The CLI sits between the agent and the cloud API. Even if the MCP server is compromised, prompt-injected, or misconfigured, the CLI's invariants still hold, because they are checks exercised against the underlying API through flags and state. The guardrail is below the wrapper, not embedded in it.

Paired with structured-output reliability

This pattern is the mutation-side twin of concepts/structured-output-reliability, the read-side observation that "our 2020 decision to give flyctl --json mode became load-bearing for MCP in 2025." The mutation-side mirror is "our CLI's decade-old refusal-to-destroy-mounted-volumes invariant becomes load-bearing for mutation-authority MCP in 2025." Both are cases where mature CLI design pays an AI-integration dividend the original authors never intended.
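The read-side half can be sketched as below, assuming a Python wrapper. `read_json_tool` is a hypothetical helper name, and the sketch assumes only that the wrapped CLI prints a single JSON document on stdout when its JSON mode is requested; it is not taken from flyctl or from any MCP SDK.

```python
import json
import subprocess
from typing import Any, Sequence


def read_json_tool(argv: Sequence[str]) -> Any:
    """Run a read-only CLI command and parse its stdout as JSON.

    Hypothetical helper: the caller supplies the full argv, including
    whatever flag switches the CLI into JSON output.
    """
    proc = subprocess.run(
        list(argv),
        stdin=subprocess.DEVNULL,  # the subprocess never gets to prompt
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Surface the CLI's own error text instead of guessing at a cause.
        raise RuntimeError(proc.stderr.strip() or f"exit code {proc.returncode}")
    # json.loads raises json.JSONDecodeError if the output is not valid JSON,
    # so a human-formatted table fails loudly rather than being mis-parsed.
    return json.loads(proc.stdout)
```

The point of the sketch is that the wrapper does no screen-scraping: either the CLI hands back structured data, or the call fails visibly.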

Pattern elements

  1. Inherit the CLI's refusal logic unchanged. The MCP wrapper shells out to the CLI; exit code + stderr carry the refusal back to the agent; the agent reports it to the human.
  2. Surface refusals as tool-call failures, not silently retried errors. The MCP server should not try to "work around" an invariant (e.g. "let me umount then retry destroy") — that collapses the safety property. A refusal is a signal the agent should report to the human.
  3. Don't add MCP-tier --force flags. Resist the urge to expose bypasses. The invariant's whole value is that it holds under all callers; adding a force flag at the MCP tier reintroduces the failure mode the CLI invariant prevents.
  4. Rely on the CLI's pre-confirmation prompts sparingly. Some CLIs implement interactive confirmation ("are you sure? [y/n]") that an MCP subprocess can't satisfy without a flag (-y, --confirm, --no-prompt). The wrapper needs a policy on whether to pass the confirmation flag: passing it collapses the confirmation gate; not passing it breaks the tool. Better: prefer CLIs where safety is invariant-based (exit-with-error) rather than prompt-based (ask-human-at-stdin).
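The four elements above can be sketched as a thin wrapper, assuming Python. `run_cli_tool`, `BYPASS_FLAGS`, and the shape of the result dictionary are hypothetical names chosen for illustration, not part of flyctl or any MCP SDK. The flag set mirrors the examples in element 4 plus `--force` and would need to match the CLI actually being wrapped. The sketch also picks one of the two policies from element 4 (never forward a confirmation flag), which is an assumption rather than a recommendation from the source.

```python
import subprocess
from typing import Sequence

# Hypothetical set of bypass/confirmation flags the wrapper refuses to forward.
BYPASS_FLAGS = {"--force", "-y", "--confirm", "--no-prompt"}


def run_cli_tool(argv: Sequence[str], timeout: float = 60.0) -> dict:
    """Run exactly one CLI invocation and report the outcome.

    No retries and no workarounds: a non-zero exit is returned to the
    caller as a tool-call failure carrying the CLI's own stderr.
    """
    blocked = [arg for arg in argv if arg in BYPASS_FLAGS]
    if blocked:
        return {"is_error": True, "exit_code": None, "stdout": "",
                "stderr": f"wrapper does not forward bypass flags: {blocked}"}
    try:
        proc = subprocess.run(
            list(argv),
            stdin=subprocess.DEVNULL,  # interactive prompts cannot be answered
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"is_error": True, "exit_code": None, "stdout": "",
                "stderr": f"command timed out after {timeout} seconds"}
    return {
        "is_error": proc.returncode != 0,  # the refusal travels as exit code + stderr
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Because the wrapper contains no domain logic of its own, there is nothing in it to keep in sync with the CLI: whatever the CLI refuses, the agent sees as a failed call.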

What this pattern does NOT cover

The flyctl-level "can't destroy a mounted volume" invariant answers the question "is this operation safe right now?" — not the question "is this what the user actually intended?" A prompt injection that redirects the agent to destroy the wrong unattached volume still succeeds; the CLI invariant doesn't know which volume the user meant.
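The gap can be made concrete with a toy model, assuming Python. `ToyCloud`, `Volume`, and `Refused` are hypothetical stand-ins for illustration and do not reflect how flyctl or the Fly.io API are implemented; the only behaviour borrowed from the source is that a mounted volume cannot be destroyed.

```python
from dataclasses import dataclass, field
from typing import Dict


class Refused(Exception):
    """Raised when a state invariant blocks the operation."""


@dataclass
class Volume:
    name: str
    mounted: bool = False


@dataclass
class ToyCloud:
    volumes: Dict[str, Volume] = field(default_factory=dict)

    def destroy(self, name: str) -> None:
        vol = self.volumes[name]
        # The invariant inspects current state only; it has no notion of
        # which volume the user actually meant.
        if vol.mounted:
            raise Refused(f"{name} is mounted")
        del self.volumes[name]


if __name__ == "__main__":
    cloud = ToyCloud({
        "db_data": Volume("db_data", mounted=True),
        "old_backup": Volume("old_backup"),
        "new_backup": Volume("new_backup"),
    })
    try:
        cloud.destroy("db_data")      # blocked by the invariant
    except Refused as exc:
        print("refused:", exc)
    cloud.destroy("new_backup")       # user meant old_backup; nothing stops this
    print(sorted(cloud.volumes))
```

The first call is stopped because the state makes it unsafe; the second goes through because, from the invariant's point of view, destroying any unattached volume is equally acceptable.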

Intent-confirmation is a different layer:

  • patterns/plan-then-apply-agent-provisioning: present the mutation plan first, gate on human approval.
  • concepts/elicitation-gate: per-tool-call approval dialogue.
  • patterns/allowlisted-read-only-agent-actions: drop mutations entirely and leave the read-only surface.

  • CLIs designed to be both human-ergonomic and agent-ergonomic (concepts/agent-ergonomic-cli) are the natural substrate for this pattern. Cloudflare's 2026-04 cf CLI is explicitly designed with agent ergonomics as a primary concern; Fly.io's flyctl arrived there by accident (2020 --json decision + pre-existing refusal invariants).
  • The pattern is not sufficient as a sole safety mechanism; the mutation-MCP posture still carries the workstation-local credential-inheritance risk, and any gap in the CLI's invariants is a direct attack surface.

Seen in
