Fly.io — 30 Minutes With MCP and flyctl

Summary

Thomas Ptacek's internal message turned blog post on building the "most basic" MCP server for flyctl (flymcp) in 30 minutes. The server exposes two flyctl commands, fly logs and fly status, as MCP tools over MCP's stdio transport, using the mark3labs/mcp-go library; the whole implementation is "like 90 lines of code." The load-bearing precondition is a 2020 Fly.io design decision: most flyctl commands have a --json mode "to make them easier to drive from automation" (concepts/structured-output-reliability). The author points Claude at unpkg, the globally distributed CDN for the npm package registry, which runs on Fly. Within a few prompts Claude recites the 10-Machine regional topology, flags two machines in critical status, correlates them with oom_killed: true event history, pulls logs for a critical machine, and produces a per-second incident timeline culminating in "a surge in requests or memory usage that pushed it over" a ~3.7 GB memory ceiling against 4 GB allocated. Ptacek's take: "annoyingly useful … faster than I find problems in apps." Closing caveat: a local MCP server that runs native binaries on the operator's workstation is a risky capability-escalation posture. Ptacek considers fly logs and fly status safe, but would prefer to run flyctl inside an isolated environment (patterns/disposable-vm-for-agentic-loop).

Key takeaways

  1. CLI-as-MCP-server is a tiny wrapper. "Because someone already wrote a really good Go MCP library, this whole thing is like 90 lines of code." Canonical wiki instance of patterns/wrap-cli-as-mcp-server. The MCP server just captures flyctl stdout; the LLM does the reasoning.
  2. Prior --json investment is load-bearing. "We may have gotten a little lucky, because we made a decision back in 2020 to give most of our flyctl commands a json mode to make them easier to drive from automation. I use that in my MCP server. I don't know how much of a difference it made." A five-year-old automation-friendliness decision became an AI-integration-readiness decision. Extends concepts/structured-output-reliability and the concepts/agent-ergonomic-cli framing.
  3. stdio MCP transport is the path of least resistance. Ptacek chose MCP's stdio transport because "I'm lazy": the MCP server launches flyctl, captures its output, and returns it to the LLM. There are no HTTP/SSE session-affinity concerns (concepts/mcp-long-lived-sse) and no auth layer; the whole server inherits the operator's existing flyctl credentials.
  4. Two-tool surface is deliberately minimal. Only fly logs and fly status are exposed. Sibling framing to patterns/tool-surface-minimization and patterns/allowlisted-read-only-agent-actions: read-only observability primitives mean the blast radius of LLM hallucination is bounded to "wrong conclusion about the state of my app", not "machine got destroyed."
  5. Fast, useful, emergent incident diagnosis. Pointed at unpkg, Claude produced without further prompting: (a) the global topology (10 Machines across lax/atl/ewr/lhr/cdg/ams/sin/nrt/hkg/bog/syd: 3 NA, 3 EU, 3 Asia, 1 SA, 1 Oceania), (b) a criticality classification ("2 machines are in 'critical' status: One in ewr (Newark) with 'context deadline exceeded' error; One in atl (Atlanta) with 'gone' status"), and (c) oom_killed event correlation. Then, on "try getting logs for one of the critical machines", it pulled the kernel OOM kill line ("Out of memory: Killed process 641 (bun) total-vm:85950964kB, anon-rss:3744352kB, file-rss:12032kB, shmem-rss:0kB, UID:0 pgtables:16908kB oom_score_adj:0") and the full recovery timeline (SIGKILL → reboot: Restarting system → health check fail → listener up → health check pass at 20:47:46). Canonical instance of concepts/agentic-troubleshooting-loop driven by a two-tool CLI-MCP surface.
  6. Quantified incident fact disclosed: the Bun process was consuming "about 3.7 GB of memory (out of the 4 GB allocated)" when OOM-killed; the total recovery window in the log was ~43 seconds (20:47:03 OOM kill → 20:47:46 healthy).
  7. Local MCP server is a scary security shape. "Local MCP servers are scary. I don't like that I'm giving a Claude instance in the cloud the ability to run a native program on my machine. I think fly logs and fly status are safe, but I'd rather know it's safe. It would be, if I was running flyctl in an isolated environment and not on my local machine." Canonical wiki statement of concepts/local-mcp-server-risk; Fly.io's own patterns/disposable-vm-for-agentic-loop (2025-02-07 VSCode-SSH post) is the natural answer.

Operational numbers

  • Code size: ~90 lines of Go for the MCP server.
  • Tool count: 2 (fly logs, fly status).
  • Build time: ~30 minutes wall-clock.
  • Transport: MCP stdio (not HTTP/SSE).
  • unpkg topology observed: 10 Fly Machines, 11 regions named (lax, atl, ewr, lhr, cdg, ams, sin, nrt, hkg, bog, syd). App platform version: machines. Runtime: Bun.
  • Memory budget: 4 GB allocated; Bun process at ~3.7 GB before kill. Kernel RSS numbers: anon-rss:3744352kB = ~3.57 GiB resident anonymous memory.
  • OOM recovery window: 20:47:03 → 20:47:46 = 43 s end-to-end (OOM kill → reboot → listener up → health-check pass).
  • Prior-decision vintage: flyctl --json mode decision = 2020 (5 years earlier than the post).

Caveats

  • Opinion post, not an architectural deep-dive. Load-bearing architectural content is the shape of the wrap + the precondition of --json mode — not internals of MCP or flyctl.
  • No p50/p95 for flymcp tool-call latency; no token-budget accounting on Claude's side.
  • No comparison with alternative MCP wrap targets (kubectl, aws, gcloud, gh) that also have JSON modes — the pattern is clearly generalisable but the post doesn't claim generality.
  • The unpkg incident diagnosis is convincing but single-case; Claude's narrative accuracy on novel incidents (not OOM) is not measured.
  • Stdio transport elides every interesting MCP-in-production concern (multitenancy, session affinity, auth, rate limits); those only show up once the MCP server runs somewhere other than the operator's shell.
  • "Local MCP server is scary" is a hand-waved posture statement; Ptacek does not propose a specific sandbox design in this post (the VSCode-SSH-bananas post from two months earlier is the companion piece sketching disposable-VMs-for-agents).
  • Tier-3 source under strict scope filter; admitted into the wiki on the strength of the wrap-CLI-as-MCP pattern, the --json-mode-as-precondition framing, the concrete incident-diagnosis demonstration, and the local-MCP-risk posture.
