
SYSTEM Cited by 3 sources

NATS

NATS (nats.io) is an open-source messaging system — publish / subscribe + request / reply + queue groups, with an emphasis on being fast, simple, and lightweight. The wiki tracks it where it appears as an internal-messaging bus in ingested posts.

At Fly.io (2022–2024)

Fly.io was heavily invested in NATS internally in 2022 and later migrated control-plane traffic off it. The 2024-03-12 JIT WireGuard post describes the move away from NATS directly:

"NATS is fast, but doesn't guarantee delivery. Back in 2022, Fly.io was pretty big on NATS internally. We've moved away from it. For instance, our internal flyd API used to be driven by NATS; today, it's HTTP. Our NATS cluster was losing too many messages to host a reliable API on it." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)

The specific failure mode on the WireGuard path

In the push-based peer-provisioning flow, the Fly GraphQL API forwarded every new peer config to the appropriate gateway over NATS. When NATS lost the RPC, flyctl would receive its peer config back from the API — implying the peer was installed — but the gateway had never heard of the peer, so the WireGuard connection would stall.

The fix was two-pronged:

  1. Architectural. Stop pushing configs at all — have the gateway pull them from the API on handshake arrival. Canonical patterns/pull-on-demand-replacing-push instance; sidesteps the delivery-guarantee problem entirely because the pull is the thing that needs to happen before the next step anyway.
  2. Tactical. Migrate individual internal RPCs off NATS onto HTTP (named example in the post: flyd). Retire NATS from the critical path.
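
The contrast between the lossy push and the pull on handshake can be sketched as a toy model. This is a minimal illustration under stated assumptions, not Fly.io's code: the names Api, PushGateway, PullGateway, and lossy_push are hypothetical, the transport is a plain function call with a scripted drop, and peer configs are opaque strings.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Api:
    """Hypothetical stand-in for the control-plane API (source of truth for peers)."""
    peers: Dict[str, str] = field(default_factory=dict)

    def create_peer(self, pubkey: str, config: str) -> str:
        self.peers[pubkey] = config
        return config  # the client sees success once the API has stored the peer

    def lookup_peer(self, pubkey: str) -> Optional[str]:
        return self.peers.get(pubkey)


@dataclass
class PushGateway:
    """Gateway that only knows the peers that were pushed to it."""
    installed: Dict[str, str] = field(default_factory=dict)

    def on_push(self, pubkey: str, config: str) -> None:
        self.installed[pubkey] = config

    def on_handshake(self, pubkey: str) -> bool:
        return pubkey in self.installed


@dataclass
class PullGateway:
    """Gateway that asks the API for an unknown peer when a handshake arrives."""
    api: Api
    installed: Dict[str, str] = field(default_factory=dict)

    def on_handshake(self, pubkey: str) -> bool:
        if pubkey not in self.installed:
            config = self.api.lookup_peer(pubkey)
            if config is None:
                return False  # the API has never heard of this peer either
            self.installed[pubkey] = config
        return True


def lossy_push(gateway: PushGateway, pubkey: str, config: str, delivered: bool) -> None:
    """At-most-once transport (assumption: the drop is scripted by the caller)."""
    if delivered:
        gateway.on_push(pubkey, config)


api = Api()
config = api.create_peer("pk1", "cfg1")

push_gw = PushGateway()
lossy_push(push_gw, "pk1", config, delivered=False)  # the message is lost in transit
push_ok = push_gw.on_handshake("pk1")   # gateway never learned the peer

pull_gw = PullGateway(api)
pull_ok = pull_gw.on_handshake("pk1")   # gateway fetches the peer on demand
```

In this model the push path fails silently because the client already holds a config while the gateway has no record of it. The pull path has no message to lose: the lookup happens on the handshake itself, at the moment the gateway needs the peer.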

NATS JetStream as a Litestream replica type (2025-10-02)

A separate wiki role surfaced by the 2025-10-02 Litestream v0.5.0 shipping post (sources/2025-10-02-flyio-litestream-v050-is-here): NATS JetStream — NATS core's persistent + at-least-once layer, distinct from the at-most-once core-NATS that Fly.io retired — is now a first-class replica type for Litestream alongside S3 / GCS / Azure Blob:

"We've also added a replica type for NATS JetStream. Users that already have JetStream running can get Litestream going without adding an object storage dependency."

The framing is interesting: JetStream's persistence + at-least-once + replay semantics cover the same semantic surface that object-store conditional writes give to the CASAAS lease (patterns/conditional-write-lease) for users whose cluster already runs JetStream as a coordination / streaming substrate. This is the first wiki instance of NATS JetStream specifically (as distinct from the core-NATS retirement datapoints that dominate the rest of this page); the NATS-as-archive-sink role is a different value proposition from the NATS-as-RPC-transport role Fly retired.

Delivery semantics

NATS core is at-most-once — no built-in acknowledgement, no retry, no persistence between publisher and consumer. This is the property that bit Fly.io's WireGuard provisioning. NATS JetStream (a separate, opt-in layer) provides at-least-once + persistence, but Fly.io's 2022-era deployment was on core NATS. The lesson the post draws is simpler than "should have used JetStream" — the lesson is don't push state you'll have to pull anyway.
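
The difference between the two delivery modes can be shown with a small simulation. This is a sketch, not the NATS or JetStream client API: the function names are hypothetical, and the network is modelled as a set of (message index, attempt number) pairs that get dropped.

```python
from typing import Callable, List, Set, Tuple

Lost = Set[Tuple[int, int]]  # (message index, attempt number) pairs the network drops


def deliver_at_most_once(messages: List[str], lost: Lost,
                         handler: Callable[[str], None]) -> None:
    """Fire and forget: one attempt per message, a dropped message is never resent."""
    for i, msg in enumerate(messages):
        if (i, 1) not in lost:
            handler(msg)


def deliver_at_least_once(messages: List[str], lost: Lost,
                          handler: Callable[[str], None],
                          max_attempts: int = 3) -> List[str]:
    """Store and retry: resend each message until an attempt gets through.

    Returns the messages that were still undelivered after max_attempts.
    """
    undelivered = []
    for i, msg in enumerate(messages):
        for attempt in range(1, max_attempts + 1):
            if (i, attempt) in lost:
                continue  # no ack comes back, so the stored message is sent again
            handler(msg)
            break
        else:
            undelivered.append(msg)
    return undelivered


messages = ["a", "b", "c"]
lost = {(1, 1)}  # the first attempt at delivering "b" is dropped

seen_core: List[str] = []
deliver_at_most_once(messages, lost, seen_core.append)

seen_persistent: List[str] = []
undelivered = deliver_at_least_once(messages, lost, seen_persistent.append)
```

With the same dropped attempt, the at-most-once consumer never sees "b", while the at-least-once consumer receives it on the retry. The sketch does not model lost acknowledgements, which is where at-least-once delivery produces duplicates that consumers have to tolerate.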

Seen in

  • sources/2024-03-12-flyio-jit-wireguard-peers — canonical wiki instance; NATS as the dropped-push RPC transport that triggered the JIT rewrite.
  • sources/2025-03-27-flyio-operationalizing-macaroons — second Fly.io NATS-retirement datapoint; tkdb's RPC interface originally ran over NATS, wrapped in Noise because "NATS is a message bus, not a streaming secure channel" and "Our product security team can't trust NATS (it's not our code). That means a vulnerability in NATS can't result in us losing control of all our tokens." JP Phillips's team later replaced NATS with HTTP; the Noise layer stayed (canonical instance of patterns/noise-over-http).
  • sources/2025-10-02-flyio-litestream-v050-is-here — NATS JetStream as a Litestream replica type. First wiki instance of JetStream specifically (vs core NATS). Not in the retirement narrative; this is a positive-use datapoint for the persistent / at-least-once / replay layer that sits alongside S3 / GCS / Azure Blob as a Litestream v0.5.0 archive sink.