
FLYIO 2025-03-27 Tier 3


Fly.io — Operationalizing Macaroons

Summary

Thomas Ptacek's 2025-03-27 retrospective, written as Fly.io hands off internal ownership of the Macaroon stack to a new owner. The two-years-in assessment: the user-facing pitch for Macaroons (users can edit their own tokens, email them to partners) is a mixed bag — "users don't really take advantage of token features" — but Fly.io has gotten a pile of infrastructure wins internally, to the point that the token system is "one of the nicer parts of our platform." The post is a ground-level tour of how they built it: an isolated 5,000-line Go service called tkdb managing a SQLite database fronted by LiteFS (subsecond US→EU/AU replication plus primary failover) and Litestream (PITR to object storage); a hipster-implemented Noise channel layered on HTTP for RPC; client-side verification caching hitting >98%; a revocation subscription feed that lets clients prune caches reactively rather than redistributing the blacklist; a dedicated third-party-caveat-strip API for service tokens that keeps authenticator "hazmat" off worker hosts; and comprehensive OpenTelemetry + Honeycomb telemetry plus a permanent-retention OpenSearch audit trail. Two companion observations: (1) Macaroons are online-stateful — "you need a database somewhere" — so architectural quality is all about where that database lives; and (2) "[[concepts/keep-hazmat-away-from-complex-code|a basic rule of thumb in secure design is: keep hazmat away from complicated code]]" is the post's explicitly stated design principle. Fly.io is self-described as "allergic to microservices" but concedes tkdb (and a sibling secret service, Pet Semetary) "pulled their weight" — the canonical wiki instance of the narrow-purpose-security-microservice carve-out from a microservices-skeptical culture.

Key takeaways

  1. Macaroons are online-stateful — tkdb is the DB that backs them. "A Macaroon token starts with a random field (a nonce) and the first thing you do when verifying a token is to look that nonce up in a database." The most important Fly.io architectural decision was where that DB lives: not in the primary API cluster. Two reasons stated: (a) scalability + reliability — "far and away the most common failure mode of an outage on our platform is 'deploys are broken' … It would not be OK if 'deploys are broken' transitively meant 'deployed apps can't use security tokens.'" (b) security — "root secrets for Macaroon tokens are hazmat, and a basic rule of thumb in secure design is: keep hazmat away from complicated code." (Source: article body.)
  2. tkdb — 5,000 lines of Go managing SQLite via LiteFS + Litestream. Runs on isolated hardware in US, EU, and AU regions; DB records encrypted with an injected secret. LiteFS gives subsecond US→EU/AU replication + primary failover; Litestream gives PITR to object storage. "The entire tkdb database is just a couple dozen megs large. Most of that data isn't real. A full PITR recovery of the database takes just seconds. We use SQLite for a lot of our infrastructure, and this is one of the very few well-behaved databases we have." Writes are rare — only two write paths in practice: (a) HMAC root-key insertion when a Fly.io org is created, and (b) revocation-list appends. (Source: article body.)
  3. The database is small because Macaroons attenuate offline. "There's actually not much for us to store! The most complicated possible Macaroon still chains up to a single root key (we generate a key per Fly.io 'organization'; you don't share keys with your neighbors), and everything that complicates that Macaroon happens 'offline'. We take advantage of 'attenuation' far more than our users do." Offline attenuation means additional caveats get hashed onto the chain by whoever is using the token, without touching tkdb — so adding restrictions scales linearly with use rather than with token-system throughput.
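The mint/attenuate/verify split above can be sketched with a toy chained-HMAC construction. This is illustrative only — the types and function names are invented, not superfly/macaroon's actual API — but it shows why only minting and verification touch the database, while attenuation is purely offline:

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// A toy Macaroon: a random nonce, a caveat list, and a tag that is the
// HMAC chain over them. Only the authority knows the root key.
type Macaroon struct {
	Nonce   []byte
	Caveats []string
	Tag     []byte
}

func hmacSHA256(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// Mint is the online step at the authority: the initial tag is
// HMAC(rootKey, nonce).
func Mint(rootKey []byte) Macaroon {
	nonce := make([]byte, 16)
	rand.Read(nonce)
	return Macaroon{Nonce: nonce, Tag: hmacSHA256(rootKey, nonce)}
}

// Attenuate is offline: whoever holds the token re-keys the HMAC with
// the previous tag. No call to the token service.
func Attenuate(m Macaroon, caveat string) Macaroon {
	caveats := append(append([]string{}, m.Caveats...), caveat)
	return Macaroon{Nonce: m.Nonce, Caveats: caveats, Tag: hmacSHA256(m.Tag, []byte(caveat))}
}

// Verify is online again: look the nonce up in the database to find the
// root key (the lookup the post describes), then replay the chain.
func Verify(rootKeyByNonce map[string][]byte, m Macaroon) bool {
	rootKey, ok := rootKeyByNonce[string(m.Nonce)]
	if !ok {
		return false
	}
	tag := hmacSHA256(rootKey, m.Nonce)
	for _, c := range m.Caveats {
		tag = hmacSHA256(tag, []byte(c))
	}
	return hmac.Equal(tag, m.Tag)
}

func main() {
	rootKey := make([]byte, 32)
	rand.Read(rootKey)
	db := map[string][]byte{} // stand-in for tkdb's per-org root-key table

	m := Mint(rootKey)
	db[string(m.Nonce)] = rootKey

	narrowed := Attenuate(Attenuate(m, "app=my-app"), "action=read")
	fmt.Println(Verify(db, narrowed)) // true: the chain replays cleanly

	forged := narrowed
	forged.Caveats = []string{"action=admin"}
	fmt.Println(Verify(db, forged)) // false: the tag no longer matches
}
```

Note that the database is consulted once per lineage (keyed by nonce), which is exactly why adding restrictions scales with token use rather than with token-system throughput.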
  4. Transport: HTTP/Noise, deliberately over-engineered for security-context isolation. History: tkdb originally exported an RPC API over NATS, but "Our product security team can't trust NATS (it's not our code). That means a vulnerability in NATS can't result in us losing control of all our tokens, or allow attackers to spoof authentication." Plain RPC over NATS would let an attacker spoof "yes this token is fine" messages. Solution: implement Noise directly. Verification uses Noise_IK (like TLS — "anybody can verify, but everyone needs to prove they're talking to the real tkdb"); signing uses Noise_KK (like mTLS — "only a few components in our system can mint tokens, and they get a special client key"). A later migration replaced the NATS transport with HTTP but kept the Noise layer "out of laziness" — "a design smell, but the security model is nice: across many thousands of machines, there are only a handful with the cryptographic material needed to mint a new Macaroon token." Canonical wiki instance of patterns/noise-over-http.
  5. Routing: tkdb is a Fly App in isolated regions reached over FlyCast. Clients in Singapore route to AU; AU-out failover is transparent; the tkdb client library does exponential-backoff retry. Even with all that routing, "we don't like that Macaroon token verification is 'online'" — "when you operate a global public cloud one of the first things you learn is that the global Internet sucks."
  6. Verification cache — >98% hit rate. Property: chained-HMAC implies a verified parent token verifies any descendant. "Macaroons, as it turns out, cache beautifully. That's because once you've seen and verified a Macaroon, you have enough information to verify any more-specific Macaroon that descends from it; that's a property of their chaining HMAC construction. Our client libraries cache verifications, and the cache ratio for verification is over 98%." The 98% is a named reliability engineering win: "transoceanic links" on the auth path are mostly avoided in steady state.
  7. Revocation is first-class — "revocation isn't a corner case, it can't be an afterthought." Concrete SQL schema in the post:
    CREATE TABLE IF NOT EXISTS blacklist (
      nonce          BLOB NOT NULL UNIQUE,
      required_until DATETIME,
      created_at     DATETIME DEFAULT CURRENT_TIMESTAMP
    );
    
    Revocation is nonce-level: "revoke takes the random nonce from the beginning of the Macaroon, discarding the rest, and adds it to the blacklist. Every Macaroon in the lineage of that nonce is now dead." If revocation "doesn't work reliably, you wind up with 'cosmetic logout', which is a real vulnerability. When we kill a token, it needs to stay dead."
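Because attenuation never changes the nonce, one blacklist row kills an entire lineage. A trivial sketch of that property (an in-memory set standing in for the SQLite table above; in tkdb the revoke step would be an append along the lines of `INSERT INTO blacklist (nonce, required_until) VALUES (?, ?)`):

```go
package main

import "fmt"

// Token carries the lineage's nonce; every offline-attenuated
// descendant inherits it unchanged.
type Token struct {
	Nonce   string
	Caveats []string
}

// Blacklist is an in-memory stand-in for tkdb's blacklist table.
type Blacklist map[string]bool

func (b Blacklist) Revoke(t Token) { b[t.Nonce] = true }
func (b Blacklist) Dead(t Token) bool { return b[t.Nonce] }

func main() {
	b := Blacklist{}
	parent := Token{Nonce: "n-123"}
	child := Token{Nonce: parent.Nonce, Caveats: []string{"app=a"}}

	b.Revoke(parent)
	// Both die together: no "cosmetic logout" for descendants.
	fmt.Println(b.Dead(parent), b.Dead(child)) // true true
}
```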
  8. Revocation propagates via a subscription feed, not by distributing the blacklist. "We certainly don't want to propagate the blacklist database to 35 regions around the globe." Instead, the tkdb verification API exports a "feed of revocation notifications" that clients subscribe to (polled in practice). When revocations arrive, clients prune their caches. Fail-closed behavior on connectivity loss: if clients lose connectivity past a threshold interval, "they just dump their entire cache, forcing verification to happen at tkdb." This is the canonical inversion of distribute-blacklist-globally — distribute invalidations instead.
  9. Service tokens — strip the third-party authentication caveat via tkdb. Macaroons by themselves express authorization, not authentication. A user's Fly.io Macaroon has a third-party caveat that says "this token is only valid if accompanied by the discharge token for a user in your organization from our authentication system." For running-code service tokens, you don't want the authenticator token living next to the code. Solution: "tkdb exports an API that uses its access to token secrets to strip off the third-party authentication caveat. To call into that API, you have to present a valid discharging authentication token; that is, you have to prove you could already have done whatever the token said." Canonical patterns/third-party-caveat-strip-for-service-token. The returned service token has no expiration ("you don't usually want service tokens to expire"), but: the recipient can attenuate further — "lock it to a particular instance of flyd, or to a particular Fly Machine." Net: exfiltrating the service token doesn't help an attacker unless they also control the environment it's bound to. Traceability property: "every token used in production is traceable in some way to a valid token a user submitted."
  10. Pet Semetary — Fly's internal Vault replacement — uses the same third-party-caveat trick for secret reads. Pet Semetary is its own Macaroon authority managing user secrets (e.g., Postgres connection strings). Flyd (the orchestrator running on every worker) needs to inject secrets into Fly Machines at boot, but giving every flyd a read-every-user's-secret Macaroon collapses isolation down to "every worker is equally privileged". Solution: the "read secret" Macaroon flyd holds has a third-party caveat dischargeable only by proving, via normal Macaroon tokens, that you have the org's permissions. Access is traceable to an end-user action and minimized across the fleet.
  11. Telemetry: OpenTelemetry + Honeycomb + permanent-retention OpenSearch audit trail. Ptacek's explicit retraction of prior skepticism: "Once, I was an '80% of the value of tracing, we can get from logs and metrics' person. But I was wrong." OTel's context propagation gives "a single narrative about what's happening" from API server to tkdb. "The tkdb code is remarkably stable and there hasn't been an incident intervention with our token system in over a year." The audit trail is itself architecturally load-bearing: "virtually all the operations that happen on our platform are mediated by Macaroons, [so] this audit trail is itself pretty powerful."
  12. Culture → design disclosure. "As an engineering culture, we're allergic to 'microservices', and we flinched a bit at the prospect of adding a specific service just to manage tokens. But it's pulled its weight, and not added really any drama at all. We have at this point a second dedicated security service (Petsem), and even though they sort of rhyme with each other, we've got no plans to merge them. Rule #10 and all that." Explicit: narrow-purpose security services justify the carve-out from a microservices-averse default.
  13. Infrastructure-SQLite endorsement — earned, not assumed. Closing note: "a total victory for LiteFS, Litestream, and infrastructure SQLite. Which, after managing an infrastructure SQLite project that routinely ballooned to tens of gigabytes and occasionally threatened service outages, is lovely to see." The "infrastructure SQLite project that ballooned to tens of gigabytes" reference is almost certainly corrosion/corrosion2 — making this post an implicit contrast to corrosion2's operational footprint, where tkdb's dozens-of-megs is the happy counter-example.

Extracted systems

  • systems/tkdb — NEW. Fly.io's isolated Macaroon token service. 5000 lines of Go, SQLite-backed, LiteFS-replicated, Litestream-PITR'd. HTTP/Noise RPC.
  • systems/litefs — NEW. Primary/replica distributed SQLite substrate; subsecond US→EU/AU replication + primary failover for tkdb.
  • systems/litestream — NEW. Point-in-time SQLite replication to object storage; Fly uses it for tkdb recovery.
  • systems/sqlite — EXTEND. New role: root-of-trust store for a token authority. Adds to the Fly.io multi-SQLite-role story.
  • systems/petsem — NEW. "Pet Semetary". Fly's in-house Vault replacement; its own Macaroon authority.
  • systems/macaroon-superfly — NEW. github.com/superfly/macaroon — Fly.io's open-source Macaroon implementation (Go).
  • systems/nats — EXTEND. Additional retirement datapoint: tkdb RPC migrated NATS → HTTP (but kept Noise).
  • systems/honeycomb — NEW. Distributed-tracing backend Fly.io uses; Ptacek's explicit endorsement after prior skepticism.
  • systems/opentelemetry — NEW. Fly.io's tracing standard; OTel context propagation gives single-narrative request tracing for tkdb.
  • systems/flycast — EXTEND. Named here as the routing substrate for tkdb's multi-region isolated Fly-App deployment.

Extracted concepts

  • concepts/keep-hazmat-away-from-complex-code — EXTEND. Stated verbatim as the post's design principle: "keep hazmat away from complicated code."

Extracted patterns

  • patterns/noise-over-http — NEW. Canonical instance: tkdb's Noise-layered HTTP RPC transport (takeaway 4).
  • patterns/third-party-caveat-strip-for-service-token — NEW. Canonical instance: tkdb's service-token strip API (takeaway 9).

Operational numbers

  • tkdb code size: ~5,000 lines of Go. (Source: article body.)
  • tkdb SQLite database size: "a couple dozen megs". Most of that data isn't real (test/noise).
  • PITR full recovery: "just seconds".
  • Client verification cache hit rate: >98% ("over 98%").
  • LiteFS replication: subsecond US primary → EU + AU replicas.
  • Regions: 3 isolated — US, EU, AU.
  • Incident intervention count in token system: 0 in over a year.
  • Reachability mechanism: FlyCast Anycast routing (Singapore → AU; AU-out → closest backup).
  • Noise patterns used: Noise_IK for verification (server-auth only — like TLS); Noise_KK for signing (mutual-auth — like mTLS).
  • Global deploy surface: "35 regions around the globe" (stated to justify not replicating the blacklist globally).
  • Macaroon library scope: most of the code is open source at github.com/superfly/macaroon; exact LOC not stated.

Caveats and things not disclosed

  • Exact SQLite schema beyond the blacklist table is not shown (root-key table, signing-key tables, etc. — not dumped).
  • Noise keypair provisioning — how the "handful" of signing clients get their Noise_KK client key at bootstrap is not documented.
  • Revocation-feed mechanics — named as "subscribe (really, polls)"; poll interval not disclosed. Threshold for cache-dump on disconnect not disclosed.
  • Macaroon library metrics — only "Most of the code is open source"; unclear what isn't.
  • The SQLite-ballooned-to-tens-of-gigs project referenced in the closing is implied (very likely corrosion) but not named.
  • Audit-trail scale — index size, retention policy details, query latency on the OpenSearch cluster not disclosed.
  • Macaroon-feature-underuse — the "users don't really take advantage of token features" claim is qualitative; no usage-distribution breakdown.
  • Pet Semetary architecture — mentioned only as a second instance of the pattern; its own deep dive is not in this post.
  • Deployment isolation — "Fly-only isolated regions" for tkdb are named but not defined (worker pool? region? account isolation?).
  • Comparative performance — no p50/p99 numbers for signing or verification round-trips; no latency budget quoted.

Cross-references within Fly.io wiki

This is the eleventh Fly.io ingest on this wiki (after FKS, Tigris, JIT WireGuard, AWS OIDC, L40S, Making-Machines-Move, Livebook/FLAME, VSCode-SSH-bananas, the Exit Interview, and the Rust-Proxy incident). It is the deepest architectural disclosure of Fly.io's security-token stack to date — prior Macaroon coverage on this wiki lived almost entirely in sources/2024-06-19-flyio-aws-without-access-keys (the workload-identity / OIDC-federation angle). This post is the complementary half: inside the Fly.io-to-Fly.io trust boundary rather than Fly.io-to-AWS.

Source
