Fly.io — Operationalizing Macaroons
Summary
Thomas Ptacek's 2025-03-27 retrospective written as Fly.io hands off
internal ownership of the Macaroon stack to a new owner. Two-years-in
assessment: the user-facing pitch for
Macaroons (users can edit their own tokens,
email them to partners) is a mixed bag — "users don't really take
advantage of token features" — but Fly.io has gotten a pile of
infrastructure wins internally, to the point that the token system is
"one of the nicer parts of our platform." The post is a
tour of how they built it: an isolated 5000-line Go service called
tkdb managing a SQLite database fronted by
LiteFS (subsecond replication US→EU/AU + primary
failover) and Litestream (PITR to object
storage); a hipster-implemented Noise
channel layered on HTTP for RPC; client-side verification caching
hitting >98%; a revocation subscription feed (polled in practice) that
lets clients prune caches without replicating the blacklist; a dedicated
third-party-caveat-strip API for service tokens that keeps authenticator
"hazmat" off worker hosts; and comprehensive OpenTelemetry +
Honeycomb tracing plus a permanent-retention OpenSearch audit trail. Two companion
observations: (1) Macaroons are
online-stateful — "you need a
database somewhere" — so architectural quality is all about where
that database lives; and (2) "a basic rule of thumb in secure design is:
keep hazmat away from complicated code" is the post's explicitly stated design
principle. Fly.io is self-described as "allergic to microservices"
but concedes tkdb (and a sibling secret service
Pet Semetary) "pulled their weight" — canonical
wiki instance of the narrow-purpose-security-microservice
carve-out from a microservices-skeptical culture.
Key takeaways
- Macaroons are online-stateful — `tkdb` is the DB that backs them. "A Macaroon token starts with a random field (a nonce) and the first thing you do when verifying a token is to look that nonce up in a database." The most important Fly.io architectural decision was where that DB lives: not in the primary API cluster. Two reasons stated: (a) scalability + reliability — "far and away the most common failure mode of an outage on our platform is 'deploys are broken' … It would not be OK if 'deploys are broken' transitively meant 'deployed apps can't use security tokens.'" (b) security — "root secrets for Macaroon tokens are hazmat, and a basic rule of thumb in secure design is: keep hazmat away from complicated code." (Source: article body.)
- `tkdb` — 5,000 lines of Go managing SQLite via LiteFS + Litestream. Runs on isolated hardware in US, EU, and AU regions; DB records encrypted with an injected secret. LiteFS gives subsecond US→EU/AU replication + primary failover; Litestream gives PITR to object storage. "The entire `tkdb` database is just a couple dozen megs large. Most of that data isn't real. A full PITR recovery of the database takes just seconds. We use SQLite for a lot of our infrastructure, and this is one of the very few well-behaved databases we have." Writes are rare — only two write paths in practice: (a) HMAC root-key insertion when a Fly.io org is created, and (b) revocation-list appends. (Source: article body.)
- The database is small because Macaroons
attenuate offline. "There's
actually not much for us to store! The most complicated possible
Macaroon still chains up to a single root key (we generate a key
per Fly.io 'organization'; you don't share keys with your
neighbors), and everything that complicates that Macaroon happens
'offline'. We take advantage of 'attenuation' far more than our
users do." Offline attenuation means additional caveats get
hashed onto the chain by whoever is using the token, without
touching `tkdb`, so the cost of adding restrictions falls on token holders and never on the token authority.
- Transport: HTTP/Noise, deliberately over-engineered for security-context isolation. History: `tkdb` originally exported an RPC API over NATS, but "Our product security team can't trust NATS (it's not our code). That means a vulnerability in NATS can't result in us losing control of all our tokens, or allow attackers to spoof authentication." Plain RPC over NATS would let an attacker who compromised NATS spoof "yes this token is fine" messages. Solution: implement Noise directly. Verification uses `Noise_IK` (like TLS — "anybody can verify, but everyone needs to prove they're talking to the real `tkdb`"); signing uses `Noise_KK` (like mTLS — "only a few components in our system can mint tokens, and they get a special client key"). A later migration replaced the NATS transport with HTTP but kept the Noise layer "out of laziness" — "a design smell, but the security model is nice: across many thousands of machines, there are only a handful with the cryptographic material needed to mint a new Macaroon token." Canonical wiki instance of patterns/noise-over-http.
- Routing: `tkdb` is a Fly App in isolated regions reached over FlyCast. Clients in Singapore route to AU; AU-out failover is transparent; the `tkdb` client library does exponential-backoff retry. Even with all that routing, "we don't like that Macaroon token verification is 'online'" — "when you operate a global public cloud one of the first things you learn is that the global Internet sucks."
- Verification cache — >98% hit rate. Property: chained-HMAC implies a verified parent token verifies any descendant. "Macaroons, as it turns out, cache beautifully. That's because once you've seen and verified a Macaroon, you have enough information to verify any more-specific Macaroon that descends from it; that's a property of their chaining HMAC construction. Our client libraries cache verifications, and the cache ratio for verification is over 98%." The 98% is a named reliability engineering win: "transoceanic links" on the auth path are mostly avoided in steady state.
- Revocation is first-class — "revocation isn't a corner case, it can't be an afterthought." Concrete SQL schema in the post:

      CREATE TABLE IF NOT EXISTS blacklist (
          nonce BLOB NOT NULL UNIQUE,
          required_until DATETIME,
          created_at DATETIME DEFAULT CURRENT_TIMESTAMP
      );

  Revocation is nonce-level: "revoke takes the random nonce from the beginning of the Macaroon, discarding the rest, and adds it to the blacklist. Every Macaroon in the lineage of that nonce is now dead." If revocation "doesn't work reliably, you wind up with 'cosmetic logout', which is a real vulnerability. When we kill a token, it needs to stay dead."
- Revocation propagates via a
subscription feed, not
by distributing the blacklist. "We certainly don't want to
propagate the blacklist database to 35 regions around the
globe." Instead, the `tkdb` verification API exports a "feed of revocation notifications" that clients subscribe to (polled in practice). When revocations arrive, clients prune their caches. Fail-closed behavior on connectivity loss: if clients lose connectivity past a threshold interval, "they just dump their entire cache, forcing verification to happen at `tkdb`." This is the canonical inversion of distribute-blacklist-globally — distribute invalidations instead.
- Service tokens — strip the third-party authentication caveat via `tkdb`. Macaroons by themselves express authorization, not authentication. A user's Fly.io Macaroon has a third-party caveat that says "this token is only valid if accompanied by the discharge token for a user in your organization from our authentication system." For running-code service tokens, you don't want the authenticator token living next to the code. Solution: "`tkdb` exports an API that uses its access to token secrets to strip off the third-party authentication caveat. To call into that API, you have to present a valid discharging authentication token; that is, you have to prove you could already have done whatever the token said." Canonical patterns/third-party-caveat-strip-for-service-token. The returned service token has no expiration ("you don't usually want service tokens to expire"), but the recipient can attenuate further — "lock it to a particular instance of flyd, or to a particular Fly Machine." Net: exfiltrating the service token doesn't help an attacker unless they also control the environment it's bound to. Traceability property: "every token used in production is traceable in some way to a valid token a user submitted."
- Pet Semetary — Fly's internal Vault replacement — uses the same third-party-caveat trick for secret reads. Pet Semetary is its own Macaroon authority managing user secrets (e.g., Postgres connection strings). Flyd (the orchestrator running on every worker) needs to inject secrets into Fly Machines at boot, but giving every flyd a read-every-user's-secret Macaroon collapses isolation down to "every worker is equally privileged". Solution: the "read secret" Macaroon flyd holds has a third-party caveat dischargeable only by proving, via normal Macaroon tokens, that you have the org's permissions. Access is traceable to an end-user action and minimized across the fleet.
- Telemetry: OpenTelemetry +
Honeycomb + permanent-retention
OpenSearch audit trail. Ptacek's
explicit retraction of prior skepticism: "Once, I was an '80%
of the value of tracing, we can get from logs and metrics'
person. But I was wrong." OTel's
context propagation gives
"a single narrative about what's happening" from API server to `tkdb`. "The `tkdb` code is remarkably stable and there hasn't been an incident intervention with our token system in over a year." The audit trail is itself architecturally load-bearing: "virtually all the operations that happen on our platform are mediated by Macaroons, [so] this audit trail is itself pretty powerful."
- Culture → design disclosure. "As an engineering culture, we're allergic to 'microservices', and we flinched a bit at the prospect of adding a specific service just to manage tokens. But it's pulled its weight, and not added really any drama at all. We have at this point a second dedicated security service (Petsem), and even though they sort of rhyme with each other, we've got no plans to merge them. Rule #10 and all that." Explicit: narrow-purpose security services justify the carve-out from a microservices-averse default.
- Infrastructure-SQLite endorsement — earned, not assumed.
Closing note: "a total victory for LiteFS, Litestream, and
infrastructure SQLite. Which, after managing an infrastructure
SQLite project that routinely ballooned to tens of gigabytes
and occasionally threatened service outages, is lovely to
see." The "infrastructure SQLite project that ballooned to
tens of gigabytes" reference is almost certainly
corrosion/corrosion2 — making this
post an implicit contrast to corrosion2's operational
footprint, where
`tkdb`'s dozens-of-megs is the happy counter-example.
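
The nonce-level revocation described in the takeaways above can be sketched with Python's built-in `sqlite3` module and the `blacklist` schema quoted from the post. This is a minimal illustration, not `tkdb`'s code: the helper names `revoke` and `is_revoked`, the in-memory database, and the 16-byte nonce are all assumptions, and `required_until` is left NULL because the post does not explain how it is populated.

```python
import os
import sqlite3

# Schema as quoted in the post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS blacklist (
    nonce BLOB NOT NULL UNIQUE,
    required_until DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db() -> sqlite3.Connection:
    # Hypothetical setup: an in-memory database stands in for the real store.
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def revoke(db: sqlite3.Connection, nonce: bytes) -> None:
    # Only the nonce is stored; the rest of the token is discarded.
    # INSERT OR IGNORE keeps a repeated revocation from violating UNIQUE.
    db.execute("INSERT OR IGNORE INTO blacklist (nonce) VALUES (?)", (nonce,))
    db.commit()


def is_revoked(db: sqlite3.Connection, nonce: bytes) -> bool:
    # Every token in the lineage shares this nonce, so one lookup covers them all.
    row = db.execute(
        "SELECT 1 FROM blacklist WHERE nonce = ? LIMIT 1", (nonce,)
    ).fetchone()
    return row is not None


db = open_db()
nonce = os.urandom(16)  # assumed nonce size, for illustration only
revoke(db, nonce)
revoke(db, nonce)  # revoking twice is harmless
```

Because the lookup key is the nonce alone, attenuated descendants need no rows of their own, which is consistent with the post's point that the database stays small.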
Extracted systems
- systems/tkdb — NEW. Fly.io's isolated Macaroon token service. 5000 lines of Go, SQLite-backed, LiteFS-replicated, Litestream-PITR'd. HTTP/Noise RPC.
- systems/litefs — NEW. Primary/replica distributed SQLite substrate; subsecond US→EU/AU replication + primary failover for `tkdb`.
- systems/litestream — NEW. Point-in-time SQLite replication to object storage; Fly uses it for `tkdb` recovery.
- systems/sqlite — EXTEND. New role: root-of-trust store for a token authority. Adds to the Fly.io multi-SQLite-role story.
- systems/petsem — NEW. "Pet Semetary". Fly's in-house Vault replacement; its own Macaroon authority.
- systems/macaroon-superfly — NEW. `github.com/superfly/macaroon` — Fly.io's open-source Macaroon implementation (Go).
- systems/nats — EXTEND. Additional retirement datapoint: `tkdb` RPC migrated NATS → HTTP (but kept Noise).
- systems/honeycomb — NEW. Distributed-tracing backend Fly.io uses; Ptacek's explicit endorsement after prior skepticism.
- systems/opentelemetry — NEW. Fly.io's tracing standard; OTel context propagation gives single-narrative request tracing for `tkdb`.
- systems/flycast — EXTEND. Named here as the routing substrate for `tkdb`'s multi-region isolated Fly-App deployment.
Extracted concepts
- concepts/macaroon-token — NEW. What a Macaroon is (bearer token with chained-HMAC user-attenuation primitive).
- concepts/chained-hmac-construction — NEW. The cryptographic construction that makes Macaroons cache-friendly (verifying a parent verifies all descendants).
- concepts/attenuation-offline — NEW. Adding caveats without touching the token authority.
- concepts/online-stateful-token — NEW. Tokens that require a database lookup on verify; Macaroons are this, JWTs are not.
- concepts/third-party-caveat — NEW. The Macaroon mechanism for requiring an external system to issue a discharge.
- concepts/discharge-token — NEW. The companion token that satisfies a third-party caveat.
- concepts/authorization-vs-authentication-token — NEW. Fly's explicit split: Macaroons carry authorization, a separate discharge carries authentication.
- concepts/revocation-feed-subscription — NEW. Propagating revocations via a subscribable feed rather than a replicated blacklist.
- concepts/cosmetic-logout — NEW. The failure mode of a logout that doesn't revoke the underlying token.
- concepts/verification-cache — NEW. High-hit-rate local cache of token-verification results.
- concepts/keep-hazmat-away-from-complex-code — NEW. Fly's stated secure-design heuristic.
- concepts/context-propagation-otel — NEW. OTel's single-narrative request trace primitive.
- concepts/audit-trail-in-opensearch — NEW. Permanent-retention audit log for security-operation traceability.
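
The chained-HMAC construction, offline attenuation, and the cache-friendly "verified parent verifies descendants" property listed above can be illustrated with the standard `hmac` module. This is a simplified sketch under stated assumptions: the dict-based token layout, the function names, and the example caveat strings are hypothetical and do not reflect the wire format of `github.com/superfly/macaroon`. It checks only the signature chain; a real verifier must also evaluate each caveat's predicate and handle third-party caveats.

```python
import hashlib
import hmac
import os


def chain(tag: bytes, data: bytes) -> bytes:
    # One link of the chain: the previous tag is the HMAC key for the next field.
    return hmac.new(tag, data, hashlib.sha256).digest()


def mint(root_key: bytes, nonce: bytes) -> dict:
    # Minting needs the root key, so it happens only at the token authority.
    return {"nonce": nonce, "caveats": [], "tag": chain(root_key, nonce)}


def attenuate(token: dict, caveat: bytes) -> dict:
    # Offline: needs only the token in hand, never the root key.
    return {
        "nonce": token["nonce"],
        "caveats": token["caveats"] + [caveat],
        "tag": chain(token["tag"], caveat),
    }


def verify_with_root(root_key: bytes, token: dict) -> bool:
    # Full verification: recompute the whole chain from the root key.
    tag = chain(root_key, token["nonce"])
    for caveat in token["caveats"]:
        tag = chain(tag, caveat)
    return hmac.compare_digest(tag, token["tag"])


def verify_from_parent(parent: dict, child: dict) -> bool:
    # Cache-friendly path: extend the chain from an already-verified parent.
    n = len(parent["caveats"])
    if child["nonce"] != parent["nonce"] or child["caveats"][:n] != parent["caveats"]:
        return False
    tag = parent["tag"]
    for caveat in child["caveats"][n:]:
        tag = chain(tag, caveat)
    return hmac.compare_digest(tag, child["tag"])


root = os.urandom(32)
base = mint(root, os.urandom(16))
narrowed = attenuate(attenuate(base, b"app=example"), b"action=read")
```

Here `verify_from_parent(base, narrowed)` succeeds without the root key, which is why a client that has seen `base` verified once does not need another round trip for its more restricted descendants.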
Extracted patterns
- patterns/isolated-token-service — NEW. The wiki-level pattern `tkdb` instantiates: carve out the token-authority database onto isolated hardware, off the primary-API blast-radius.
- patterns/sqlite-plus-litefs-plus-litestream — NEW. The concrete storage-stack recipe: SQLite + LiteFS (replication + failover) + Litestream (PITR).
- patterns/verification-cache-with-revocation-feed — NEW. >98% client-side verification cache + subscribe-to-revocations + fail-closed on disconnect.
- patterns/third-party-caveat-strip-for-service-token — NEW. User discharges an authentication caveat once, `tkdb` returns a version of the token without it, recipient locks it down further via attenuation.
- patterns/attenuate-on-use — NEW. Transmit only the minimum privilege needed per operation; attenuate before every call.
- patterns/noise-over-http — NEW. Keep Noise as the crypto envelope even after the message-bus transport is replaced with HTTP.
- patterns/caveat-for-privilege-separation — NEW. Use a third-party caveat (dischargeable only by proving org permissions) to bind a broadly-privileged secret-reader token to specific end-user actions, across a fleet.
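
The verification-cache-with-revocation-feed pattern above can be sketched as a small client-side cache. Everything here is an assumption made for illustration: the class and method names, keying the cache by nonce alone, and the 30-second silence threshold (the post discloses neither the poll interval nor the threshold). The clock is injectable so the fail-closed path can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, Set


class VerificationCache:
    """Client-side cache that fails closed when the revocation feed goes quiet."""

    def __init__(self, max_feed_silence: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_feed_silence = max_feed_silence
        self._clock = clock
        self._verified: Set[bytes] = set()
        self._last_feed = clock()

    def remember(self, nonce: bytes) -> None:
        # Call only after the token authority has verified the token.
        self._verified.add(nonce)

    def on_feed(self, revoked_nonces: Iterable[bytes]) -> None:
        # Called after each successful poll of the feed, even an empty one.
        for nonce in revoked_nonces:
            self._verified.discard(nonce)
        self._last_feed = self._clock()

    def is_cached(self, nonce: bytes) -> bool:
        # Past the threshold, dump everything so verification goes back online.
        if self._clock() - self._last_feed > self._max_feed_silence:
            self._verified.clear()
            return False
        return nonce in self._verified


now = [0.0]  # fake clock for the example
cache = VerificationCache(max_feed_silence=30.0, clock=lambda: now[0])
cache.remember(b"nonce-a")
cache.remember(b"nonce-b")
cache.on_feed([b"nonce-a"])  # a revocation arrives for nonce-a
```

After the feed delivers `nonce-a`, only `nonce-b` remains cached; advancing the fake clock past the threshold empties the cache entirely, so a client cut off from the feed cannot keep honoring a token that may have been revoked.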
Operational numbers
- `tkdb` code size: ~5,000 lines of Go. (Source: article body.)
- `tkdb` SQLite database size: "a couple dozen megs". Most of that data isn't real (test/noise).
- PITR full recovery: "just seconds".
- Client verification cache hit rate: >98% ("over 98%").
- LiteFS replication: subsecond US primary → EU + AU replicas.
- Regions: 3 isolated — US, EU, AU.
- Incident intervention count in token system: 0 in over a year.
- Reachability mechanism: FlyCast Anycast routing (Singapore → AU; AU-out → closest backup).
- Noise patterns used: `Noise_IK` for verification (server-auth only — like TLS); `Noise_KK` for signing (mutual-auth — like mTLS).
- Global deploy surface: "35 regions around the globe" (stated to justify not replicating the blacklist globally).
- Internal Macaroon library LOC & scope: most is open source at
github.com/superfly/macaroon.
Caveats and things not disclosed
- Exact SQLite schema beyond the `blacklist` table is not shown (root-key table, signing-key tables, etc. — not dumped).
- Noise keypair provisioning — how the "handful" of signing clients get their `Noise_KK` client key at bootstrap is not documented.
- Revocation-feed mechanics — named as "subscribe (really, polls)"; poll interval not disclosed. Threshold for cache-dump on disconnect not disclosed.
- Macaroon library metrics — only "Most of the code is open source"; unclear what isn't.
- The SQLite-ballooned-to-tens-of-gigs project referenced in the closing is implied (very likely corrosion) but not named.
- Audit-trail scale — index size, retention policy details, query latency on the OpenSearch cluster not disclosed.
- Macaroon-feature-underuse — the "users don't really take advantage of token features" claim is qualitative; no usage-distribution breakdown.
- Pet Semetary architecture — mentioned only as a second instance of the pattern; its own deep dive is not in this post.
- Deployment isolation — "Fly-only isolated regions" for `tkdb` are named but not defined (worker pool? region? account isolation?).
- Comparative performance — no p50/p99 numbers for signing or verification round-trips; no latency budget quoted.
Cross-references within Fly.io wiki
This is the eleventh Fly.io ingest on this wiki (after FKS, Tigris, JIT WireGuard, AWS OIDC, L40S, Making-Machines-Move, Livebook/FLAME, VSCode-SSH-bananas, the Exit Interview, and the Rust-Proxy incident). It is the deepest architectural disclosure of Fly.io's security-token stack to date — prior Macaroon coverage on this wiki lived almost entirely in sources/2024-06-19-flyio-aws-without-access-keys (workload-identity / OIDC-federation angle). This post is the complementary half: inside the Fly.io-to-Fly.io trust boundary rather than Fly.io-to-AWS.
- Extends sources/2024-06-19-flyio-aws-without-access-keys — AWS-OIDC post described how Fly Machines use Macaroons (env-variable-driven credential chain + Unix-socket metadata service); this post describes how they're issued, verified, revoked, and cached behind the scenes.
- Extends sources/2024-03-12-flyio-jit-wireguard-peers — the NATS-retirement arc gets another datapoint: `tkdb` is another internal service that moved NATS → HTTP (though it kept Noise as the envelope, unlike `flyd` which went plain HTTP).
- Complementary contrast to sources/2025-02-12-flyio-the-exit-interview-jp-phillips — that post is JP's defense of BoltDB-for-flyd (blast-radius-of-ad-hoc-SQL argument). This post is the counter-instance: `tkdb` is exactly the "SQLite works great when it's tiny and one team owns it" case JP would endorse.
- Cross-references sources/2025-02-26-flyio-taming-a-voracious-rust-proxy for the OpenTelemetry theme — both posts cite OTel+Honeycomb as load-bearing for Fly.io operations.
Source
- Original: https://fly.io/blog/operationalizing-macaroons/
- Raw markdown: `raw/flyio/2025-03-27-operationalizing-macaroons-35df34f6.md`
Related
- companies/flyio
- systems/tkdb · systems/litefs · systems/litestream · systems/petsem · systems/macaroon-superfly · systems/sqlite · systems/nats · systems/honeycomb · systems/opentelemetry · systems/flycast
- concepts/macaroon-token · concepts/chained-hmac-construction · concepts/attenuation-offline · concepts/online-stateful-token · concepts/third-party-caveat · concepts/discharge-token · concepts/authorization-vs-authentication-token · concepts/revocation-feed-subscription · concepts/cosmetic-logout · concepts/verification-cache · concepts/keep-hazmat-away-from-complex-code · concepts/context-propagation-otel · concepts/audit-trail-in-opensearch
- patterns/isolated-token-service · patterns/sqlite-plus-litefs-plus-litestream · patterns/verification-cache-with-revocation-feed · patterns/third-party-caveat-strip-for-service-token · patterns/attenuate-on-use · patterns/noise-over-http · patterns/caveat-for-privilege-separation
- sources/2024-06-19-flyio-aws-without-access-keys — complementary half (how Macaroons are used across a trust boundary).