tkdb¶
tkdb is Fly.io's isolated Macaroon-token
authority — the root-of-trust database that every security
token on Fly.io's platform is verified against, and the signing
service that mints new ones. "About 5000 lines of Go code that
manages a SQLite database that is in turn managed by
LiteFS and Litestream."
(Source: sources/2025-03-27-flyio-operationalizing-macaroons.)
tkdb is the canonical wiki instance of
patterns/isolated-token-service: pull the token-authority
database off the primary API cluster onto isolated hardware,
both to cap blast radius when the API misbehaves and to keep
root secrets ("hazmat") away from the most complicated code in
the platform
(concepts/keep-hazmat-away-from-complex-code).
Architectural shape¶
Clients (flyd, API, edge proxy, petsem, ...)
│
▼ FlyCast Anycast (→ nearest region)
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    tkdb-US    │  │    tkdb-EU    │  │    tkdb-AU    │
│   (primary)   │◀─┤   (replica)   │  │   (replica)   │
└───────────────┘  └───────────────┘  └───────────────┘
│ subsecond LiteFS replication
▼
Litestream → Object Storage (PITR)
- Runtime: tkdb is a Fly App — "albeit deployed in special Fly-only isolated regions". Three regions: US (primary), EU, AU.
- Storage substrate: SQLite, replicated by LiteFS (subsecond US→EU/AU + primary failover), backed up by Litestream to object storage (PITR).
- Reachability: FlyCast — Fly's internal Anycast service. "If you're in Singapore, you're probably going to get routed to the Australian tkdb. If Australia falls over, you'll get routed to the closest backup. The proxy that implements FlyCast is smart, as is the tkdb client library, which will do exponential backoff retry transparently."
- At-rest encryption: "records in the database are encrypted with an injected secret." The secret is injected (not stored in the DB).
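The transparent retry behaviour described for the client library can be illustrated with a minimal sketch. This is not the tkdb client's actual code: the function name `call_with_backoff`, its parameters, and the use of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def call_with_backoff(op, attempts=5, base=0.05, cap=2.0, sleep=time.sleep):
    """Call op(); on ConnectionError, retry with capped, jittered exponential delay.

    All names and defaults here are hypothetical, not taken from the tkdb client.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # The delay ceiling doubles each attempt: base, 2*base, 4*base, ... up to cap.
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))  # "full jitter" spreads out retry storms
```

The `sleep` parameter is injectable so the retry schedule can be observed without actually waiting. In a setup like the one described, each retry would simply be re-routed by Anycast to whichever region is still healthy.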
The two APIs¶
tkdb exposes two RPC surfaces over an HTTP/Noise channel
(patterns/noise-over-http):
| API | Noise pattern | Who can call | Purpose |
|---|---|---|---|
| Verification | Noise_IK | anybody (server-auth only) | verify a token; subscribe to revocation feed |
| Signing | Noise_KK | handful with client key | mint new tokens; revoke; strip 3p caveats |
Noise_IK works like TLS: the client authenticates the server ("everyone needs to prove they're talking to the real tkdb") but the server doesn't require a client identity. Noise_KK works like mTLS: the client also presents a pre-provisioned static key. "Across many thousands of machines, there are only a handful with the cryptographic material needed to mint a new Macaroon token."
The transport was originally RPC-over-NATS; JP Phillips's team
later replaced NATS with HTTP, but "out of laziness, we kept
the Noise stuff, which means the interface to tkdb is now
HTTP/Noise. This is a design smell, but the security model is
nice."
What's in the database¶
Writes are explicitly rare:
- HMAC root-key insertion — "we just need to record an HMAC key when Fly.io organizations are created (that is, roughly, when people sign up for the service and actually do a deploy)." One root key per Fly.io org ("you don't share keys with your neighbors").
- Revocation list — the blacklist table, keyed on Macaroon nonce.
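A nonce-keyed revocation table might look roughly like the sketch below. Only the table name `blacklist` and the nonce key come from the text above; the exact column definitions, the `created_at` column, and the helper functions `revoke` and `is_revoked` are assumptions for illustration, not tkdb's actual schema or code.

```python
import sqlite3

# In-memory database for illustration; tkdb's real database lives on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS blacklist ("
    " nonce BLOB NOT NULL UNIQUE,"                      # one row per revoked Macaroon nonce
    " created_at DATETIME DEFAULT CURRENT_TIMESTAMP)"   # hypothetical bookkeeping column
)


def revoke(conn, nonce: bytes) -> None:
    """Record a nonce as revoked; revoking the same nonce twice is a no-op."""
    conn.execute("INSERT OR IGNORE INTO blacklist (nonce) VALUES (?)", (nonce,))


def is_revoked(conn, nonce: bytes) -> bool:
    """Return True if the nonce appears in the revocation list."""
    row = conn.execute(
        "SELECT 1 FROM blacklist WHERE nonce = ?", (nonce,)
    ).fetchone()
    return row is not None
```

Because every token derived from a Macaroon carries the same nonce, one row is enough to revoke the whole family of attenuated descendants.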
Result: total DB size ≈ "a couple dozen megs", and "most
of that data isn't real."
(Attenuation happens offline —
all per-use caveats get hashed onto the chain without touching
tkdb, which is the load-bearing reason the DB stays small.)
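Offline attenuation follows from the general Macaroon construction, sketched below under simplifying assumptions. The dict-based token layout and the function names are hypothetical, HMAC-SHA256 is chosen for illustration, and this is not the macaroon-superfly wire format. The sketch checks only the signature chain, not whether each caveat's predicate holds.

```python
import hashlib
import hmac


def _chain(tag: bytes, data: bytes) -> bytes:
    """One link of the chain: HMAC-SHA256 keyed by the previous tag."""
    return hmac.new(tag, data, hashlib.sha256).digest()


def mint(root_key: bytes, nonce: bytes) -> dict:
    """Authority side: the only step that needs the root key to create a token."""
    return {"nonce": nonce, "caveats": [], "tag": _chain(root_key, nonce)}


def attenuate(token: dict, caveat: bytes) -> dict:
    """Holder side: needs only the current tag, never the root key or the database."""
    return {
        "nonce": token["nonce"],
        "caveats": token["caveats"] + [caveat],
        "tag": _chain(token["tag"], caveat),
    }


def verify_signature(root_key: bytes, token: dict) -> bool:
    """Authority side: replay the chain from the root key and compare tags."""
    tag = _chain(root_key, token["nonce"])
    for caveat in token["caveats"]:
        tag = _chain(tag, caveat)
    return hmac.compare_digest(tag, token["tag"])
```

Adding a caveat is a pure computation over the token the holder already has, so no write and no round trip to the authority is needed. Only minting and revocation change stored state.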
Why tkdb is not colocated with the primary API¶
Two stated reasons:
- Reliability / blast-radius. "Far and away the most common failure mode of an outage on our platform is 'deploys are broken', and those failures are usually caused by API instability. It would not be OK if 'deploys are broken' transitively meant 'deployed apps can't use security tokens'."
- Security. "Root secrets for Macaroon tokens are hazmat, and a basic rule of thumb in secure design is: keep hazmat away from complicated code." Canonical wiki statement of concepts/keep-hazmat-away-from-complex-code.
Caching: offload most verification away from tkdb¶
Verification-cache hit rate on the client side is "over 98%"
thanks to Macaroons'
chained-HMAC property
(verifying a parent verifies any descendant). Revocation is
propagated via a subscription feed the verification API
exports — clients poll, prune caches on revocation
notifications, and dump the entire cache if disconnected
past a threshold, forcing verifications back through tkdb
(fail-closed).
See patterns/verification-cache-with-revocation-feed.
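The caching behaviour described above can be sketched as follows. The class, its method names, the 30-second staleness threshold, and the `verify_remote` callback (standing in for a call to the verification API) are all assumptions for illustration, as is the HMAC-SHA256 chain. The real client library may be structured quite differently.

```python
import hashlib
import hmac
import time


class VerificationCache:
    """Hypothetical client-side cache keyed on (nonce, caveat prefix)."""

    def __init__(self, verify_remote, max_feed_silence=30.0, clock=time.monotonic):
        self._verify_remote = verify_remote   # stand-in for the verification API call
        self._max_silence = max_feed_silence  # assumed staleness threshold, in seconds
        self._clock = clock
        self._last_feed = clock()
        self._verified = {}                   # (nonce, caveats) -> tag vouched for remotely

    def on_feed_poll(self, revoked_nonces):
        """Record a successful poll of the revocation feed and prune revoked entries."""
        self._last_feed = self._clock()
        revoked = set(revoked_nonces)
        self._verified = {
            key: tag for key, tag in self._verified.items() if key[0] not in revoked
        }

    def verify(self, nonce, caveats, tag):
        if self._clock() - self._last_feed > self._max_silence:
            self._verified.clear()  # feed is stale: stop trusting cached results
        caveats = tuple(caveats)
        # Look for the longest cached ancestor and extend its tag locally.
        for cut in range(len(caveats), -1, -1):
            parent_tag = self._verified.get((nonce, caveats[:cut]))
            if parent_tag is None:
                continue
            derived = parent_tag
            for caveat in caveats[cut:]:
                derived = hmac.new(derived, caveat, hashlib.sha256).digest()
            return hmac.compare_digest(derived, tag)
        # Cache miss: ask the authority, and remember only positive answers.
        if self._verify_remote(nonce, caveats, tag):
            self._verified[(nonce, caveats)] = tag
            return True
        return False
```

Once a parent token has been verified remotely, any descendant is checked by recomputing the chain from the cached parent tag, which is why most verifications never leave the client. If the feed goes quiet past the threshold, the cache is emptied and every check goes back through `verify_remote`; should that call fail too, the error propagates and the token is not accepted.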
Why SQLite works here (where it doesn't elsewhere at Fly)¶
"We use SQLite for a lot of our infrastructure, and this is one of the very few well-behaved databases we have." Contrast points on the Fly.io wiki:
- corrosion/corrosion2 — infrastructure SQLite that "routinely ballooned to tens of gigabytes and occasionally threatened service outages" per this post's closing line. tkdb's dozens-of-megs is the happy counter-case.
- flyd — deliberately avoids SQLite for authoritative orchestrator state (concepts/bolt-vs-sqlite-storage-choice — blast-radius-of-an-ad-hoc-SQL-update argument).

tkdb sits at the narrow-schema, low-write, small-data, single-team-owned, blast-radius-controlled end of the SQLite spectrum — the case JP Phillips would endorse.
Operational posture¶
- Incident interventions in over a year: 0. "The tkdb code is remarkably stable and there hasn't been an incident intervention with our token system in over a year."
- PITR recovery: seconds. Litestream-driven.
- Telemetry: OpenTelemetry + Honeycomb for traces; Prometheus metrics; permanent-retention OpenSearch audit-trail index covering every token operation.
Service-token stripping (third-party-caveat remover)¶
tkdb exports one more signing-API operation worth calling
out: strip the third-party authentication caveat off a
user's Macaroon to produce a service token. The caller must
present a valid discharging authentication token (proving they
could already have done whatever the token says); tkdb
returns the user's token with the third-party caveat and
expiration removed. The recipient then further attenuates the
token — locking it to a specific flyd instance or Fly Machine
— so exfiltration doesn't buy an attacker anything unless they
also control the bound environment. See
patterns/third-party-caveat-strip-for-service-token.
Pet Semetary uses the same pattern for secret-read tokens.
Seen in¶
- sources/2025-03-27-flyio-operationalizing-macaroons — canonical wiki source; names tkdb as the Fly.io Macaroon authority and gives a full architectural tour.
Related¶
- systems/litefs — the replication substrate.
- systems/litestream — the PITR substrate.
- systems/sqlite — the on-disk format.
- systems/petsem — sibling security service (Vault replacement).
- systems/macaroon-superfly — the open-source Macaroon implementation tkdb is built on top of.
- systems/flycast — reachability substrate.
- systems/nats — retired RPC transport.
- concepts/macaroon-token — the token primitive tkdb authenticates.
- concepts/online-stateful-token — the property that forces tkdb's existence.
- concepts/keep-hazmat-away-from-complex-code — the design principle that forces its isolation.
- patterns/isolated-token-service — the wiki pattern tkdb canonically instantiates.
- patterns/sqlite-plus-litefs-plus-litestream — the storage recipe.
- patterns/noise-over-http — the transport recipe.
- patterns/verification-cache-with-revocation-feed — the steady-state load-management pattern.
- companies/flyio.