SYSTEM Cited by 1 source

tkdb

tkdb is Fly.io's isolated Macaroon-token authority — the root-of-trust database that every security token on Fly.io's platform is verified against, and the signing service that mints new ones. "About 5000 lines of Go code that manages a SQLite database that is in turn managed by LiteFS and Litestream." (Source: sources/2025-03-27-flyio-operationalizing-macaroons.)

tkdb is the canonical wiki instance of patterns/isolated-token-service: pull the token-authority database off the primary API cluster onto isolated hardware, both to cap blast radius when the API misbehaves and to keep root secrets ("hazmat") away from the most complicated code in the platform (concepts/keep-hazmat-away-from-complex-code).

Architectural shape

  Clients (flyd, API, edge proxy, petsem, ...)
          ▼  FlyCast Anycast (→ nearest region)
  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
  │  tkdb-US      │  │  tkdb-EU      │  │  tkdb-AU      │
  │  (primary)    │◀─┤  (replica)    │  │  (replica)    │
  └───────────────┘  └───────────────┘  └───────────────┘
          │         subsecond LiteFS replication
    Litestream → Object Storage  (PITR)
  • Runtime: tkdb is a Fly App — "albeit deployed in special Fly-only isolated regions". Three regions: US (primary), EU, AU.
  • Storage substrate: SQLite, replicated by LiteFS (subsecond US→EU/AU replication plus primary failover), backed up by Litestream to object storage (PITR).
  • Reachability: FlyCast — Fly's internal Anycast service. "If you're in Singapore, you're probably going to get routed to the Australian tkdb. If Australia falls over, you'll get routed to the closest backup. The proxy that implements FlyCast is smart, as is the tkdb client library, which will do exponential backoff retry transparently."
  • At-rest encryption: "records in the database are encrypted with an injected secret." The secret is injected (not stored in the DB).

The two APIs

tkdb exposes two RPC surfaces over an HTTP/Noise channel (patterns/noise-over-http):

API           Noise pattern  Who can call                Purpose
Verification  Noise_IK       anybody (server-auth only)  verify a token; subscribe to revocation feed
Signing       Noise_KK       handful with client key     mint new tokens; revoke; strip 3p caveats
  • Noise_IK works like TLS: the client authenticates the server ("everyone needs to prove they're talking to the real tkdb") but the server doesn't require a client identity.
  • Noise_KK works like mTLS: the client also presents a pre-provisioned static key. "Across many thousands of machines, there are only a handful with the cryptographic material needed to mint a new Macaroon token."

The transport was originally RPC-over-NATS; JP Phillips's team later replaced NATS with HTTP, but "out of laziness, we kept the Noise stuff, which means the interface to tkdb is now HTTP/Noise. This is a design smell, but the security model is nice."

What's in the database

Writes are explicitly rare:

  1. HMAC root-key insertion: "we just need to record an HMAC key when Fly.io organizations are created (that is, roughly, when people sign up for the service and actually do a deploy)." One root key per Fly.io org ("you don't share keys with your neighbors").
  2. Revocation list — the blacklist table, keyed on Macaroon nonce:
    CREATE TABLE IF NOT EXISTS blacklist (
      nonce          BLOB NOT NULL UNIQUE,
      required_until DATETIME,
      created_at     DATETIME DEFAULT CURRENT_TIMESTAMP
    );
    

Result: total DB size ≈ "a couple dozen megs", and "most of that data isn't real." (Attenuation happens offline — all per-use caveats get hashed onto the chain without touching tkdb, which is the load-bearing reason the DB stays small.)
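
Why attenuation never touches tkdb follows from the standard Macaroon construction, sketched below in Go. This is a toy model under stated assumptions: the Token type, the string caveats, and the choice of HMAC-SHA256 are illustrative and are not taken from Fly.io's token format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// Token is a toy Macaroon: a nonce, ordered caveats, and a tail signature.
// The field layout is an assumption made for illustration.
type Token struct {
	Nonce   []byte
	Caveats []string
	Sig     []byte
}

func mac(key, msg []byte) []byte {
	h := hmac.New(sha256.New, key)
	h.Write(msg)
	return h.Sum(nil)
}

// Mint needs the root key, so only the token authority can do it.
func Mint(rootKey, nonce []byte) Token {
	return Token{Nonce: nonce, Sig: mac(rootKey, nonce)}
}

// Attenuate needs only the token itself: the old signature becomes the
// HMAC key for the new caveat. No root key and no database write.
func Attenuate(t Token, caveat string) Token {
	caveats := append(append([]string{}, t.Caveats...), caveat)
	return Token{Nonce: t.Nonce, Caveats: caveats, Sig: mac(t.Sig, []byte(caveat))}
}

// Verify replays the whole chain starting from the root key.
func Verify(rootKey []byte, t Token) bool {
	sig := mac(rootKey, t.Nonce)
	for _, c := range t.Caveats {
		sig = mac(sig, []byte(c))
	}
	return hmac.Equal(sig, t.Sig)
}

func main() {
	root := []byte("placeholder per-org root key")
	parent := Mint(root, []byte("nonce-1"))
	child := Attenuate(parent, "app = example") // done by the holder, offline

	fmt.Println(Verify(root, child)) // true
	child.Caveats[0] = "app = other" // tampering breaks the chain
	fmt.Println(Verify(root, child)) // false
}
```

Only Mint would correspond to a database write (the per-org root key); every Attenuate call is pure computation on the holder's side.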

Why tkdb is not colocated with the primary API

Two stated reasons:

  1. Reliability / blast-radius. "Far and away the most common failure mode of an outage on our platform is 'deploys are broken', and those failures are usually caused by API instability. It would not be OK if 'deploys are broken' transitively meant 'deployed apps can't use security tokens'."
  2. Security. "Root secrets for Macaroon tokens are hazmat, and a basic rule of thumb in secure design is: keep hazmat away from complicated code." Canonical wiki statement of concepts/keep-hazmat-away-from-complex-code.

Caching: offload most verification away from tkdb

Verification-cache hit rate on the client side is "over 98%" thanks to Macaroons' chained-HMAC property (verifying a parent verifies any descendant). Revocation is propagated via a subscription feed the verification API exports — clients poll, prune caches on revocation notifications, and dump the entire cache if disconnected past a threshold, forcing verifications back through tkdb (fail-closed).

See patterns/verification-cache-with-revocation-feed.

Why SQLite works here (where it doesn't elsewhere at Fly)

"We use SQLite for a lot of our infrastructure, and this is one of the very few well-behaved databases we have." Contrast points on the Fly.io wiki:

  • corrosion/corrosion2 — infrastructure SQLite that "routinely ballooned to tens of gigabytes and occasionally threatened service outages" per this post's closing line. tkdb's dozens-of-megs is the happy counter-case.
  • flyd — deliberately avoids SQLite for authoritative orchestrator state (concepts/bolt-vs-sqlite-storage-choice — blast-radius-of-an-ad-hoc-SQL-update argument).
  • tkdb sits at the narrow-schema, low-write, small-data, single-team-owned, blast-radius-controlled end of the SQLite spectrum — the case JP Phillips would endorse.

Operational posture

  • Incident interventions in over a year: 0. "The tkdb code is remarkably stable and there hasn't been an incident intervention with our token system in over a year."
  • PITR recovery: seconds. Litestream-driven.
  • Telemetry: OpenTelemetry + Honeycomb for traces; Prometheus metrics; permanent-retention OpenSearch audit-trail index covering every token operation.

Service-token stripping (third-party-caveat remover)

tkdb exports one more signing-API operation worth calling out: strip the third-party authentication caveat off a user's Macaroon to produce a service token. The caller must present a valid discharging authentication token (proving they could already have done whatever the token says); tkdb returns the user's token with the third-party caveat and expiration removed. The recipient then further attenuates the token — locking it to a specific flyd instance or Fly Machine — so exfiltration doesn't buy an attacker anything unless they also control the bound environment. See patterns/third-party-caveat-strip-for-service-token.

Pet Semetary uses the same pattern for secret-read tokens.
