
SYSTEM Cited by 2 sources

OpenTelemetry

OpenTelemetry (OTel; opentelemetry.io) is the open standard for instrumenting applications with distributed traces, metrics, and logs. It is the instrumentation-side complement to an observability backend like Honeycomb.

Why it shows up on the wiki

OTel is cited in the Fly.io corpus as the single most important observability investment Fly.io made, with reversals on prior skepticism from two different authors.

From Thomas Ptacek's 2025-03-27 post on tkdb:

"Most of that is down to OpenTelemetry and Honeycomb. From the moment a request hits our API server through the moment tkdb responds to it, oTel context propagation gives us a single narrative about what's happening. I was a skeptic about oTel. It's really, really expensive. And, not to put too fine a point on it, oTel really cruds up our code. Once, I was an '80% of the value of tracing, we can get from logs and metrics' person. But I was wrong." (Source: sources/2025-03-27-flyio-operationalizing-macaroons.)

From JP Phillips's 2025-02-12 exit interview:

"Without oTel, it'd be a disaster trying to troubleshoot the system. I'd have ragequit trying." (Source: sources/2025-02-12-flyio-the-exit-interview-jp-phillips.)

Load-bearing property: context propagation

The specific OTel feature Fly.io repeatedly names is context propagation — a trace ID and span context that travels with a request across process, service, and network boundaries, so that every span emitted by every service on the request path can be stitched into a single trace tree.
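To make the mechanism concrete, here is a minimal, stdlib-only sketch of how a trace ID and span ID can ride along with a request. It assumes the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`), which is OpenTelemetry's default propagation format; the sources do not say which propagator or carrier Fly.io uses. The helper names (`new_root`, `child_of`, `inject`, `extract`) are hypothetical and are not the OpenTelemetry SDK API.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, identical for every span in one trace
    span_id: str   # 16 hex chars, unique to this span


def new_root() -> SpanContext:
    """Start a new trace at the entry point of a request."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanContext) -> SpanContext:
    """Start a child span: keep the trace ID, mint a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8))


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing headers (version 00, sampled flag 01)."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers; None if absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return SpanContext(parts[1], parts[2])


# Caller side: the entry-point service starts the trace and calls downstream.
api_span = new_root()
outgoing_headers: Dict[str, str] = {}  # hypothetical carrier for the request
inject(api_span, outgoing_headers)

# Callee side: the downstream service continues the same trace.
remote_parent = extract(outgoing_headers)
server_span = child_of(remote_parent) if remote_parent else new_root()
```

The callee never generates its own trace ID when a valid header is present, which is what lets a backend join both spans into one trace. In a real service the OpenTelemetry SDK's propagators do this work; the sketch only shows the shape of the data that crosses the boundary.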

On a tkdb-bound request, at least these three components emit spans:

  • Primary API (entry point, user-facing).
  • tkdb client library (verification / sign / revoke).
  • tkdb server (Noise handshake, SQLite query, response).

Without propagation, each service would emit records that share no identifier, so diagnosing a verification failure would mean correlating them by hand using timestamps. With propagation, the whole request lineage appears as one trace in Honeycomb.
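The stitching step can be sketched as follows. This is an illustration under stated assumptions, not Honeycomb's or Fly.io's implementation: the `Span` record, the service names, and the IDs are hypothetical, and real spans also carry timestamps and attributes. Given shared trace IDs and parent span IDs, rebuilding the request lineage is a plain tree walk that does not depend on arrival order or clocks.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    service: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace


# Hypothetical span records, arriving out of order from three services.
spans = [
    Span("tkdb-server", "t1", "c", "b"),
    Span("primary-api", "t1", "a", None),
    Span("tkdb-client", "t1", "b", "a"),
    Span("primary-api", "t2", "x", None),  # an unrelated request
]


def render_trace(spans: List[Span], trace_id: str) -> List[str]:
    """Return one trace as indented lines, each parent before its children."""
    # Index the spans of this trace by the span ID of their parent.
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        if span.trace_id == trace_id:
            children[span.parent_id].append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append("  " * depth + span.service)
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root span, whose parent is None
    return lines


tree = render_trace(spans, "t1")
```

Here `tree` lists `primary-api`, then `tkdb-client` indented once, then `tkdb-server` indented twice, and the span from trace `t2` is excluded. Without the shared IDs, the only join key left would be timestamps.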

Trade-offs Fly.io names

  • "Really, really expensive" — both in ingestion cost and infrastructure.
  • "Cruds up our code" — instrumenting every call site is invasive.
  • Counterweight: "worth the money to pay someone else to manage tracing data" (JP).
  • Net judgment: "I was wrong" (Ptacek). The 20% that tracing adds over logs and metrics is load-bearing, not a case of diminishing returns.

