
CONCEPT Cited by 7 sources

Observability

The practice of making application performance and reliability visible through metrics, logs, and traces. The core operational quality it serves: lowering MTTD (mean time to detect) and MTTR (mean time to repair) by making system behavior legible to humans and tools.

Why orgs build vs buy

Vendor-managed observability is the default early, but tensions emerge:

  • Pricing model misalignment. Vendors typically charge on ingested data volume. Costs scale with telemetry growth regardless of whether that telemetry reduces MTTD/MTTR.
  • Outside the feedback loop. A third-party platform leaves the infrastructure team unable to iterate on how telemetry is consumed (dashboards, alerting UX, query ergonomics) or drive cost reductions.
  • More data ≠ better insights. Higher cardinality and retention do not automatically translate to faster incident response; the bottleneck is usually query / authoring UX and signal quality, not data volume.

These pressures are what pushed Airbnb to own the stack end to end (Source: sources/2026-03-17-airbnb-observability-ownership-migration).

Observability ownership (spectrum)

  1. Vendor-owned stack — instrument and ship data; consume via vendor UI.
  2. Own the interaction layer — vendor backend with in-house dashboarding and alert authoring. Lets the platform team shape UX without running storage.
  3. Own the full lifecycle — in-house collection, storage, query, visualization, alerting. Maximum control, highest operational cost.

Airbnb's top migration lesson (and the cheapest lever for any org) is to own the interaction layer early — see patterns/own-the-interaction-layer (captured here as part of sources/2026-03-17-airbnb-observability-ownership-migration).

Pipeline-layer concerns (collection & aggregation)

Beyond "what to expose in the UI", a production-scale metrics pipeline has to solve:

  • Protocol choice. OTLP (vendor-neutral, CNCF-backed, runs over TCP via gRPC/HTTP) beats StatsD (UDP, prone to packet loss under load) on reliability and ecosystem. Moving to OTLP also lets you drop in-pipeline StatsD→OTLP translation and unlocks features like Prometheus exponential histograms. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
  • Temporality. Cumulative vs. delta is a memory/accuracy trade-off at the SDK — see concepts/metric-temporality.
  • Cost control via streaming aggregation. Dropping per-instance labels in-transit (not in storage) is typically the cheapest 10× cost lever — see concepts/streaming-aggregation and systems/vmagent.
  • Centralize semantic fixes. A stateful aggregation tier is the right place to solve backend quirks (e.g. sparse-counter undercounting — patterns/zero-injection-counter) so they don't leak into every user's dashboards.
  • Migration choreography. Dual-write at a shared instrumentation library keeps protocol migrations low-friction — patterns/dual-write-migration.
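The cost lever named above can be sketched concretely. This is a minimal Python illustration (not the real vmagent implementation) of in-transit streaming aggregation, assuming a simplified wire format of `(name, labels, value)` counter samples: strip the per-instance label, sum what collapses together, and forward only the merged series to storage.

```python
from collections import defaultdict

def aggregate_in_transit(samples, drop_labels=("instance",)):
    """Sum counter samples after stripping high-cardinality labels.

    `samples` is a list of (name, labels_dict, value) tuples — a
    hypothetical wire format standing in for the real pipeline.
    Dropping the per-instance label *before* storage is what shrinks
    the stored series count roughly by the fleet size.
    """
    out = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels
        ))
        out[(name, kept)] += value
    return [(name, dict(kept), v) for (name, kept), v in out.items()]

# Three instances of the same service emit the same counter:
samples = [
    ("http_requests_total", {"service": "api", "instance": "i-1"}, 10),
    ("http_requests_total", {"service": "api", "instance": "i-2"}, 7),
    ("http_requests_total", {"service": "api", "instance": "i-3"}, 3),
]
print(aggregate_in_transit(samples))
# Three per-instance series collapse into one before reaching storage.
```

The key design choice is doing this in the transport tier rather than at query time: storage never pays for the cardinality, which is why it is typically the cheapest 10× lever.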
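The sparse-counter fix mentioned above (patterns/zero-injection-counter) can also be sketched. This is an illustrative Python snippet, not the production implementation: a stateful aggregation tier remembers which counter series exist and emits an explicit 0.0 for any that were silent this interval, so downstream rate() queries see a continuous series instead of gaps that undercount.

```python
def inject_zeros(known_series, interval_samples):
    """Fill in explicit 0.0 samples for counters that emitted nothing
    this interval. `known_series` is state the aggregation tier keeps
    across intervals; series names here are illustrative.
    """
    seen = {name for name, _ in interval_samples}
    filled = list(interval_samples)
    for series in known_series:
        if series not in seen:
            filled.append((series, 0.0))  # explicit zero, not absence
    return filled

known = {"errors_total{service=api}", "errors_total{service=web}"}
this_interval = [("errors_total{service=api}", 2.0)]
print(inject_zeros(known, this_interval))
# The silent web series now carries an explicit zero sample.
```

Centralizing this in one stateful tier means every dashboard and alert downstream inherits the fix for free, instead of each author re-discovering the quirk.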

Common anti-patterns

  • Dashboards computing averages of values that should be percentiles.
  • Summing total latency across requests.
  • Alert configs maintained as sparsely documented files with no backtesting or diffing — "fire-and-forget alerting".
  • Metric types inferred from naming conventions without a source of truth (breaks when names drift — see concepts/metric-type-metadata).
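The first anti-pattern is worth a worked example. Percentiles do not average: the mean of per-host percentiles can differ wildly from the percentile over the pooled raw samples. A small Python sketch with made-up latencies (a naive nearest-rank quantile, chosen for brevity):

```python
import statistics

# Per-request latencies (ms) from two hosts; host B has a tail.
host_a = [10, 11, 12, 13, 14]
host_b = [10, 11, 12, 13, 500]

def p90(xs):
    """Naive nearest-rank 90th percentile, for illustration only."""
    xs = sorted(xs)
    return xs[int(round(0.9 * (len(xs) - 1)))]

# Anti-pattern: averaging per-host percentiles.
wrong = statistics.mean([p90(host_a), p90(host_b)])  # (14 + 500) / 2

# Correct: percentile over the pooled raw samples.
right = p90(host_a + host_b)

print(wrong, right)  # 257.0 vs 14 — the averaged number is fiction
```

The averaged figure (257 ms) describes no request that actually happened; only one request in ten exceeded 14 ms. This is why dashboards should aggregate raw distributions (or mergeable histograms), never pre-computed percentiles.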

Agent-assisted debugging layer

The classic observability triad (metrics / logs / traces) is necessary but not sufficient for MTTR when multiple specialized tools must be stitched together during an incident. Post-triad, orgs are building an intelligence layer above observability data that correlates signals across layers, codifies runbook knowledge, and guides engineers to safe next steps. This doesn't replace metrics/logs/traces — it composes them.

Signals that this layer is warranted:

  • Engineers juggle 4+ tools (dashboards, CLIs, cloud consoles, custom scripts) per incident with no unified entrypoint.
  • Postmortems repeatedly blame "missing data" but the data was present — just scattered.
  • Senior engineers dominate incident response; juniors can't get started on their own.

Architectural requirements for this layer:

(Source: sources/2025-12-03-databricks-ai-agent-debug-databases)

Failure modes of the observability stack itself

Two failure modes show up as the observability pipeline scales to fleets with thousands of emitters (large GPU clusters, container fleets):

Both are first-class concerns in HyperPod-scale ML infra (systems/aws-sagemaker-hyperpod). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

Seen in
