CONCEPT Cited by 12 sources

Observability

The function of providing visibility into application performance and reliability via metrics, logs, and traces. Its core operational purpose is to lower MTTD (mean time to detect) and MTTR (mean time to repair) by making system behavior legible to humans and tools.

Why orgs build vs buy

Vendor-managed observability is the default early, but tensions emerge:

Pricing model misalignment. Vendors typically charge on ingested data volume. Costs scale with telemetry growth regardless of whether that telemetry reduces MTTD/MTTR.
Outside the feedback loop. A third-party platform leaves the infrastructure team unable to iterate on how telemetry is consumed (dashboards, alerting UX, query ergonomics) or drive cost reductions.
More data ≠ better insights. Higher cardinality and retention do not automatically translate to faster incident response; the bottleneck is usually query / authoring UX and signal quality, not data volume.

These pressures are what pushed Airbnb to own the stack end to end (Source: sources/2026-03-17-airbnb-observability-ownership-migration).

Observability ownership (spectrum)

Vendor-owned stack — instrument + ship data; consume via vendor UI.
Own the interaction layer — vendor backend, in-house dashboarding + alert authoring. Lets the platform team shape UX without running storage.
Own the full lifecycle — in-house collection, storage, query, visualization, alerting. Maximum control, highest operational cost.

Airbnb's top migration lesson (and the cheapest lever for any org) is to own the interaction layer early — see patterns/own-the-interaction-layer (captured here as part of sources/2026-03-17-airbnb-observability-ownership-migration).

Pipeline-layer concerns (collection & aggregation)

Beyond "what to expose in the UI", a production-scale metrics pipeline has to solve:

Protocol choice. OTLP (TCP, vendor-neutral, CNCF-sponsored) beats StatsD (UDP, packet loss under load) on reliability and ecosystem. Moving to OTLP also lets you drop in-pipeline StatsD→OTLP translation and unlocks features like Prometheus exponential histograms. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
Temporality. Cumulative vs. delta is a memory/accuracy trade-off at the SDK — see concepts/metric-temporality.
Cost control via streaming aggregation. Dropping per-instance labels in-transit (not in storage) is typically the cheapest 10× cost lever — see concepts/streaming-aggregation and systems/vmagent.
Centralize semantic fixes. A stateful aggregation tier is the right place to solve backend quirks (e.g. sparse-counter undercounting — patterns/zero-injection-counter) so they don't leak into every user's dashboards.
Migration choreography. Dual-write at a shared instrumentation library keeps protocol migrations low-friction — patterns/dual-write-migration.
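
The streaming-aggregation item above can be illustrated with a minimal sketch. It assumes delta-temporality counters and a hypothetical set of per-instance label names (`pod`, `instance`, `host`). It is not how vmagent or any particular aggregation tier is implemented; it only shows why dropping per-instance labels in transit shrinks the number of series that reach storage.

```python
from collections import defaultdict

# Hypothetical label names; a real pipeline chooses its own per-instance labels.
PER_INSTANCE_LABELS = frozenset({"pod", "instance", "host"})


def aggregation_key(name, labels):
    """Build a series key with the per-instance labels removed."""
    kept = tuple(sorted((k, v) for k, v in labels.items()
                        if k not in PER_INSTANCE_LABELS))
    return (name, kept)


def aggregate_deltas(samples):
    """Sum delta counter samples that collapse to the same reduced label set.

    `samples` is an iterable of (metric_name, labels_dict, delta_value).
    """
    totals = defaultdict(float)
    for name, labels, delta in samples:
        totals[aggregation_key(name, labels)] += delta
    return dict(totals)


samples = [
    ("http_requests", {"service": "checkout", "pod": "checkout-1"}, 5),
    ("http_requests", {"service": "checkout", "pod": "checkout-2"}, 7),
    ("http_requests", {"service": "search", "pod": "search-1"}, 3),
]
aggregated = aggregate_deltas(samples)
for key, value in aggregated.items():
    print(key, value)
```

Three input series collapse to two output series here. With delta temporality the merge is a plain sum; cumulative counters would first need per-series state to turn them into deltas, which is one reason the temporality choice interacts with the aggregation tier.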

Common anti-patterns

Dashboards computing averages of values that should be percentiles.
Summing total latency across requests.
Alert configs maintained as sparsely documented files with no backtesting or diffing — "fire-and-forget alerting".
Metric types inferred from naming conventions without a source of truth (breaks when names drift — see concepts/metric-type-metadata).
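
To make the first anti-pattern concrete, the sketch below compares a mean against nearest-rank percentiles on invented latency data (98 fast requests and 2 very slow ones). The numbers and the `percentile` helper are illustrative assumptions, not taken from any of the cited sources.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 98 requests at 10 ms, 2 requests at 2000 ms.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands near 50 ms, a latency that no request actually experienced, while p50 shows the typical request (10 ms) and p99 exposes the tail (2000 ms). A dashboard built on the mean would hide the slow requests an incident responder needs to see.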

Agent-assisted debugging layer

The classic observability triad (metrics / logs / traces) is necessary but not sufficient for MTTR when multiple specialized tools must be stitched together during an incident. Post-triad, orgs are building an intelligence layer above observability data that correlates signals across layers, codifies runbook knowledge, and guides engineers to safe next steps. This doesn't replace metrics/logs/traces — it composes them.

Signals that this layer is warranted:

Engineers juggle 4+ tools (dashboards, CLIs, cloud consoles, custom scripts) per incident with no unified entrypoint.
Postmortems repeatedly blame "missing data" but the data was present — just scattered.
Senior engineers dominate incident-response; juniors can't start.

Architectural requirements for this layer:

Unified data substrate across clouds/regions (see concepts/central-first-sharded-architecture).
Fine-grained access control consistent for humans and agents.
Fast iteration loop on prompts and tools without reinventing plumbing (see patterns/tool-decoupled-agent-framework).
Regression harness for non-deterministic agent behavior (see patterns/snapshot-replay-agent-evaluation and concepts/llm-as-judge).
Domain decomposition to keep each agent's tool inventory small (see patterns/specialized-agent-decomposition).

(Source: sources/2025-12-03-databricks-ai-agent-debug-databases)

Failure modes of the observability stack itself

Two failure modes show up as the observability pipeline scales to fleets with thousands of emitters (large GPU clusters, container fleets):

concepts/monitoring-paradox — the stack you built to catch failures becomes the failure. Single-threaded collectors become CPU-bound; agents fill disks; emit paths block production work. Structural answer: patterns/auto-scaling-telemetry-collector + concepts/streaming-aggregation + drop-before-block backpressure.
concepts/grey-failure — binary up/down monitoring can't see components that are slower / lossier / hotter than nominal. At synchronous-fanout scale, one grey-failing peer throttles the whole job (same math as concepts/tail-latency-at-scale). Detection requires high-cardinality correlation + proactive alerting on degradation, not threshold crossings.
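
The fan-out arithmetic behind the grey-failure item can be sketched as follows. It assumes each peer is independently degraded with the same probability `p_slow`, which is a simplification; the function name and the 1% figure are illustrative, not from the source.

```python
def prob_step_degraded(p_slow, fanout):
    """Probability that at least one of `fanout` synchronous peers is degraded,
    assuming each peer is independently degraded with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


single = prob_step_degraded(0.01, 1)
hundred = prob_step_degraded(0.01, 100)
thousand = prob_step_degraded(0.01, 1000)

for n, p in ((1, single), (100, hundred), (1000, thousand)):
    print(f"fanout={n:>4}  P(step degraded)={p:.4f}")
```

Under these assumptions a 1% per-peer degradation rate means roughly 63% of synchronous steps across 100 peers include at least one degraded peer, and nearly all steps do at 1000 peers. This is why up/down checks, which report every peer as healthy, miss the problem.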

Both are first-class concerns in HyperPod-scale ML infra (systems/aws-sagemaker-hyperpod). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
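
A minimal sketch of drop-before-block backpressure, using only the Python standard library. The `DroppingEmitter` class, its capacity, and the drain loop are hypothetical illustrations of the idea, not the implementation used by any collector named above.

```python
import queue


class DroppingEmitter:
    """Bounded telemetry buffer that drops samples instead of blocking the caller."""

    def __init__(self, capacity):
        self._buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0  # exposed so the drop rate itself can be monitored

    def emit(self, sample):
        """Enqueue without blocking; count a drop and return False if the buffer is full."""
        try:
            self._buffer.put_nowait(sample)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items):
        """Remove up to max_items samples for export (normally called by a background worker)."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return batch


emitter = DroppingEmitter(capacity=3)
results = [emitter.emit(i) for i in range(5)]
batch = emitter.drain(max_items=10)
print(results, emitter.dropped, batch)
```

The production code path calling `emit` never waits on the telemetry pipeline: when the exporter falls behind, samples are lost and counted rather than stalling the workload. Tracking `dropped` keeps the loss visible instead of silent.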

Reliability of the observability stack itself

A distinct axis of observability maturity, brought into focus by Airbnb's 2026-05-05 post: the observability stack must be more reliable than what it observes, otherwise it goes blind at exactly the wrong moment. Three specific disciplines follow:

Break circular dependencies between the observability stack and the substrates it monitors. At Airbnb: move observability off shared Kubernetes clusters (patterns/dedicated-observability-kubernetes-clusters) and off the shared Istio service mesh (patterns/custom-l7-proxy-for-telemetry-over-service-mesh).
Monitor the monitors. Dedicated Prometheus-Alertmanager HA pairs pinned to distinct nodes and AZs, with pair-level anti-affinity so any single shared-infrastructure failure loses at most one pair.
Terminate the infinite regress with a dead-man's switch. An always-firing heartbeat exits the observability stack to an external control plane (patterns/heartbeat-absence-as-alert-trigger); absence of the heartbeat pages on-call.

The design bar, stated explicitly by Airbnb: "treat monitoring as a production system whose availability must exceed that of what it observes." (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale)
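
The dead-man's-switch idea can be sketched as a small watchdog that lives outside the observability stack. The class name, the 120-second timeout, and the choice to page when no heartbeat has ever arrived are all assumptions made for illustration; the source does not describe Airbnb's external control plane at this level of detail.

```python
class DeadMansSwitch:
    """External watchdog that pages when an always-firing heartbeat stops arriving."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = None

    def record_heartbeat(self, now):
        """Called each time the heartbeat alert is received from the monitored stack."""
        self.last_heartbeat = now

    def should_page(self, now):
        """Return True when the heartbeat is overdue.

        Assumption: never having received a heartbeat is treated as absence.
        """
        if self.last_heartbeat is None:
            return True
        return (now - self.last_heartbeat) > self.timeout_s


switch = DeadMansSwitch(timeout_s=120)
never_seen = switch.should_page(now=900.0)
switch.record_heartbeat(now=1000.0)
recent = switch.should_page(now=1060.0)
overdue = switch.should_page(now=1200.0)
print(never_seen, recent, overdue)
```

The logic is inverted relative to ordinary alerting: the healthy state is a continuous stream of alerts, and silence is the trigger. Timestamps are passed in explicitly so the check stays deterministic and easy to test.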

Seen in

sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonicalises the reliability-of-the-observability-stack discipline on the wiki: three circular-dependency remedies (dedicated clusters, custom Envoy ingress tier, meta-monitoring with Dead Man's Switch) and the design bar "treat monitoring as a production system whose availability must exceed that of what it observes."
sources/2026-04-29-grafana-get-observability-in-the-terminal-for-you-and-your-agents-with-the-gcx-cli-tool — Grafana's gcx CLI launch canonicalises an agent-driven observability lifecycle — instrumentation (OpenTelemetry wiring + flow validation), alerting, SLOs, synthetic probes, Frontend / Application / Kubernetes Monitoring onboarding, and everything-as-code (pull/edit/push dashboards + alerts + SLOs + checks as local files). Canonical wiki instance of "the shape of the conversation changes" when an agent has production context: five named prompt patterns ("Why did this endpoint get slower this week?" → agent pulls traces + latency histograms; "Is my new query efficient?" → agent runs PromQL against the real backend; "Are we meeting the SLO for checkout?" → agent reads burn rate before writing a line; "This alert is noisy, fix it" → agent inspects rule + firing history + dashboards). Observability itself becomes the substrate for agentic troubleshooting of a degraded or down service.
sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — observability-stack partial-dependency canonicalised via Redpanda's hedged architecture: self-hosted data collection + storage, third-party for dashboarding + alerting. During the 2025-06-12 GCP / Cloudflare cascading outage the third-party was partially affected; the self-hosted substrate remained queryable so observability degraded rather than going blind. Canonicalises concepts/observability-stack-partial-dependency and patterns/hedged-observability-stack. Explicit economic framing: "Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with exponentially bigger cost ramifications given our scale."
sources/2026-03-17-airbnb-observability-ownership-migration — Airbnb's 5-year vendor-to-in-house migration, motivations, and the "own the interaction layer" lesson.
sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — collection + aggregation tier design (OTLP, vmagent streaming aggregation, delta temporality, zero injection).
sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks Storex: AI-agent intelligence layer above metrics/logs/CLI outputs; up to 90% investigation-time reduction; central-first sharded foundation across 3 clouds / hundreds of regions / 8 regulatory domains.
sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — SageMaker HyperPod observability: auto-scaling collectors + grey-failure detection as named answers to the monitoring-paradox failure mode on thousand-GPU training fleets.
sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — observability triad (metrics / logs / traces) as the input substrate for AWS DevOps Agent's AI-driven incident investigation on EKS. Specific sources: Amazon Managed Prometheus (metrics), Amazon CloudWatch Logs (logs), AWS X-Ray (traces). The triad composes with a K8s-API resource scan to drive telemetry-based resource discovery; AWS vendor peer to Datadog's Bits AI SRE on the same category axis (agent-assisted observability-reasoning layer, see the "Agent-assisted debugging layer" section above).
sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications — the self-built sibling to the AWS DevOps Agent post above: a reference architecture for a customer-owned agentic troubleshooting loop on EKS built atop Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless (RAG variant) or S3 Vectors + Strands Agents SDK + EKS MCP Server (agentic variant). The two posts pin the "agent-assisted debugging layer" with both self-build and AWS-managed shipping shapes on the same problem. Key design primitives introduced: patterns/allowlisted-read-only-agent-actions (read-only kubectl allowlist + RBAC) for agent-safety on production clusters, and sanitize-before-embedding as the vector-store governance boundary.
sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway — AI-tool telemetry as a Lakehouse dataset. Unity AI Gateway lands coding-agent OpenTelemetry metrics + traces into Unity-Catalog-managed Delta tables, joinable with HR / PR-velocity / capacity-planning data. Generalises the observability-ownership spectrum above to a new axis: where does AI-tool telemetry live? — APM sidecar vs Lakehouse dataset vs LLM-vendor console. See patterns/telemetry-to-lakehouse for the pattern treatment.
— Query-layer actor tagging as a database-observability primitive. PlanetScale Query Insights captures SQLCommenter-style per-query tags (actor, controller, action) on every authenticated request. Every downstream error inherits the tags — including errors raised by the storage engine that the application never sees. A debugging workflow that would normally require request-ID tracing + log joins collapses to "filter the error table by actor". Canonicalised as concepts/actor-tagged-error + patterns/actor-tagged-query-observability; search surface as concepts/query-tag-filter. The worked example debugs a concepts/check-then-act-race between Rails uniqueness validation and a MySQL unique index in minutes — "the actor tag for each batch of errors is always the same" was the load-bearing signal.
sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness — Slack's HTTP/3 edge rollout crystallises the monitor-first-migrate-second discipline: transport-migration (TCP→QUIC/UDP) created a probing gap because existing client-side black-box probers (SaaS + Slack's Prometheus Blackbox Exporter) are TCP-shaped. Slack closed the gap by open-sourcing QUIC support upstream + running an in-house integration in parallel, then unified HTTP/1.1+HTTP/2+HTTP/3 metrics in systems/grafana. Canonical first-order observability-as-migration-gate datum.
sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks — Stripe's DNS-investigation retrospective as a canonical composite-signal debugging instance. Opaque SERVFAIL error codes provided no root-cause information; Unbound's request-list depth metric localised the bottleneck as upstream-latency-driven rather than volume-driven; systems/tcpdump time-bucketed captures revealed which queries were queuing (reverse-DNS for 104.16.0.0/12); and iptables OUTPUT-chain packet counters added a packet-rate metric that confirmed saturation of the AWS VPC resolver's 1,024-pps-per-ENI cap. Canonical wiki instance of repurposing kernel primitives as observability sources when the application-level metric (DNS query rate) looks flat but retries are amplifying outbound traffic ~7×.