CONCEPT Cited by 7 sources
Observability¶
The practice of making application performance and reliability visible via metrics, logs, and traces. The core operational quality it serves: lowering MTTD (mean time to detect) and MTTR (mean time to repair) by making system behavior legible to humans and tools.
Why orgs build vs buy¶
Vendor-managed observability is the default early, but tensions emerge:
- Pricing model misalignment. Vendors typically charge on ingested data volume. Costs scale with telemetry growth regardless of whether that telemetry reduces MTTD/MTTR.
- Outside the feedback loop. A third-party platform leaves the infrastructure team unable to iterate on how telemetry is consumed (dashboards, alerting UX, query ergonomics) or drive cost reductions.
- More data ≠ better insights. Higher cardinality and retention do not automatically translate to faster incident response; the bottleneck is usually query / authoring UX and signal quality, not data volume.
These pressures are what pushed Airbnb to own the stack end to end (Source: sources/2026-03-17-airbnb-observability-ownership-migration).
Observability ownership (spectrum)¶
- Vendor-owned stack — instrument + ship data; consume via vendor UI.
- Own the interaction layer — vendor backend, in-house dashboarding and alert authoring. Lets the platform team shape UX without running storage.
- Own the full lifecycle — in-house collection, storage, query, visualization, alerting. Maximum control, highest operational cost.
Airbnb's top migration lesson (and the cheapest lever for any org) is to own the interaction layer early — see patterns/own-the-interaction-layer (captured here as part of sources/2026-03-17-airbnb-observability-ownership-migration).
Pipeline-layer concerns (collection & aggregation)¶
Beyond "what to expose in the UI", a production-scale metrics pipeline has to solve:
- Protocol choice. OTLP (TCP, vendor-neutral, CNCF-sponsored) beats StatsD (UDP, packet loss under load) on reliability and ecosystem. Moving to OTLP also lets you drop in-pipeline StatsD→OTLP translation and unlocks features like Prometheus exponential histograms. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
- Temporality. Cumulative vs. delta is a memory/accuracy trade-off at the SDK — see concepts/metric-temporality.
- Cost control via streaming aggregation. Dropping per-instance labels in-transit (not in storage) is typically the cheapest 10× cost lever — see concepts/streaming-aggregation and systems/vmagent.
- Centralize semantic fixes. A stateful aggregation tier is the right place to solve backend quirks (e.g. sparse-counter undercounting — patterns/zero-injection-counter) so they don't leak into every user's dashboards.
- Migration choreography. Dual-write at a shared instrumentation library keeps protocol migrations low-friction — patterns/dual-write-migration.
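The streaming-aggregation lever above can be sketched in a few lines: dropping a high-cardinality per-instance label in transit collapses N series into one before storage ever sees them. A minimal sketch in plain Python (the label names and the `instance` key are illustrative, not any specific vendor's schema):

```python
from collections import defaultdict

def aggregate(samples, drop_labels=("instance",)):
    """Streaming aggregation: re-key each sample without the dropped
    labels and sum the values that collapse onto the same series."""
    out = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in drop_labels))
        out[kept] += value
    return dict(out)

# Three per-instance series for the same service collapse into one.
samples = [
    ({"service": "api", "instance": "i-1"}, 5.0),
    ({"service": "api", "instance": "i-2"}, 7.0),
    ({"service": "api", "instance": "i-3"}, 2.0),
]
agg = aggregate(samples)
# agg == {(("service", "api"),): 14.0}
```

In a real pipeline this step runs stateful in an aggregation tier (e.g. vmagent's streaming aggregation), which is also why it is the natural place to centralize semantic fixes like zero injection.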
Common anti-patterns¶
- Dashboards computing averages of values that should be percentiles.
- Summing total latency across requests.
- Alert configs maintained as sparsely documented files with no backtesting or diffing — "fire-and-forget alerting".
- Metric types inferred from naming conventions without a source of truth (breaks when names drift — see concepts/metric-type-metadata).
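The first anti-pattern is easy to demonstrate numerically: the average of per-host p99s is not the fleet p99, and on skewed latency it can be badly wrong. A toy illustration with synthetic data (no real metrics backend; nearest-rank percentile for simplicity):

```python
def percentile(values, p):
    """Nearest-rank percentile, p in [0, 100]."""
    s = sorted(values)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

# Two hosts: one healthy, one with a heavy latency tail (ms).
host_a = [10] * 99 + [20]
host_b = [10] * 50 + [500] * 50

avg_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
fleet_p99 = percentile(host_a + host_b, 99)

# Averaging per-host p99s reports 255 ms; the fleet p99 is 500 ms.
```

The same arithmetic explains the second anti-pattern: percentiles (and totals across requests) do not compose by averaging or summing dashboard panels; they must be recomputed from the underlying distribution.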
Agent-assisted debugging layer¶
The classic observability triad (metrics / logs / traces) is necessary but not sufficient for MTTR when multiple specialized tools must be stitched together during an incident. Post-triad, orgs are building an intelligence layer above observability data that correlates signals across layers, codifies runbook knowledge, and guides engineers to safe next steps. This doesn't replace metrics/logs/traces — it composes them.
Signals that this layer is warranted:
- Engineers juggle 4+ tools (dashboards, CLIs, cloud consoles, custom scripts) per incident with no unified entrypoint.
- Postmortems repeatedly blame "missing data" but the data was present — just scattered.
- Senior engineers dominate incident response because juniors have no entry point into the tooling.
Architectural requirements for this layer:
- Unified data substrate across clouds/regions (see concepts/central-first-sharded-architecture).
- Fine-grained access control consistent for humans and agents.
- Fast iteration loop on prompts and tools without reinventing plumbing (see patterns/tool-decoupled-agent-framework).
- Regression harness for non-deterministic agent behavior (see patterns/snapshot-replay-agent-evaluation and concepts/llm-as-judge).
- Domain decomposition to keep each agent's tool inventory small (see patterns/specialized-agent-decomposition).
(Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
Failure modes of the observability stack itself¶
Two failure modes show up as the observability pipeline scales to fleets with thousands of emitters (large GPU clusters, container fleets):
- concepts/monitoring-paradox — the stack you built to catch failures becomes the failure: single-threaded collectors go CPU-bound, agents fill disks, emit paths block production work. Structural answer: patterns/auto-scaling-telemetry-collector + concepts/streaming-aggregation + drop-before-block backpressure.
- concepts/grey-failure — binary up/down monitoring can't see components that are slower / lossier / hotter than nominal. At synchronous-fanout scale, one grey-failing peer throttles the whole job (same math as concepts/tail-latency-at-scale). Detection requires high-cardinality correlation + proactive alerting on degradation, not threshold crossings.
Both are first-class concerns in HyperPod-scale ML infra (systems/aws-sagemaker-hyperpod). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
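The drop-before-block backpressure named above can be sketched as a bounded emit queue that sheds telemetry rather than stalling the production path. A minimal sketch (class and method names are illustrative, not from any named library):

```python
import queue

class DropBeforeBlockEmitter:
    """Bounded telemetry queue: when the exporter falls behind,
    drop the sample instead of blocking the caller."""
    def __init__(self, capacity=4):
        self.q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, sample):
        try:
            self.q.put_nowait(sample)  # never blocks production work
            return True
        except queue.Full:
            self.dropped += 1          # shed load, but count the loss
            return False

emitter = DropBeforeBlockEmitter(capacity=2)
results = [emitter.emit(i) for i in range(5)]
# First 2 enqueue; the remaining 3 are dropped; the caller never stalls.
```

The design choice is the monitoring-paradox trade in miniature: losing a telemetry sample is recoverable, a blocked production emit path is not, and the `dropped` counter itself becomes a signal that the collector tier needs to scale.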
Seen in¶
- sources/2026-03-17-airbnb-observability-ownership-migration — Airbnb's 5-year vendor-to-in-house migration, motivations, and the "own the interaction layer" lesson.
- sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — collection + aggregation tier design (OTLP, vmagent streaming aggregation, delta temporality, zero injection).
- sources/2025-12-03-databricks-ai-agent-debug-databases — Databricks Storex: AI-agent intelligence layer above metrics/logs/CLI outputs; up to 90% investigation-time reduction; central-first sharded foundation across 3 clouds / hundreds of regions / 8 regulatory domains.
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — SageMaker HyperPod observability: auto-scaling collectors + grey-failure detection as named answers to the monitoring-paradox failure mode on thousand-GPU training fleets.
- sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — observability triad (metrics / logs / traces) as the input substrate for AWS DevOps Agent's AI-driven incident investigation on EKS. Specific sources: Amazon Managed Prometheus (metrics), Amazon CloudWatch Logs (logs), AWS X-Ray (traces). The triad composes with a K8s-API resource scan to drive telemetry-based resource discovery; AWS vendor peer to Datadog's Bits AI SRE on the same category axis (agent-assisted observability-reasoning layer, see the "Agent-assisted debugging layer" section above).
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications — the self-built sibling to the AWS DevOps Agent post above: a reference architecture for a customer-owned agentic troubleshooting loop on EKS built atop Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless (RAG variant) or S3 Vectors + Strands Agents SDK + EKS MCP Server (agentic variant). The two posts pin the "agent-assisted debugging layer" with both self-build and AWS-managed shipping shapes on the same problem. Key design primitives introduced: patterns/allowlisted-read-only-agent-actions (read-only kubectl allowlist + RBAC) for agent-safety on production clusters, and sanitize-before-embedding as the vector-store governance boundary.
- sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway — AI-tool telemetry as a Lakehouse dataset. Unity AI Gateway lands coding-agent OpenTelemetry metrics + traces into Unity-Catalog-managed Delta tables, joinable with HR / PR-velocity / capacity-planning data. Generalises the observability-ownership spectrum above to a new axis: where does AI-tool telemetry live? — APM sidecar vs Lakehouse dataset vs LLM-vendor console. See patterns/telemetry-to-lakehouse for the pattern treatment.