

Monitoring paradox

Monitoring paradox: the observability layer deployed to catch infrastructure problems becomes an infrastructure problem. Werner Vogels's phrasing:

The very system designed to prevent problems becomes the source of problems itself.

Removing friction from Amazon SageMaker AI development, 2025-08-06 (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

Typical manifestations

  • Single-threaded collectors that hit CPU limits under high-cardinality workloads and fall behind. Metrics drop, and you don't notice the drop because the dashboard you'd check is built from the metrics you no longer have.
  • Log agents filling disks, causing the very host failure they were installed to help diagnose.
  • Log/metric pipelines under backpressure degrading production path performance (e.g., synchronous emit in hot paths, full queues blocking workers).
  • Dashboard/alert storms that overload the alerting pipeline during real incidents, leading to missed or delayed alerts on the actual cause.
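The third manifestation is easy to reproduce in miniature. A minimal sketch (all names illustrative): a synchronous emit into a full, bounded telemetry queue stalls the request handler for as long as the pipeline stays backed up — the worker spends its time on telemetry, not work.

```python
import queue
import time

pipeline = queue.Queue(maxsize=1)   # a telemetry pipeline under backpressure
pipeline.put("older sample")        # the queue is already full

def handle_request():
    # ... production work happens here ...
    # then a synchronous emit sits in the hot path:
    pipeline.put("latency sample", timeout=0.2)  # blocks until space frees up

start = time.monotonic()
try:
    handle_request()          # nothing drains the queue, so this stalls
except queue.Full:
    pass                      # after the timeout the worker finally gives up
stalled = time.monotonic() - start
# the handler just spent ~0.2 s on telemetry instead of requests
```

With no timeout at all, the handler would block indefinitely — a total production outage caused entirely by the observability path.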

At SageMaker HyperPod's scale — "hundreds or thousands of GPUs" — the disk-filling form is load-bearing: "monitoring agents fill up disk space, causing the very training failures they're meant to prevent." (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

Structural answers

  1. Scale the collection tier with the workload. The collector is not a fixed-size process; it is a fleet whose capacity tracks the workload it observes. See patterns/auto-scaling-telemetry-collector. HyperPod's observability capability implements this. (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
  2. Streaming aggregation before storage. Push label-cardinality reduction into a stateful in-transit tier so the storage backend never sees per-instance cardinality. See concepts/streaming-aggregation and systems/vmagent.
  3. Rate-limit and drop on backpressure at the SDK. Monitoring code paths should never block production request handling; prefer dropping samples over blocking emits.
  4. Monitor the monitoring. Metrics on collector CPU, queue depth, drop rate, and disk headroom should be first-class — ideally emitted through a different pipeline from the one they're monitoring, to avoid circular dependence.
  5. Quotas per tenant / workload. A single noisy workload shouldn't fill the shared collection tier for everyone else.
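Answers 3 and 4 compose naturally: a non-blocking emit that drops on a full queue, with the drop count kept as a first-class metric so the loss is itself observable. A minimal sketch, assuming an in-process bounded queue (names are illustrative):

```python
import queue
import threading

class TelemetryEmitter:
    """Sketch of drop-over-block emission: emit() never blocks the
    caller, and every dropped sample increments a counter that should
    be exported through a separate pipeline (answer 4)."""

    def __init__(self, maxsize=1024):
        self._queue = queue.Queue(maxsize=maxsize)
        self._dropped = 0
        self._lock = threading.Lock()

    def emit(self, sample):
        try:
            self._queue.put_nowait(sample)  # never blocks production code
            return True
        except queue.Full:
            with self._lock:
                self._dropped += 1          # drop, but make the drop visible
            return False

    @property
    def dropped(self):
        return self._dropped

emitter = TelemetryEmitter(maxsize=2)
results = [emitter.emit({"latency_ms": i}) for i in range(5)]
# queue holds 2 samples; the other 3 are dropped rather than blocked on
```

The design choice is deliberate: losing a telemetry sample is recoverable, while stalling a request handler is not.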
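Answer 2 — streaming aggregation before storage — amounts to stripping high-cardinality labels in transit and merging the resulting series, so the storage backend only ever sees the reduced set. A toy sketch, assuming sum is the right merge for the metric (the label names are illustrative):

```python
from collections import defaultdict

def aggregate(samples, drop_labels=("instance",)):
    """Collapse per-instance series into coarser series by dropping
    high-cardinality labels and summing values for identical keys."""
    out = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in drop_labels))
        out[kept] += value
    return dict(out)

samples = [
    ({"job": "api", "instance": "10.0.0.1"}, 3.0),
    ({"job": "api", "instance": "10.0.0.2"}, 4.0),
    ({"job": "db",  "instance": "10.0.0.3"}, 1.0),
]
reduced = aggregate(samples)
# 3 per-instance series collapse to 2 per-job series
```

In a real deployment this runs as a stateful in-transit tier (the vmagent pattern referenced above), not inside the application.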
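Answer 5 is classically implemented as a per-tenant token bucket at the ingestion edge: each tenant refills its own budget, so a noisy workload exhausts only its own tokens. A minimal sketch (rates and the class name are illustrative, not from the source):

```python
import time

class TenantQuota:
    """Token bucket guarding the shared collection tier for one tenant."""

    def __init__(self, rate_per_s, burst):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def admit(self, n=1):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False          # over quota: this tenant's samples are shed

q = TenantQuota(rate_per_s=10, burst=5)
admitted = sum(q.admit() for _ in range(20))
# roughly the burst is admitted; the rest of the spike is rejected
```

The ingestion edge keeps one bucket per tenant key (team, job, cluster), so shedding is scoped to whoever caused the overload.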

The deeper pattern

The monitoring paradox is a specific case of the general "the subsystem you built to keep things safe is its own single point of failure" trap — same shape as a control plane that stalls the data plane, or a security gateway that DDoSes itself, or a rate limiter that runs out of memory. The remedy is usually the same shape too: capacity must scale with load, not be pre-provisioned; blast radius must be bounded; and the safety system must not sit synchronously in the hot path.

Seen in
