
PATTERN Cited by 2 sources

Auto-scaling telemetry collector

Run the telemetry-collection tier (metrics / logs / traces scrapers and forwarders) as a horizontally scaling fleet whose capacity tracks the workload it observes, rather than as a fixed-size per-node process or a single-threaded aggregator. This prevents the collector from becoming the very bottleneck the observability pipeline is supposed to monitor — see concepts/monitoring-paradox.

Why single-threaded or fixed-size collectors fail

At GPU-cluster or large-fleet scale (thousands of emitters, high label cardinality per emitter):

  • A single-threaded collector caps out on CPU; scrape intervals are missed; the dashboard goes blank, but no one notices, because the metric that would have raised the alarm is exactly the metric being dropped.
  • A fixed-replica collector fleet is sized for peak and overprovisioned off-peak (cost drag) or sized for average and drops on peak (coverage drag).
  • A shared-infrastructure collector bottlenecks on the noisiest tenant and starves the rest.

Structure

  1. Collectors are a scalable tier, not a per-host sidecar fixed in concrete. Either an auto-scaled deployment (HPA / platform-native), or a sharded fleet with dynamic resharding (see Dicer-style, systems/dicer).
  2. Work is shardable. Each collector owns a disjoint slice of the emitter space (per-node, per-service, per-label-hash). No single collector is on the critical path for the whole fleet.
  3. Scale signal is intrinsic to the workload. Number of emitters, samples/sec, queue depth — not collector CPU utilization, because a CPU-pinned collector is already failing.
  4. Drop-before-block on backpressure. Collectors must degrade gracefully; a full queue drops the oldest / cheapest samples instead of blocking producers.
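Points 2–4 above can be sketched in a few lines. This is a minimal illustration, not any cited system's implementation: the modulo-hash shard assignment, the queue size, and the `headroom` parameter are all assumptions for the sketch.

```python
import hashlib
import math
from collections import deque

def owner(emitter_id: str, num_collectors: int) -> int:
    """Shardable work: each emitter hashes to exactly one collector,
    so no single collector is on the critical path for the whole fleet."""
    h = int(hashlib.sha256(emitter_id.encode()).hexdigest(), 16)
    return h % num_collectors

class DropOldestQueue:
    """Drop-before-block: a full ingest queue discards the oldest sample
    instead of blocking the producer, and counts the drops so the loss
    itself stays observable."""
    def __init__(self, maxlen: int):
        self.q = deque(maxlen=maxlen)  # deque evicts from the left when full
        self.dropped = 0

    def push(self, sample) -> None:
        if len(self.q) == self.q.maxlen:
            self.dropped += 1
        self.q.append(sample)

def desired_replicas(samples_per_sec: float, capacity_per_replica: float,
                     headroom: float = 0.2) -> int:
    """Scale signal intrinsic to the workload: size the fleet from the
    ingest rate (plus warm headroom), not from collector CPU."""
    return max(1, math.ceil(samples_per_sec * (1 + headroom) / capacity_per_replica))
```

Note the `dropped` counter: if the tier must shed load, the shedding has to be a first-class metric, or the monitoring paradox simply recurs one level down.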

Canonical example: SageMaker HyperPod observability

Instead of single-threaded collectors struggling to process metrics from thousands of GPUs, we implemented auto-scaling collectors that grow and shrink with the workload. The system automatically correlates high-cardinality metrics generated within HyperPod using algorithms designed for massive scale time series data.

(Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

The capability is pitched explicitly as the structural answer to the concepts/monitoring-paradox that foundation-model training teams had been hitting: monitoring agents filling disks, CPU-bound collectors dropping metrics, cascading training failures.

Adjacent example: Airbnb vmagent two-tier streaming aggregation

Airbnb's metrics pipeline uses systems/vmagent in a router → aggregator two-tier configuration; routers hash-shard samples to aggregator instances, each tier sized independently to the load profile. This is the same idea applied to the aggregation layer specifically — see concepts/streaming-aggregation. At 100M+ samples/sec they cited a ~10× cost reduction vs. the previous vendor. (Source: sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline)
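The router tier's job reduces to a deterministic series-to-shard mapping, so every sample of a given series lands on the same aggregator and each aggregator can aggregate locally with no cross-shard coordination. A hedged sketch follows; the label serialization and hash choice are illustrative assumptions, not vmagent's or Airbnb's actual implementation.

```python
import hashlib

def route(series_labels: dict, num_aggregators: int) -> int:
    """Router tier: hash the full, canonically ordered label set so a
    series is pinned to one aggregator shard regardless of which router
    instance handled the sample."""
    key = ",".join(f"{k}={v}" for k, v in sorted(series_labels.items()))
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % num_aggregators
```

Sorting the labels before hashing is what makes routing stable across routers: two routers seeing the same series in different label order must still pick the same shard.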

Trade-offs

  • Stateful collectors (doing aggregation, rate conversion, etc.) are harder to auto-scale than stateless ones because resharding moves in-memory counters; see patterns/state-transfer-on-reshard.
  • Over-aggressive scale-down can thrash during diurnal troughs — keep warm headroom.
  • Control-plane churn. Scaling the collector fleet generates its own events; don't let those drown the signal the tier exists to carry.
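The warm-headroom point can be illustrated with a simple scale-down damper: grow immediately, but shrink only after the desired count has stayed low for several evaluation windows. This is a hypothetical sketch; real deployments would typically express the same hysteresis via platform settings such as an HPA scale-down stabilization window.

```python
class ScaleDownDamper:
    """Hysteresis for scale-down: shrink only after `window` consecutive
    evaluations below the current replica count; scale-up is immediate."""
    def __init__(self, replicas: int, window: int = 5):
        self.replicas = replicas
        self.window = window
        self.below = 0  # consecutive evaluations wanting fewer replicas

    def step(self, desired: int) -> int:
        if desired >= self.replicas:
            self.replicas = desired  # scale up immediately
            self.below = 0
        else:
            self.below += 1
            if self.below >= self.window:  # sustained trough: safe to shrink
                self.replicas = desired
                self.below = 0
        return self.replicas
```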

Seen in
