AIRBNB 2026-04-16 Tier 2

Airbnb — Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Summary

Companion piece to Airbnb's in-house metrics migration (see sources/2026-03-17-airbnb-observability-ownership-migration): this post covers the collection + aggregation tier that feeds the Prometheus backend. Airbnb migrated from a StatsD/Veneur pipeline to an OTLP-first collection path via the OpenTelemetry Collector, then replaced Veneur's cost-saving aggregation with a sharded two-tier vmagent streaming aggregator scaling to 100M+ samples/sec. Along the way they hit two production-defining problems: memory pressure from high-cardinality OTLP cumulative metrics (fixed by delta temporality for top emitters) and systematic undercounting of sparse counters by PromQL rate() (fixed by a transparent zero-injection tweak inside the aggregator).

Key takeaways

  • Frontload data collection in a monitoring migration. Get all metrics into the new pipeline first so downstream assets (dashboards, alerts) can be validated against real data; this also surfaces scale bottlenecks early.
  • Dual-write during protocol migration. A shared platform metrics library (~40% of services) was taught to emit both StatsD and OTLP, yielding broad migration progress with low per-team friction. See patterns/dual-write-migration.
  • OTLP wins over StatsD on every axis except legacy fit. JVM profiling showed metrics CPU dropped from 10% → <1% of samples after the switch; OTLP is TCP-based (no UDP packet loss), removes an in-Collector StatsD→OTLP translation hop, and unlocks Prometheus-native exponential histograms with full fidelity.
  • High-cardinality OTLP emitters regress under cumulative temporality. Services emitting 10K+ samples/sec per instance saw memory pressure, increased GC, heap growth — because cumulative temporality requires retaining full state per metric-label combination between exports. Switching those services to delta temporality (AggregationTemporalitySelector.deltaPreferred()) fixed it; the rest of the fleet stayed on cumulative. Trade-off: delta exposes export failures as data gaps, cumulative exposes them as jumps between points. See concepts/metric-temporality.
  • Evaluated and rejected several aggregation options. Veneur (rewrite needed for Prometheus, ongoing fork maintenance), Prometheus recording rules (require storing raw data first, defeats cost goal), m3aggregator (architecture too complex), OTel Collector (no metric aggregation — proposal), Vector (no built-in scaling; Rust not widely adopted internally).
  • Chose vmagent with a two-tier architecture: routers + aggregators. Stateless routers consistent-hash on all labels except those being aggregated away (e.g., pod, host) to route a metric's samples to a single aggregator shard. Aggregators (StatefulSet — stable network identity) keep running totals in memory. Routers get a static list of aggregator hostnames on the command line, avoiding a service-discovery dependency. See systems/vmagent and concepts/streaming-aggregation.
  • vmagent chosen over Veneur-rewrite etc. because: native Prometheus streaming-aggregation support, horizontal sharding, ~10K-LOC understandable codebase, friendly docs. Airbnb added native-histogram support and Mimir-style multitenancy internally; contributed generic fixes upstream (VictoriaMetrics PRs #5931, #5938, #5990).
  • Scale outcome: hundreds of aggregators, 100M+ samples/sec per cluster, ~10× cost reduction. The centralized aggregation tier also became a convenient metric-level control point — drop bad metrics from a rollout, selectively dual-emit raw (pre-aggregation) metrics on demand.
  • rate() silently undercounts sparse counters. In Prometheus, a counter is cumulative from an implicit zero, but creation and the first increment happen atomically — the series only exists once incremented, so the zero baseline is never observed. If the counter is reset (pod restart) before a second increment, rate() sees only one sample and the first increment is lost. Airbnb generates many low-rate, high-dimensional counters (e.g. requests per currency × user × region) where most series increment only a handful of times a day — so this edge case is the common case, not rare.
  • Solution: zero injection inside the aggregator. When an aggregated counter is flushed for the first time, emit a zero instead of the running total (with a delayed timestamp so it doesn't collide with prior samples). That seeds Prometheus with the zero baseline it assumes, so rate() can derive the first delta. Cost: first increment lags by one flush interval. Benefit: systematic, transparent to every user, no per-callsite changes, no PromQL hacks. See patterns/zero-injection-counter.
  • Rejected “obvious” fixes for the sparse-counter problem: pre-init all counters (impossible — can't enumerate label combinations), switch to gauges (against Prometheus conventions, same storage internally), ask teams to pad queries with PromQL hacks (pushes complexity to every dashboard owner).
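The zero-injection takeaway above can be sketched in a few lines. This is a minimal Python illustration of the technique, not vmagent's actual Go implementation; class and field names are hypothetical, and the timestamp handling is simplified (the real aggregator offsets the zero sample's timestamp to avoid colliding with prior samples):

```python
class ZeroInjectingAggregator:
    """Sketch of an aggregation shard that seeds each new counter series
    with a zero sample, so downstream rate() can derive the first delta."""

    def __init__(self):
        self.totals = {}   # series key -> running total (cumulative)
        self.seen = set()  # series keys already flushed at least once

    def ingest(self, series_key, value):
        self.totals[series_key] = self.totals.get(series_key, 0) + value

    def flush(self, now):
        out = []
        for key, total in self.totals.items():
            if key not in self.seen:
                # First flush for this series: emit a zero instead of the
                # running total. The real total goes out on the next flush,
                # so the first increment lags by one flush interval.
                out.append((key, 0, now))
                self.seen.add(key)
            else:
                out.append((key, total, now))
        return out
```

Because the aggregator keeps running totals anyway, the only extra state per series is one membership bit — which is why the fix can be applied transparently to every counter with no per-callsite changes.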

Architecture numbers

  • Before: StatsD libs + Veneur sidecar → vendor. UDP-based; Veneur fork for per-instance label aggregation.
  • After: app → OTLP (preferred) / Prometheus (OSS) / StatsD (legacy fallback) → OTel Collector → vmagent routers (stateless, consistent-hash shard) → vmagent aggregators (stateful, running totals + zero injection) → Prometheus-based storage.
  • Scale: 100M+ samples/sec/cluster, hundreds of aggregators.
  • Perf win: metrics CPU 10% → <1% of JVM samples after OTLP switch.
  • Cost win: order-of-magnitude reduction from the aggregation tier.
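The router tier's key invariant is that every sample of a given post-aggregation series must reach the same aggregator shard. A minimal sketch of that routing, using rendezvous (highest-random-weight) hashing as one form of consistent hashing — vmagent's actual scheme may differ, and the `AGGREGATED_AWAY` label set here is a hypothetical example:

```python
import hashlib

# Hypothetical set of per-instance labels that aggregation removes.
AGGREGATED_AWAY = {"pod", "host"}

def series_key(labels):
    """Serialize every label except those being aggregated away, so all
    samples of the same post-aggregation series share one routing key."""
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items())
                    if k not in AGGREGATED_AWAY)

def shard_for(labels, shards):
    """Rendezvous hashing: score each shard by hashing key|shard and pick
    the max. Adding or removing a shard only remaps the keys it owned."""
    key = series_key(labels)
    return max(shards,
               key=lambda s: hashlib.sha256(f"{key}|{s}".encode()).digest())
```

Since the routers only need the shard list (the static hostnames passed on the command line) and a hash function, they stay stateless and need no service discovery.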

Design decisions worth remembering

  1. Move to a vendor-neutral protocol (OTLP) before optimizing anything else. Keeps future backend swaps cheap.
  2. Prefer TCP/OTLP over UDP/StatsD for anything with backpressure sensitivity — UDP packet loss is silent.
  3. Don't let cumulative temporality be a default for extreme-cardinality emitters. Memory cost is linear in active series.
  4. Centralize transformations at the aggregator. Once you've built a stateful aggregation tier, it's the right place for: drop rules, conditional raw-metric passthrough, type-based tweaks like zero injection. Ad-hoc fixes at the edge (per callsite / per dashboard) cost more in the long run.
  5. Solve semantic mismatches (rate() on sparse counters) in the pipeline, not in users' queries. Observability UX shouldn't leak backend quirks.
  6. Build on a small, readable OSS codebase (vmagent, ~10K LOC) so the team can fork, patch, and contribute back without getting stuck.
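Decision 3 (memory linear in active series under cumulative temporality) comes down to what the exporter must retain between exports. A toy Python sketch of the two behaviors — the OTel SDK's real implementation differs, but the state-retention contrast is the point:

```python
class CumulativeExporter:
    """Cumulative temporality: a running total per series is kept forever,
    so memory grows linearly with active metric-label combinations."""
    def __init__(self):
        self.state = {}
    def record(self, series, value):
        self.state[series] = self.state.get(series, 0) + value
    def export(self):
        return dict(self.state)  # state retained across exports

class DeltaExporter:
    """Delta temporality: accumulate only since the last export, then
    reset — memory is bounded by series active within one interval."""
    def __init__(self):
        self.state = {}
    def record(self, series, value):
        self.state[series] = self.state.get(series, 0) + value
    def export(self):
        out, self.state = self.state, {}
        return out
```

The same contrast explains the failure modes noted in the caveats: if a delta export is dropped, its counts are gone (a gap), whereas a dropped cumulative export merely delays the running total (a jump at the next successful export).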

Caveats / open items

  • Zero injection adds a one-flush-interval lag to the first visible increment of any counter — acceptable given the alternative.
  • Delta temporality makes failures show as gaps rather than jumps — explicitly accepted for high-volume services; needs to be considered when writing alerts over those metrics.
  • vmagent fork carries some Airbnb-specific patches (native histograms, Mimir-style multitenancy); only generic fixes went upstream.

Raw article

raw/airbnb/2026-04-16-moving-a-large-scale-metrics-pipeline-from-statsd-to-opentel-e2462f94.md
