Observability traffic volume asymmetry

Definition

Observability traffic volume asymmetry is the architectural property that at scale, an org's observability telemetry traffic (metrics scrapes, remote-writes, logs, traces) is orders of magnitude larger than its business traffic — and has qualitatively different demands on the networking layer.

Canonical datum (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale):

"Observability data is uniquely high-volume. At Airbnb's scale, we send orders of magnitude more observability traffic than business traffic, which makes networking a key foundation of our observability stack."

Why the asymmetry exists

  • Every service emits telemetry continuously, while only a subset handle business requests — and business traffic follows a user-driven cadence (bursty, non-uniform) rather than a constant one.
  • Metrics cardinality explodes with fleet size. Thousands of instances across ~1,000 services × hundreds of time series per instance ÷ a 15-second scrape interval = a samples-per-second rate that's decoupled from customer-request rate.
  • Telemetry is persistent background load, not event-driven traffic. When business traffic dips overnight, observability traffic stays flat or rises (batch jobs, cron workflows, ops scripts).
  • Retention amplifies write volume. Metrics pipelines typically ingest 100% even though only a small fraction of data is ever queried. Business traffic typically has a much tighter read-to-write ratio.
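
The cardinality arithmetic above can be made concrete with a back-of-envelope sketch. The fleet numbers here are illustrative assumptions, not Airbnb's disclosed figures:

```python
# Back-of-envelope metrics ingest rate, independent of request rate.
# All fleet numbers are illustrative assumptions.
INSTANCES = 10_000          # assumed total instance count across ~1,000 services
SERIES_PER_INSTANCE = 500   # "hundreds of time series each"
SCRAPE_INTERVAL_S = 15      # common Prometheus default

samples_per_second = INSTANCES * SERIES_PER_INSTANCE / SCRAPE_INTERVAL_S
print(f"{samples_per_second:,.0f} samples/s")  # prints "333,333 samples/s"

# Note that no term above contains customer-request rate: ingest volume is a
# function of fleet size and scrape cadence alone, which is why it stays flat
# (or rises) when business traffic dips.
```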

Three networking-layer consequences

The asymmetry is what forces a separate networking tier for telemetry, rather than carrying it through a shared service mesh. Three distinct failures follow from running telemetry and business traffic through the same mesh:

1. Circular dependency

If the service mesh's data-plane metrics flow through the same data plane, a mesh failure breaks the signal for diagnosing the mesh failure. Airbnb names this:

"We couldn't rely on the same data plane for monitoring product and infrastructure applications as for business traffic. That would create a circular dependency — metrics for the data plane would depend on that same data plane to be delivered."

See concepts/circular-dependency.

2. Congestion-induced blindness

At scale, telemetry volume grows faster than business volume. The shared mesh becomes a contention point where telemetry scrape failures correlate with high business traffic — i.e., your monitoring loses coverage exactly when the system is most stressed. Per the Airbnb post:

"As usage grew, congestion could make metrics unavailable, eroding critical debuggability for both platform engineers and product developers."

3. Telemetry spikes as a noisy-neighbour to business traffic

The reverse direction is equally real. A telemetry-side incident (metric storm, bad dashboard, post-deploy sample flood) can consume shared mesh capacity and degrade the product. Per the Airbnb post:

"Worse, telemetry spikes could also consume shared capacity and degrade or disrupt application traffic, directly impacting Airbnb.com availability."

This is the noisy-neighbour failure mode, but where the observability stack is the noisy neighbour — an often-overlooked direction.

Architectural response: separate the networking tier

Airbnb's response was to build a custom L7 Envoy tier specifically for observability traffic — independent of the Istio mesh that carries business traffic. Owning the telemetry networking layer let the team add:

  • Strict prioritisation (telemetry can be de-prioritised during business-traffic spikes, or protected from them, depending on which failure mode is being guarded against)
  • Isolation from business-traffic contention
  • Custom routing for telemetry use-cases (mirroring, vendor fan-out, per-tenant ACLs) that a general-purpose mesh isn't tuned for
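
A minimal sketch of what "strict prioritisation" can look like inside a dedicated telemetry tier — priority levels, batch names, and capacity units are all hypothetical, not Airbnb's implementation:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Batch:
    priority: int                     # 0 = highest (e.g. mesh health metrics)
    cost: int = field(compare=False)  # capacity units the batch consumes
    name: str = field(compare=False)

def admit(batches, capacity):
    """Admit batches in strict priority order until capacity runs out;
    the remainder is shed. The telemetry tier makes this call itself,
    rather than contending with business traffic in a shared mesh."""
    admitted, shed = [], []
    for b in sorted(batches):         # lowest priority value first
        if b.cost <= capacity:
            capacity -= b.cost
            admitted.append(b.name)
        else:
            shed.append(b.name)
    return admitted, shed

admitted, shed = admit(
    [Batch(0, 3, "mesh-health"),
     Batch(2, 5, "debug-traces"),
     Batch(1, 4, "app-metrics")],
    capacity=8,
)
# admitted == ["mesh-health", "app-metrics"]; shed == ["debug-traces"]
```

The point of the sketch is ownership: once telemetry has its own tier, shedding policy is a local decision instead of emergent mesh congestion.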

Why service meshes aren't designed for this

The Airbnb post is explicit: "Our service mesh was originally designed around business workloads, not a world where every service continuously pushes telemetry to a central store."

Service meshes optimise for L7 features useful to request/response traffic: retries, circuit breaking, mTLS, distributed tracing. The telemetry push pattern — continuous, high-volume, batched, one-directional — is a different workload shape that benefits from a purpose-built networking tier.
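
That push workload shape can be sketched as a batching, fire-and-forget forwarder. This is a toy illustration assuming an abstract `send_fn` transport; none of the names come from the Airbnb post:

```python
import time

class PushBatcher:
    """Continuous, batched, one-directional telemetry push: samples
    accumulate and flush on a size or age threshold. There is no
    request/response round-trip for the mesh to add value to."""

    def __init__(self, send_fn, max_batch=3, max_age_s=10.0):
        self.send_fn = send_fn
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buf = []
        self.oldest = None

    def add(self, sample):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buf.append(sample)
        if (len(self.buf) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buf:
            self.send_fn(self.buf)          # fire-and-forget, one-directional
            self.buf, self.oldest = [], None

sent = []
b = PushBatcher(sent.append, max_batch=3)
for s in ("cpu=0.4", "mem=0.7", "rps=120"):
    b.add(s)
# sent == [["cpu=0.4", "mem=0.7", "rps=120"]]
```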

When the asymmetry doesn't force a split

Small orgs can ignore this. A startup with 20 services running through a single mesh rarely sees telemetry volume overwhelm business volume. The asymmetry becomes load-bearing when:

  • Service count passes O(100)
  • Metric cardinality per service is high
  • Telemetry SLOs approach or exceed business-traffic SLOs
  • Business-traffic dips leave telemetry as the dominant load (i.e., there's no quiet period)
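
The checklist above can be encoded as a rough heuristic. The specific thresholds are judgment calls from this note, not published rules:

```python
def telemetry_split_warranted(service_count, high_cardinality,
                              telemetry_slo_ge_business, no_quiet_period):
    """Rough heuristic for when a dedicated telemetry networking tier
    starts to pay for itself. Thresholds are illustrative, not rules."""
    signals = [
        service_count >= 100,        # service count passes O(100)
        high_cardinality,            # high per-service metric cardinality
        telemetry_slo_ge_business,   # telemetry SLOs rival business SLOs
        no_quiet_period,             # telemetry dominates even at trough
    ]
    return sum(signals) >= 2         # any two signals: worth evaluating

telemetry_split_warranted(20, False, False, False)    # startup case: False
telemetry_split_warranted(500, True, False, False)    # large fleet: True
```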

Before that, Istio / Linkerd / whatever-mesh is fine for both. After that, the split is worth the engineering investment.

Caveats

  • "Orders of magnitude" is not quantified in the Airbnb post. The ratio between Airbnb's telemetry and business traffic is not disclosed.
  • Some of the asymmetry can be reduced at the source — streaming aggregation (vmagent), cardinality limits, sampling. These are complementary to, not substitutes for, the networking-layer split.
  • Splitting the networking layer is not free. You now run a second proxy tier with its own operational burden.

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's "orders of magnitude more observability traffic than business traffic" framing is the anchor; the three failure modes (circular dependency, congestion-induced blindness, telemetry-as-noisy-neighbour) are all enumerated in the post.