Observability traffic volume asymmetry

Definition

Observability traffic volume asymmetry is the architectural property that at scale, an org's observability telemetry traffic (metrics scrapes, remote-writes, logs, traces) is orders of magnitude larger than its business traffic — and has qualitatively different demands on the networking layer.

Canonical datum (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale):

"Observability data is uniquely high-volume. At Airbnb's scale, we send orders of magnitude more observability traffic than business traffic, which makes networking a key foundation of our observability stack."

Why the asymmetry exists

  • Every service emits telemetry continuously, while only a subset handle business requests — and business traffic follows a user-driven cadence (bursty, non-uniform) rather than a constant one.
  • Metrics cardinality explodes with fleet size. Thousands of instances across ~1,000 services × hundreds of time series per instance ÷ a 15-second scrape interval = a samples-per-second rate that's decoupled from customer-request rate.
  • Telemetry is persistent background load, not event-driven traffic. When business traffic dips overnight, observability traffic stays flat or rises (batch jobs, cron workflows, ops scripts).
  • Retention amplifies write volume. Metrics pipelines typically ingest 100% even though only a small fraction of data is ever queried. Business traffic typically has a much tighter read-to-write ratio.
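
The cardinality arithmetic above can be made concrete with a back-of-envelope sketch. The fleet numbers here are illustrative assumptions, not Airbnb's disclosed figures:

```python
# Back-of-envelope metrics ingest rate, independent of request rate.
# All fleet numbers are illustrative assumptions.
INSTANCES = 10_000          # assumed total instance count across ~1,000 services
SERIES_PER_INSTANCE = 500   # "hundreds of time series each"
SCRAPE_INTERVAL_S = 15      # common Prometheus default

samples_per_second = INSTANCES * SERIES_PER_INSTANCE / SCRAPE_INTERVAL_S
print(f"{samples_per_second:,.0f} samples/s")  # prints "333,333 samples/s"

# Note that no term above contains customer-request rate: ingest volume is a
# function of fleet size and scrape cadence alone, which is why it stays flat
# (or rises) when business traffic dips.
```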

Three networking-layer consequences

The asymmetry is what forces a separate networking tier for telemetry, rather than carrying it through a shared service mesh. Three distinct failures follow from running telemetry and business traffic through the same mesh:

1. Circular dependency

If the service mesh's data-plane metrics flow through the same data plane, a mesh failure breaks the signal for diagnosing the mesh failure. Airbnb names this:

"We couldn't rely on the same data plane for monitoring product and infrastructure applications as for business traffic. That would create a circular dependency — metrics for the data plane would depend on that same data plane to be delivered."

See concepts/circular-dependency.

2. Congestion-induced blindness

At scale, telemetry volume grows faster than business volume. The shared mesh becomes a contention point where telemetry scrape failures correlate with high business traffic — i.e., your monitoring loses coverage exactly when the system is most stressed. Per the Airbnb post:

"As usage grew, congestion could make metrics unavailable, eroding critical debuggability for both platform engineers and product developers."

3. Telemetry spikes as a noisy-neighbour to business traffic

The reverse direction is equally real. A telemetry-side incident (metric storm, bad dashboard, post-deploy sample flood) can consume shared mesh capacity and degrade the product. Per the Airbnb post:

"Worse, telemetry spikes could also consume shared capacity and degrade or disrupt application traffic, directly impacting Airbnb.com availability."

This is the noisy-neighbour failure mode, but where the observability stack is the noisy neighbour — an often-overlooked direction.

Architectural response: separate the networking tier

Airbnb's response was to build a custom L7 Envoy tier specifically for observability traffic — independent of the Istio mesh that carries business traffic. Owning the telemetry networking layer let the team add:

  • Strict prioritisation (telemetry can be de-prioritised during business-traffic spikes, or protected from them, depending on which failure mode is being guarded against)
  • Isolation from business-traffic contention
  • Custom routing for telemetry use-cases (mirroring, vendor fan-out, per-tenant ACLs) that a general-purpose mesh isn't tuned for
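
A minimal sketch of what "strict prioritisation" can look like inside a dedicated telemetry tier — priority levels, batch names, and capacity units are all hypothetical, not Airbnb's implementation:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Batch:
    priority: int                     # 0 = highest (e.g. mesh health metrics)
    cost: int = field(compare=False)  # capacity units the batch consumes
    name: str = field(compare=False)

def admit(batches, capacity):
    """Admit batches in strict priority order until capacity runs out;
    the remainder is shed. The telemetry tier makes this call itself,
    rather than contending with business traffic in a shared mesh."""
    admitted, shed = [], []
    for b in sorted(batches):         # lowest priority value first
        if b.cost <= capacity:
            capacity -= b.cost
            admitted.append(b.name)
        else:
            shed.append(b.name)
    return admitted, shed

admitted, shed = admit(
    [Batch(0, 3, "mesh-health"),
     Batch(2, 5, "debug-traces"),
     Batch(1, 4, "app-metrics")],
    capacity=8,
)
# admitted == ["mesh-health", "app-metrics"]; shed == ["debug-traces"]
```

The point of the sketch is ownership: once telemetry has its own tier, shedding policy is a local decision instead of emergent mesh congestion.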

Why service meshes aren't designed for this

The Airbnb post is explicit: "Our service mesh was originally designed around business workloads, not a world where every service continuously pushes telemetry to a central store."

Service meshes optimise for L7 features useful to request/response traffic: retries, circuit breaking, mTLS, distributed tracing. The telemetry push pattern — continuous, high-volume, batched, one-directional — is a different workload shape that benefits from a purpose-built networking tier.
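
That push workload shape can be sketched as a batching, fire-and-forget forwarder. This is a toy illustration assuming an abstract `send_fn` transport; none of the names come from the Airbnb post:

```python
import time

class PushBatcher:
    """Continuous, batched, one-directional telemetry push: samples
    accumulate and flush on a size or age threshold. There is no
    request/response round-trip for the mesh to add value to."""

    def __init__(self, send_fn, max_batch=3, max_age_s=10.0):
        self.send_fn = send_fn
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buf = []
        self.oldest = None

    def add(self, sample):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buf.append(sample)
        if (len(self.buf) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buf:
            self.send_fn(self.buf)          # fire-and-forget, one-directional
            self.buf, self.oldest = [], None

sent = []
b = PushBatcher(sent.append, max_batch=3)
for s in ("cpu=0.4", "mem=0.7", "rps=120"):
    b.add(s)
# sent == [["cpu=0.4", "mem=0.7", "rps=120"]]
```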

When the asymmetry doesn't force a split

Small orgs can ignore this. A startup with 20 services running through a single mesh rarely sees telemetry volume overwhelm business volume. The asymmetry becomes load-bearing when:

  • Service count passes O(100)
  • Metric cardinality per service is high
  • Telemetry SLOs approach or exceed business-traffic SLOs
  • Business-traffic dips leave telemetry as the dominant load (i.e., there's no quiet period)
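
The checklist above can be encoded as a rough heuristic. The specific thresholds are judgment calls from this note, not published rules:

```python
def telemetry_split_warranted(service_count, high_cardinality,
                              telemetry_slo_ge_business, no_quiet_period):
    """Rough heuristic for when a dedicated telemetry networking tier
    starts to pay for itself. Thresholds are illustrative, not rules."""
    signals = [
        service_count >= 100,        # service count passes O(100)
        high_cardinality,            # high per-service metric cardinality
        telemetry_slo_ge_business,   # telemetry SLOs rival business SLOs
        no_quiet_period,             # telemetry dominates even at trough
    ]
    return sum(signals) >= 2         # any two signals: worth evaluating

telemetry_split_warranted(20, False, False, False)    # startup case: False
telemetry_split_warranted(500, True, False, False)    # large fleet: True
```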

Before that, Istio / Linkerd / whatever-mesh is fine for both. After that, the split is worth the engineering investment.

Caveats

  • "Orders of magnitude" is not quantified in the Airbnb post. The ratio between Airbnb's telemetry and business traffic is not disclosed.
  • Some of the asymmetry can be reduced at the source — streaming aggregation (vmagent), cardinality limits, sampling. These are complementary to, not substitutes for, the networking-layer split.
  • Splitting the networking layer is not free. You now run a second proxy tier with its own operational burden.

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's "orders of magnitude more observability traffic than business traffic" framing is the anchor; the three failure modes (circular dependency, congestion-induced blindness, telemetry-as-noisy-neighbour) are all enumerated in the post.