Observability traffic volume asymmetry
Definition
Observability traffic volume asymmetry is the architectural property that, at scale, an org's observability telemetry traffic (metrics scrapes, remote writes, logs, traces) is orders of magnitude larger than its business traffic — and places qualitatively different demands on the networking layer.
Canonical datum (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale):
"Observability data is uniquely high-volume. At Airbnb's scale, we send orders of magnitude more observability traffic than business traffic, which makes networking a key foundation of our observability stack."
Why the asymmetry exists
- Every service emits telemetry continuously; only some services handle business requests, and business traffic follows a natural user-driven cadence (bursty, non-uniform).
- Metrics cardinality explodes with fleet size. Thousands of instances across ~1,000 services, each emitting hundreds of time series on a 15-second scrape cadence, yields a samples-per-second rate that's decoupled from customer-request rate (see the worked sketch after this list).
- Telemetry is persistent background load, not event-driven traffic. When business traffic dips overnight, observability traffic stays flat or rises (batch jobs, cron workflows, ops scripts).
- Retention amplifies write volume. Metrics pipelines typically ingest 100% of emitted samples even though only a small fraction is ever queried; business traffic typically has a much tighter read-to-write ratio.
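A back-of-envelope calculation makes the decoupling concrete. A minimal Go sketch, with fleet numbers that are illustrative assumptions matching the shape above rather than Airbnb's disclosed figures:

```go
package main

import "fmt"

// Back-of-envelope metrics ingest rate for the fleet shape described
// above. All numbers are illustrative assumptions, not Airbnb's figures.
func main() {
	const (
		instances         = 10_000 // "thousands of instances"
		seriesPerInstance = 300    // "hundreds of time series each"
		scrapeIntervalSec = 15.0   // scrape cadence
	)

	totalSeries := instances * seriesPerInstance
	samplesPerSec := float64(totalSeries) / scrapeIntervalSec

	fmt.Printf("active series:  %d\n", totalSeries)
	fmt.Printf("samples/second: %.0f\n", samplesPerSec)
	// ~3M active series -> ~200k samples/s, fixed by fleet size and
	// cardinality rather than by how many customer requests are in flight.
}
```

Note what the divisor is: the scrape interval, not the request rate. Business traffic can halve overnight and this number does not move.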
Three networking-layer consequences
The asymmetry is what forces a separate networking tier for telemetry rather than carrying it through a shared service mesh. Three distinct failure modes follow from running telemetry and business traffic through the same mesh:
1. Circular dependency
If the service mesh's data-plane metrics flow through the same data plane, a mesh failure breaks the signal for diagnosing the mesh failure. Airbnb names this:
"We couldn't rely on the same data plane for monitoring product and infrastructure applications as for business traffic. That would create a circular dependency — metrics for the data plane would depend on that same data plane to be delivered."
See concepts/circular-dependency.
2. Congestion-induced blindness
At scale, telemetry volume grows faster than business volume. The shared mesh becomes a contention point where telemetry scrape failures correlate with high business traffic — i.e., your monitoring loses coverage exactly when the system is most stressed. Per the Airbnb post:
"As usage grew, congestion could make metrics unavailable, eroding critical debuggability for both platform engineers and product developers."
3. Telemetry spikes as a noisy-neighbour to business traffic
The reverse direction is equally real. A telemetry-side incident (metric storm, bad dashboard, post-deploy sample flood) can consume shared mesh capacity and degrade the product. Per the Airbnb post:
"Worse, telemetry spikes could also consume shared capacity and degrade or disrupt application traffic, directly impacting Airbnb.com availability."
This is the noisy-neighbour failure mode, but where the observability stack is the noisy neighbour — an often-overlooked direction.
Architectural response: separate the networking tier
Airbnb's response was to build a custom L7 Envoy tier specifically for observability traffic — independent of the Istio mesh that carries business traffic. Owning the telemetry networking layer let the team add (see the sketch after this list):
- Strict prioritisation (telemetry can be de-prioritised during business-traffic spikes, or vice versa, depending on which signal the tier is tuned to protect)
- Isolation from business-traffic contention
- Custom routing for telemetry use-cases (mirroring, vendor fan-out, per-tenant ACLs) that a general-purpose mesh isn't tuned for
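What prioritisation and isolation can look like in practice: a minimal Go sketch of a standalone telemetry ingest proxy with a hard concurrency cap, so a telemetry spike sheds load inside the telemetry tier instead of competing with product traffic. The backend URL, path, and cap are illustrative assumptions; Airbnb's actual tier is built on Envoy, and this sketch is not their configuration.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Illustrative backend: the metrics store this tier fronts.
	backend, err := url.Parse("http://metrics-ingest.internal:9090")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	// Hard concurrency cap: a buffered channel as a token bucket, so a
	// telemetry spike saturates this tier rather than a shared mesh.
	const maxInFlight = 512
	tokens := make(chan struct{}, maxInFlight)

	http.HandleFunc("/api/v1/write", func(w http.ResponseWriter, r *http.Request) {
		select {
		case tokens <- struct{}{}:
			defer func() { <-tokens }()
			proxy.ServeHTTP(w, r)
		default:
			// Shed load with a retryable status; most remote-write
			// clients can be configured to back off and resend, so
			// samples are delayed rather than lost.
			w.Header().Set("Retry-After", "5")
			http.Error(w, "telemetry tier at capacity", http.StatusTooManyRequests)
		}
	})

	// Dedicated listener, separate from the mesh carrying product traffic.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The design point being illustrated is the boundary: overload is absorbed and signalled inside the telemetry tier, so neither direction of the noisy-neighbour problem crosses into the mesh that carries business traffic.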
Why service meshes aren't designed for this
The Airbnb post is explicit: "Our service mesh was originally designed around business workloads, not a world where every service continuously pushes telemetry to a central store."
Service meshes optimise for L7 features useful to request/response traffic: retries, circuit breaking, mTLS, distributed tracing. The telemetry push pattern — continuous, high-volume, batched, one-directional — is a different workload shape that benefits from a purpose-built networking tier.
When the asymmetry doesn't force a split
Small orgs can ignore this. A startup with 20 services running through a single mesh rarely sees telemetry volume overwhelm business volume. The asymmetry becomes load-bearing when:
- Service count passes O(100)
- Metric cardinality per service is high
- Telemetry SLOs approach or exceed business-traffic SLOs
- Business-traffic dips leave telemetry as the dominant load (i.e., there's no quiet period)
Before that, Istio / Linkerd / whatever-mesh is fine for both. After that, the split is worth the engineering investment.
Caveats
- "Orders of magnitude" is not quantified in the Airbnb post. The ratio between Airbnb's telemetry and business traffic is not disclosed.
- Some of the asymmetry can be reduced at the source — streaming aggregation (vmagent), cardinality limits, sampling; a toy sketch follows this list. These are complementary to, not substitutes for, the networking-layer split.
- Splitting the networking layer is not free. You now run a second proxy tier with its own operational burden.
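A toy Go sketch of the source-side reduction mentioned above: streaming aggregation that collapses the per-instance label before samples leave the host, in the spirit of vmagent-style stream aggregation. All names and numbers are illustrative; this mirrors the idea, not any tool's implementation.

```go
package main

import "fmt"

// sample is a single counter observation; the instance label is the
// high-cardinality dimension collapsed before shipping.
type sample struct {
	metric   string
	service  string
	instance string
	value    float64
}

// aggregate sums counter values across instances, keyed only by the
// surviving labels, turning N per-instance series into one per service.
func aggregate(in []sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		key := s.service + "/" + s.metric // instance label dropped
		out[key] += s.value
	}
	return out
}

func main() {
	raw := []sample{
		{"http_requests_total", "search", "i-001", 120},
		{"http_requests_total", "search", "i-002", 95},
		{"http_requests_total", "search", "i-003", 110},
	}
	for key, v := range aggregate(raw) {
		fmt.Printf("%s = %.0f (3 series -> 1)\n", key, v)
	}
}
```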
Seen in
- sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's "orders of magnitude more observability traffic than business traffic" framing is the anchor; the three failure modes (circular dependency, congestion-induced blindness, telemetry-as-noisy-neighbour) are all enumerated in the post.