

Custom L7 proxy for telemetry over service mesh

Pattern

Build a purpose-built L7 ingress tier for observability / telemetry traffic, running independently of the general-purpose service mesh that carries business traffic. The custom tier handles the volume, routing, and isolation requirements of telemetry in a way a shared mesh cannot.

When it fits

  • Orders-of-magnitude more telemetry than business traffic. The shared mesh was sized for business workloads; telemetry dominates. (See concepts/observability-traffic-volume-asymmetry.)
  • Circular-dependency risk. The mesh carrying its own observability traffic means a mesh failure blinds the operator at exactly the wrong moment. (See concepts/circular-dependency.)
  • Need for strict isolation between telemetry and business traffic. Either direction of spillover (noisy-neighbour) is unacceptable.
  • Routing requirements that diverge from what a mesh provides. Tenant-per-service, header-based routing, metric mirroring, vendor fan-out, per-tenant ACLs — these sit squarely in the observability team's domain and are easier to build onto a dedicated proxy than to extend through a general mesh.

When it doesn't fit

  • Small orgs (under ~100 services) where mesh congestion is not a near-term risk, and the engineering investment in a dedicated proxy tier exceeds the marginal reliability gain.
  • Telemetry volume still << business volume. If the asymmetry hasn't materialised, adding a second proxy layer is unnecessary complexity.
  • Mesh is the only available routing substrate. Some orgs have mesh-only networking policies; a custom proxy may not be politically or operationally feasible.
  • Dedicated platform team is unavailable. Running a second proxy tier is ongoing operational cost.

Substrate

  • Envoy is the canonical choice — L7 proxy, CNCF graduated, battle-tested as both a mesh data plane and a standalone proxy. Airbnb's custom tier is built on Envoy.
  • Routing is typically header-driven: every request carries a tenant header, and the proxy looks the header up in an in-memory map to select the backend cluster. See concepts/tenant-header-routing.
  • Independent of the existing service mesh's control plane. If the mesh's control plane is down, the custom tier should keep working.
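The header-driven lookup described above can be sketched as a minimal routing function. This is a sketch under assumed names (the header key, tenant names, and cluster names are all illustrative); a real Envoy deployment would express this in route config or a filter rather than application code.

```python
# Minimal sketch of tenant-header routing: map a tenant header to a
# backend cluster, with an explicit fallback for unknown tenants.
# Header key, tenant names, and cluster names are illustrative.

TENANT_HEADER = "x-tenant-id"

# In-memory routing table, as the pattern describes; in practice it would
# be refreshed from a control source independent of the mesh control plane.
ROUTING_TABLE = {
    "payments": "payments-telemetry-backend",
    "search": "search-telemetry-backend",
}

DEFAULT_CLUSTER = "catch-all-telemetry-backend"


def select_cluster(headers: dict) -> str:
    """Return the backend cluster for a request based on its tenant header."""
    tenant = headers.get(TENANT_HEADER)
    if tenant is None:
        # A missing header is a client-library failure mode: route to a
        # default/quarantine cluster rather than dropping the data.
        return DEFAULT_CLUSTER
    return ROUTING_TABLE.get(tenant, DEFAULT_CLUSTER)
```

The explicit default cluster is the design choice worth noting: with ~1,000 tenants and client-controlled headers, the proxy must decide up front what happens to unroutable traffic.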

Canonical instance (Airbnb)

From the 2026-05-05 post:

"We built a custom Layer 7 network ingress layer based on Envoy that load-balances traffic and routes read and write requests to the right backends. Running this proxy independent of the shared compute layer added fault tolerance and shielded our ingest path from service-mesh failures."

Key properties:

  • Independent of the Istio mesh that carries Airbnb's business traffic.
  • Tenant-header-based routing for ~1,000 services, each its own tenant in a single global user space.
  • Extensibility hook for features the mesh doesn't provide: metric mirroring to alternate destinations for testing, fine-grained access controls for external vendor integrations and specialised use cases.
  • Ownership: the Observability team owns the proxy tier; the Cloud team owns the shared mesh.

The compute-vs-networking asymmetry in own-vs-adopt decisions

Airbnb explicitly contrasts its decision to adopt dedicated-but-managed Kubernetes (from the Cloud team) with its decision to own the networking layer:

"For compute, Kubernetes was already a mature, managed foundation operated by the Cloud team... The networking layer was different: our service mesh couldn't cleanly isolate and prioritize observability traffic from business traffic at our scale, and the features we needed — strict prioritization, isolation, and custom routing for telemetry — sat squarely within our team's domain. Owning this layer gave us the control we wanted and, compared to running Kubernetes ourselves, it was a much more straightforward surface to operate."

Principle: own the layer whose requirements diverge most from what the shared foundation provides, and adopt the layer where convergence is strong. Networking for telemetry is the divergent layer; Kubernetes for compute is the convergent layer.

Failure modes

  • The custom tier becomes another single point of failure. It must itself be HA; if it goes down, all telemetry stops.
  • Drift between mesh and custom-tier policy. If security / compliance rules apply at the mesh level and the custom tier bypasses the mesh, there's a policy gap unless the custom tier re-implements them.
  • Operational duplication. Two proxy tiers means two upgrade procedures, two sets of runbooks, two sets of metrics about the proxies themselves. The Observability team inherits the operational burden of the tier they built.
  • Client library coordination. If telemetry clients must include the tenant header, the instrumentation library becomes load-bearing for correct routing — a misconfigured or old-version library misroutes requests.
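The client-library concern can be made concrete with a sketch of the instrumentation side (header name and service-name source are hypothetical): correct routing depends on every emitter attaching the tenant header.

```python
# Sketch of the client side: the instrumentation library stamps the tenant
# header on every outbound telemetry request. A misconfigured or outdated
# library that omits the header leaves the proxy unable to route correctly.
# The header name and SERVICE_NAME convention are illustrative assumptions.
import os

TENANT_HEADER = "x-tenant-id"


def telemetry_headers() -> dict:
    """Headers for an outbound telemetry request, including the tenant tag."""
    # The library, not the application, decides the tenant identity; this is
    # what makes it load-bearing for routing across every service.
    service = os.environ.get("SERVICE_NAME", "unknown-service")
    return {TENANT_HEADER: service}
```

Because this code ships inside every service, rolling out a header change means coordinating ~1,000 deployments, which is the coordination cost the failure mode names.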

Relationship to other wiki patterns

  • patterns/proxyless-service-mesh — alternative posture at the business-traffic layer: skip the mesh entirely and push routing into client libraries. Opposite move; both reject the sidecar mesh model for their respective workloads.
  • patterns/separate-routing-from-model-selection — Netflix's sibling pattern at the ML-serving altitude: header-based routing via Envoy (Lightbulb + Envoy) avoids the serialization tax of an in-path proxy. Same shape; different workload class.

Caveats

  • No latency or p99 numbers disclosed for Airbnb's custom tier vs the mesh alternative.
  • The pattern is not prescriptive about whether the custom tier should be in-AZ, multi-AZ, or regional. Airbnb's topology choices are not enumerated.
  • No quantitative resource cost (CPU, memory, dollar cost) disclosed.

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical wiki instance. Airbnb's Observability team built a custom Envoy-based L7 ingress tier for ~1,000-service tenant-header-routed telemetry, independent of the Istio mesh carrying business traffic. Three motivating concerns: circular dependency, congestion-induced blindness, and telemetry-as-noisy-neighbour on Airbnb.com traffic. Extensibility hooks for metric mirroring and fine-grained ACLs.