
SYSTEM Cited by 2 sources

Databricks Endpoint Discovery Service (EDS)

Databricks' in-house xDS control plane for Kubernetes service discovery and load balancing. A lightweight server that watches the Kubernetes API for Services and EndpointSlices, maintains a live topology view (zone, readiness, shard labels per pod), and streams that topology as xDS / EDS responses to two kinds of clients:

  1. Armeria RPC clients embedded in Scala services (internal service-to-service traffic).
  2. Envoy ingress gateways via standard xDS, programming ClusterLoadAssignment resources (public-facing traffic).

This single source of truth is the trick: internal and external traffic route off the same endpoint state, with the same health/zone metadata, without going through CoreDNS or kube-proxy on the critical path.

What it replaces

  • CoreDNS for intra-cluster service resolution (kept for compatibility, off critical path).
  • kube-proxy's per-connection L4 pod selection, which causes traffic skew on long-lived HTTP/2 / gRPC connections.
  • Kubernetes ClusterIP → pod-IP NAT shim in the kernel.
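
The skew from per-connection selection can be illustrated with a toy model. This is a sketch, not a model of kube-proxy itself: the pod names, client names, and request counts are made up, and round-robin stands in for whatever selection is applied at connect time. It only contrasts picking a pod once per long-lived connection with picking one per request.

```python
from collections import Counter
from itertools import cycle

def per_connection(pods, clients):
    """Pick a pod once per client connection; every request reuses it."""
    load = Counter({p: 0 for p in pods})
    picker = cycle(pods)
    for _name, requests in clients.items():
        pod = next(picker)          # chosen at connect time only
        load[pod] += requests       # all multiplexed requests land here
    return load

def per_request(pods, clients):
    """Pick a pod for every individual request (client-side balancing)."""
    load = Counter({p: 0 for p in pods})
    picker = cycle(pods)
    for _name, requests in clients.items():
        for _ in range(requests):
            load[next(picker)] += 1
    return load

# Hypothetical workload: three clients with very different request volumes.
pods = ["pod-a", "pod-b", "pod-c"]
clients = {"batch-job": 9000, "web": 600, "cron": 30}

print(per_connection(pods, clients))  # one pod carries almost everything
print(per_request(pods, clients))     # load is even across pods
```

With one HTTP/2 or gRPC connection per client, the busiest client's entire volume is pinned to a single pod, which is the imbalance the bullet above describes.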

What it emits

  • xDS responses (LDS/CDS/EDS equivalents; the Databricks post focuses on EDS, the endpoint discovery part of xDS) to subscribing clients.
  • Endpoint metadata: zone, readiness, shard label, pod health as observed via EndpointSlices.
  • For Envoy consumers: ClusterLoadAssignment resources so ingress routing matches internal routing.
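
As a rough illustration of the projection, the sketch below turns one EndpointSlice-shaped dict into a ClusterLoadAssignment-shaped dict grouped by zone. The field names follow the public Kubernetes EndpointSlice and Envoy ClusterLoadAssignment schemas, but the function name and sample data are hypothetical, and nothing here reflects Databricks' actual code. Shard labels are left out because the source does not say how they are carried.

```python
from collections import defaultdict

def to_cluster_load_assignment(endpoint_slice):
    """Project an EndpointSlice-like dict into a ClusterLoadAssignment-like dict."""
    service = endpoint_slice["metadata"]["labels"]["kubernetes.io/service-name"]
    port = endpoint_slice["ports"][0]["port"]  # assumption: single-port service
    by_zone = defaultdict(list)
    for ep in endpoint_slice["endpoints"]:
        # Kubernetes says an unset `ready` should usually be treated as ready.
        ready = ep.get("conditions", {}).get("ready") is not False
        for addr in ep["addresses"]:
            by_zone[ep.get("zone", "")].append({
                "endpoint": {"address": {"socket_address": {
                    "address": addr, "port_value": port}}},
                "health_status": "HEALTHY" if ready else "UNHEALTHY",
            })
    return {
        "cluster_name": service,
        "endpoints": [
            {"locality": {"zone": zone}, "lb_endpoints": eps}
            for zone, eps in sorted(by_zone.items())
        ],
    }

# Hypothetical sample data.
sample_slice = {
    "metadata": {"labels": {"kubernetes.io/service-name": "orders"}},
    "ports": [{"port": 8080}],
    "endpoints": [
        {"addresses": ["10.0.1.5"], "conditions": {"ready": True}, "zone": "us-west-2a"},
        {"addresses": ["10.0.2.7"], "conditions": {"ready": False}, "zone": "us-west-2b"},
    ],
}
cla = to_cluster_load_assignment(sample_slice)
print(cla["cluster_name"], [loc["locality"]["zone"] for loc in cla["endpoints"]])
```

Grouping by locality keeps the zone next to each endpoint, so any consumer of the same feed can make the same zone-aware decision.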

Design notes

  • Bypasses DNS entirely on the critical path. DNS caching and lack of metadata were the explicit motivation; EDS pushes updates rather than relying on TTLs.
  • Read-only projection of Kubernetes state — it doesn't own the truth, it reprojects EndpointSlices into a streaming, topology-aware feed.
  • Horizontal scaling is implicit: clients subscribe only to services they depend on, so the control plane's work grows with the rate of endpoint changes and the number of subscriptions (watch + projection), not with request volume.
  • Same control plane for multiple consumer shapes (RPC client library + Envoy) — the xDS protocol is the compatibility layer.
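
The push-and-subscribe behaviour in these notes can be sketched in a few lines. Everything here is hypothetical (the `ControlPlane` class, its method names, the callback shape). It only shows that an endpoint change is pushed to the subscribers of that one service, and that the control plane does no work per request.

```python
from collections import defaultdict

class ControlPlane:
    """Toy read-only projection: holds endpoint state, pushes changes to subscribers."""

    def __init__(self):
        self.endpoints = {}                   # service -> current endpoint list
        self.subscribers = defaultdict(list)  # service -> update callbacks

    def subscribe(self, service, on_update):
        self.subscribers[service].append(on_update)
        if service in self.endpoints:
            # Send the current snapshot so a new subscriber starts consistent.
            on_update(service, self.endpoints[service])

    def on_kubernetes_event(self, service, endpoints):
        """Called when the (simulated) watch reports a change for `service`."""
        self.endpoints[service] = list(endpoints)
        for callback in self.subscribers.get(service, []):
            callback(service, self.endpoints[service])

received = []
cp = ControlPlane()
cp.subscribe("orders", lambda svc, eps: received.append((svc, tuple(eps))))

cp.on_kubernetes_event("orders", ["10.0.1.5:8080"])
cp.on_kubernetes_event("billing", ["10.0.3.9:8080"])  # no subscriber, nothing pushed
cp.on_kubernetes_event("orders", ["10.0.1.5:8080", "10.0.2.7:8080"])

print(received)
```

The client sees each change when it happens rather than after a DNS TTL expires, and never receives updates for services it did not subscribe to.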

Seen in

  • sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing — Databricks describes EDS as its custom control plane; details what it watches, what it emits, and how it integrates with Armeria clients and Envoy ingress.
  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together: EDS at 200,000+ QPS on managed external inference. The 2026-05-08 Databricks / Superhuman post extends EDS beyond intra-cluster RPC into the Databricks Model Serving stack, where it is the substrate for a custom Power-of-Two-Choices load balancer: "At the core of our approach is the Endpoint Discovery Service (EDS) — a lightweight control plane that continuously monitors the Kubernetes API for changes to Services and EndpointSlices. EDS drives a custom load balancing algorithm based on the power of two choices." The post reports 200K+ QPS sustained with sub-1s p99 latency and four-nines (99.99%) reliability, and frames the design explicitly against default Kubernetes round-robin: "While the default Kubernetes round robin load balancer is sufficient at low QPS, our tests revealed that this performance degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency." The substrate is the same but its scope is wider: internal service-to-service traffic and external real-time inference now share one control plane that watches the Kubernetes API and streams endpoint state.
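
For readers unfamiliar with the power of two choices, here is a minimal sketch of the generic technique: sample two backends at random and send the request to the less loaded one. It is not Databricks' implementation. The load signal (a simple request counter), the backend count, and the toy model in which requests never complete are all assumptions made for illustration.

```python
import random

def pick_p2c(load, rng):
    """Power of two choices: sample two distinct backends, take the less loaded."""
    a, b = rng.sample(range(len(load)), 2)
    return a if load[a] <= load[b] else b

def pick_random(load, rng):
    """Baseline: one uniformly random backend, ignoring load."""
    return rng.randrange(len(load))

def simulate(pick, backends=100, requests=10_000, seed=0):
    """Assign requests one by one and return the load on the busiest backend."""
    rng = random.Random(seed)
    load = [0] * backends
    for _ in range(requests):
        load[pick(load, rng)] += 1
    return max(load)

print("busiest backend, random choice:", simulate(pick_random))
print("busiest backend, two choices:  ", simulate(pick_p2c))
```

The busiest backend under two choices stays close to the mean of 100 requests, while purely random assignment leaves a noticeably hotter outlier. That kind of hotspot is what the quoted post blames for tail-latency spikes.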