PATTERN Cited by 2 sources

Kubernetes-API-driven custom load balancer

Pattern

Replace the default load-balancing path (kube-proxy, Service ClusterIP, round-robin) with a lightweight in-house control plane that watches the Kubernetes API directly for Services and EndpointSlices, projects them into a streaming endpoint feed, and drives a custom client-side load-balancing algorithm (typically Power-of-Two-Choices, P2C) on the consumer side.

The pattern is the answer to the empirical observation that default Kubernetes round-robin LB degrades at high QPS, creating hotspots that spike tail latency. The fix needs three things at once: (a) a faster-than-DNS endpoint propagation channel, (b) endpoint metadata richer than a flat IP list, and (c) a per-request algorithm smarter than round-robin.

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together; prior canonicalisation: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)

Components

  1. Endpoint Discovery Service (EDS) — a lightweight server that watches the Kubernetes API for changes to Services and EndpointSlices, maintains a live topology view (zone, readiness, shard labels per pod), and streams that topology to subscribers. (Source: systems/databricks-endpoint-discovery-service)
  2. Custom client-side LB algorithm. Subscribers pick the algorithm that fits the workload — typically P2C for stateless services, consistent hashing with bounded load for sharded ones.
  3. Subscriber endpoints. Two canonical shapes in the Databricks stack: in-process Armeria RPC clients (Scala services) and Envoy ingress gateways (xDS).
  4. No CoreDNS, no kube-proxy on the critical path. DNS is bypassed; kube-proxy's per-connection L4 pod selection is bypassed. Endpoint state flows directly from the K8s API to the LB algorithm.
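
As a rough illustration of the first component, the projection from watch events to a live topology view might look like the sketch below. The `Topology` and `Endpoint` names are hypothetical and not taken from the Databricks posts. Events are assumed to be plain dicts shaped like Kubernetes watch events carrying `discovery.k8s.io/v1` EndpointSlice objects, and the transport that delivers them (and the streaming to subscribers) is left out.

```python
from dataclasses import dataclass

SERVICE_LABEL = "kubernetes.io/service-name"  # standard label tying a slice to its Service


@dataclass(frozen=True)
class Endpoint:  # hypothetical projection of one pod address
    address: str
    zone: str
    ready: bool


class Topology:
    """Hypothetical live view: service name -> slice name -> endpoints."""

    def __init__(self):
        self._slices = {}

    def apply(self, event):
        """Apply one event shaped like {'type': ..., 'object': <EndpointSlice dict>}."""
        obj = event["object"]
        meta = obj["metadata"]
        service = meta["labels"][SERVICE_LABEL]
        slices = self._slices.setdefault(service, {})
        if event["type"] == "DELETED":
            slices.pop(meta["name"], None)
        else:  # ADDED and MODIFIED both replace the whole slice
            slices[meta["name"]] = [
                # An unset ready condition is treated as ready; only an
                # explicit False marks the endpoint as not ready.
                Endpoint(addr, ep.get("zone", ""),
                         ep.get("conditions", {}).get("ready") is not False)
                for ep in obj.get("endpoints", [])
                for addr in ep["addresses"]
            ]
        return self.snapshot(service)

    def snapshot(self, service):
        """Flatten every slice of a service into one address-sorted list."""
        merged = [ep for eps in self._slices.get(service, {}).values() for ep in eps]
        return sorted(merged, key=lambda ep: ep.address)
```

Each call to `apply` returns the full endpoint list for the affected service, which is the unit a subscriber would receive. A production control plane would also need resync handling and shard labels, both omitted here.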

Why it works

  • DNS TTLs are too coarse for endpoint changes. A scale-up event takes tens of seconds to propagate via DNS even with aggressive TTLs. EDS pushes updates within a watch cycle.
  • EndpointSlices carry richer metadata than the flat IP list a DNS A-record can express — zone, shard, readiness flags. The custom LB can use them; default LB can't.
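  • Round-robin doesn't account for variable request cost. A pod stuck on a slow request still receives its full share of new requests under RR, so work piles up behind it; P2C compares in-flight counts and steers new requests to the less loaded candidate.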
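
The round-robin point can be made concrete with a toy simulation. It is a deliberately extreme sketch rather than a benchmark: one pod is assumed to stall completely while every other pod finishes instantly, and the function names are illustrative.

```python
import itertools
import random


def simulate(pods, stalled, requests, pick):
    """Send `requests` requests; the stalled pod never finishes, others finish instantly."""
    in_flight = {p: 0 for p in pods}
    for _ in range(requests):
        pod = pick(in_flight)
        if pod == stalled:
            in_flight[pod] += 1  # request stays open for the whole run
    return in_flight


def round_robin(pods):
    cycle = itertools.cycle(pods)
    return lambda in_flight: next(cycle)  # ignores load entirely


def power_of_two(pods, rng):
    def pick(in_flight):
        a, b = rng.sample(pods, 2)  # two distinct candidates
        return a if in_flight[a] <= in_flight[b] else b
    return pick


pods = ["pod-0", "pod-1", "pod-2", "pod-3"]
rr = simulate(pods, "pod-0", 1000, round_robin(pods))
p2c = simulate(pods, "pod-0", 1000, power_of_two(pods, random.Random(0)))
# Round-robin queues 250 of the 1000 requests behind the stalled pod;
# P2C queues at most 1, because the pod loses every comparison once it holds a request.
print(rr["pod-0"], p2c["pod-0"])
```

Real workloads are less binary than this, but the mechanism is the same: round-robin cannot see in-flight work, and P2C can.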

Canonical wiki disclosure: 200K QPS LLM serving (Superhuman)

The 2026-05-08 Databricks/Superhuman post canonicalises the pattern at production GPU inference scale:

"Superhuman's grammar correction endpoint traffic exhibits strong diurnal patterns with rapid ramps in certain periods, often exceeding 200k QPS. While the default Kubernetes round robin load balancer is sufficient at low QPS, our tests revealed that this performance degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency."

"At the core of our approach is the Endpoint Discovery Service (EDS) — a lightweight control plane that continuously monitors the Kubernetes API for changes to Services and EndpointSlices. EDS drives a custom load balancing algorithm based on the power of two choices. For each request, two candidate pods are sampled and traffic is routed to whichever has fewer active requests, preventing the hotspots that round-robin creates at high QPS."

The Superhuman ingest extends the prior 2025-10-01 EDS post (which canonicalised the same architecture for internal Databricks RPC traffic at the platform tier) with a public-internet-grade, GPU-inference-grade, 200K-QPS validation point. Same control plane, new altitude.
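
The quoted algorithm (sample two pods, route to the one with fewer active requests) could be wrapped on the client side roughly as follows. This is a minimal sketch, not the Databricks implementation: `P2CBalancer` and its methods are hypothetical names, and the endpoint list is assumed to arrive from an EDS-style feed via `update`.

```python
import random
import threading
from contextlib import contextmanager


class P2CBalancer:
    """Hypothetical client-side picker keyed on active (in-flight) requests."""

    def __init__(self, endpoints, rng=None):
        self._lock = threading.Lock()
        self._active = {ep: 0 for ep in endpoints}
        self._rng = rng or random.Random()

    def update(self, endpoints):
        """Swap in a new endpoint set, keeping counters for endpoints that survive."""
        with self._lock:
            self._active = {ep: self._active.get(ep, 0) for ep in endpoints}

    @contextmanager
    def request(self):
        """Pick an endpoint and count it as active until the with-block exits."""
        with self._lock:
            candidates = list(self._active)
            if not candidates:
                raise RuntimeError("no endpoints available")
            sample = self._rng.sample(candidates, min(2, len(candidates)))
            chosen = min(sample, key=self._active.__getitem__)
            self._active[chosen] += 1
        try:
            yield chosen
        finally:
            with self._lock:
                if chosen in self._active:  # endpoint may have been removed by update()
                    self._active[chosen] = max(0, self._active[chosen] - 1)

    def active(self, endpoint):
        with self._lock:
            return self._active.get(endpoint, 0)
```

A caller would write `with lb.request() as endpoint:` around the RPC, so the counter drops again whether the call succeeds or raises.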

When to use

  • High-QPS service-to-service or ingress traffic where hotspots from round-robin become a tail-latency problem.
  • Workloads with variable per-request cost (LLM inference, retrieval, search) where load distribution must respond to in-flight work, not just request count.
  • Kubernetes-native deployment where watching the API is cheaper than running a separate service registry.
  • Need for endpoint metadata beyond IP list — zone affinity, shard awareness, readiness gating.
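
To illustrate what metadata beyond an IP list buys the consumer, the sketch below applies readiness gating and zone affinity before any load-balancing algorithm runs. The `Endpoint` tuple, the `candidates` function, and the `min_local` fallback threshold are assumptions made for the example; the source does not describe how Databricks combines these signals.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):  # hypothetical shape of one feed entry
    address: str
    zone: str
    ready: bool


def candidates(endpoints, local_zone, min_local=1):
    """Return ready endpoints in the caller's zone, or all ready ones if too few are local."""
    ready = [ep for ep in endpoints if ep.ready]           # readiness gating
    local = [ep for ep in ready if ep.zone == local_zone]  # zone affinity
    return local if len(local) >= min_local else ready


feed = [
    Endpoint("10.0.0.1", "zone-a", True),
    Endpoint("10.0.0.2", "zone-a", False),
    Endpoint("10.0.0.3", "zone-b", True),
]
print([ep.address for ep in candidates(feed, "zone-a")])
print([ep.address for ep in candidates(feed, "zone-c")])
```

The first call keeps traffic in `zone-a` and skips the unready pod; the second finds no local endpoints and falls back to every ready endpoint rather than failing.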

When not to use

  • Low-QPS workloads where round-robin's distribution is good enough; building EDS adds complexity without payoff.
  • Multi-cluster routing where the Kubernetes API of one cluster is not the source of truth; needs a federated registry.
  • External traffic without an Envoy / xDS-capable consumer; wiring a custom LB into a third-party ingress takes additional engineering.

Operational shape

                        Kubernetes API
                              │ watch
                +-------------+-------------+
                | Endpoint Discovery Service|
                | (EDS)                     |
                |  - Services               |
                |  - EndpointSlices         |
                |  - zone / shard / ready   |
                +------+--------------------+
                       │ xDS stream
            ┌──────────┼─────────────┐
            ▼                        ▼
   Armeria RPC clients      Envoy ingress gateways
   (in-process LB)          (xDS-driven LB)
            │                        │
            │ P2C                    │ P2C / CHLB
            ▼                        ▼
        Backend pod              Backend pod
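
The CHLB branch in the diagram refers to consistent hashing with bounded load, which the Components list names for sharded services. A minimal sketch of the general technique follows, assuming a non-empty node set, a SHA-256-based ring, and a load factor of 1.25. The class name and parameters are illustrative, and the source does not describe the actual implementation.

```python
import bisect
import hashlib
import math


def _hash(value):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class BoundedLoadRing:
    """Hypothetical ring; the caller maintains `load` (in-flight requests per node)."""

    def __init__(self, nodes, vnodes=64, factor=1.25):
        self.factor = factor  # must be >= 1 so some node is always under the cap
        self.load = {n: 0 for n in nodes}
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def pick(self, key):
        """Walk clockwise from the key's position to the first node under the load cap."""
        cap = math.ceil(self.factor * (sum(self.load.values()) + 1) / len(self.load))
        start = bisect.bisect_left(self._points, _hash(key))
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if self.load[node] < cap:
                return node
        raise RuntimeError("unreachable when factor >= 1")


ring = BoundedLoadRing(["shard-a", "shard-b", "shard-c"])
home = ring.pick("tenant-42")   # stable while the home node has headroom
ring.load[home] = 10            # simulate a hot shard
spill = ring.pick("tenant-42")  # spills to the next node under the cap
print(home != spill)
```

The cap keeps any node from exceeding roughly `factor` times the average load, trading a little key affinity for protection against hot shards.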

Sibling patterns

  • Default kube-proxy / Service ClusterIP path — the path this pattern replaces. It stays in place for compatibility but is no longer on the critical path.
  • patterns/proxyless-service-mesh — the broader architecture family; EDS is the control plane that makes proxyless feasible.
  • patterns/power-of-two-choices — the load-distribution algorithm typically used on top of EDS.
  • Sidecar-proxy mesh (default Istio / Linkerd) — the competing architecture; EDS-driven custom LB avoids the per-pod proxy.

Failure modes

  • EDS lag — propagation delay between K8s API change and LB update. Bounded by watch cycle latency.
  • EDS overload — too many subscribers, too many endpoints; the control plane bottlenecks. Mitigation: shard EDS, scope subscriptions to dependencies only.
  • Stale endpoint state — a pod begins shutting down before EDS notices, so consumers keep routing to a dying pod. Mitigation: graceful shutdown + readiness gate + active health checks on the consumer.
  • K8s API rate limits — heavy watch traffic can be throttled by the API server. Mitigation: subscribe to EndpointSlices (sharded) rather than full Endpoints, use API server caching.
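
The "scope subscriptions to dependencies only" mitigation for EDS overload can be pictured as a feed that fans out each update only to subscribers that declared the affected service. The sketch below is an in-process stand-in under that assumption; `ScopedFeed` is a hypothetical name, and a real control plane would stream over the network rather than invoke callbacks.

```python
from collections import defaultdict


class ScopedFeed:
    """Hypothetical fan-out that tracks which subscriber depends on which service."""

    def __init__(self):
        self._subs = defaultdict(list)  # service name -> subscriber callbacks

    def subscribe(self, services, callback):
        """Register a subscriber for the services it declares as dependencies."""
        for service in services:
            self._subs[service].append(callback)

    def publish(self, service, endpoints):
        """Deliver an update only to subscribers that depend on this service."""
        delivered = 0
        for callback in self._subs.get(service, ()):
            callback(service, endpoints)
            delivered += 1
        return delivered


feed = ScopedFeed()
seen_by_gateway, seen_by_billing = [], []
feed.subscribe(["inference", "auth"], lambda s, eps: seen_by_gateway.append(s))
feed.subscribe(["auth"], lambda s, eps: seen_by_billing.append(s))
feed.publish("inference", ["10.0.0.1"])  # reaches the gateway only
feed.publish("auth", ["10.0.1.1"])       # reaches both subscribers
```

Fan-out cost then scales with the number of dependents per service instead of the total subscriber count.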

Caveats

  • The Superhuman post does not disclose the EDS scaling shape (how many subscribers, how many endpoints, watch fanout) at the 200K QPS altitude.
  • P2C-on-active-requests is the algorithm; alternative metrics (latency EWMA, weighted by expected cost) are not benchmarked.
  • The pattern is Databricks-internal infrastructure; not packaged as a reusable open-source module.