PATTERN Cited by 2 sources
Kubernetes-API-driven custom load balancer¶
Pattern¶
Replace the default kube-proxy / round-robin / Service-ClusterIP
load-balancing path with a lightweight in-house control plane that
watches the Kubernetes API directly for Services and
EndpointSlices, projects that into a streaming endpoint feed, and
drives a custom client-side load-balancing algorithm (typically
Power-of-Two-Choices) on the
consumer side.
The pattern is the answer to the empirical observation that default Kubernetes round-robin LB degrades at high QPS, creating hotspots that spike tail latency. The fix needs three things at once: (a) a faster-than-DNS endpoint propagation channel, (b) endpoint metadata richer than a flat IP list, and (c) a per-request algorithm smarter than round-robin.
(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together; prior canonicalisation: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
Components¶
- Endpoint Discovery Service (EDS) — a lightweight server that watches the Kubernetes API for changes to Services and EndpointSlices, maintains a live topology view (zone, readiness, shard labels per pod), and streams that topology to subscribers. (Source: systems/databricks-endpoint-discovery-service)
- Custom client-side LB algorithm. Subscribers pick the algorithm that fits the workload — typically P2C for stateless services, consistent hashing with bounded load for sharded ones.
- Subscriber endpoints. Two canonical shapes in the Databricks stack: in-process Armeria RPC clients (Scala services) and Envoy ingress gateways (xDS).
- No CoreDNS, no kube-proxy on the critical path. DNS is bypassed; kube-proxy's per-connection L4 pod selection is bypassed. Endpoint state flows directly from the K8s API to the LB algorithm.
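For the sharded case named in the components list, consistent hashing with bounded load can be sketched as a hash ring that skips any endpoint already above a multiple of the average load. This is a minimal illustration of the general technique, not the Databricks implementation: the class name `BoundedLoadRing` and the parameters `load_factor` and `vnodes` are assumptions made for the example.

```python
import bisect
import hashlib
import math


def _hash(value: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class BoundedLoadRing:
    """Hypothetical consistent-hash ring that skips endpoints above a load bound."""

    def __init__(self, endpoints, load_factor=1.25, vnodes=64):
        if not endpoints:
            raise ValueError("need at least one endpoint")
        self.load_factor = load_factor
        self.load = {ep: 0 for ep in endpoints}  # in-flight requests per endpoint
        ring = sorted((_hash(f"{ep}#{i}"), ep) for ep in endpoints for i in range(vnodes))
        self._points = [h for h, _ in ring]
        self._owners = [ep for _, ep in ring]

    def _capacity(self) -> int:
        # Cap each endpoint at load_factor times the average load,
        # counting the request that is about to be placed.
        total = sum(self.load.values()) + 1
        return math.ceil(self.load_factor * total / len(self.load))

    def pick(self, key: str) -> str:
        cap = self._capacity()
        start = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        # Walk clockwise from the key's position until an endpoint has room.
        for step in range(len(self._points)):
            ep = self._owners[(start + step) % len(self._points)]
            if self.load[ep] + 1 <= cap:
                self.load[ep] += 1
                return ep
        raise RuntimeError("no endpoint under the load bound")

    def release(self, ep: str) -> None:
        """Call when the request finishes so the endpoint regains capacity."""
        self.load[ep] -= 1
```

A key keeps mapping to the same endpoint while that endpoint has room, which preserves shard affinity; only when the endpoint exceeds the bound does the key spill to the next endpoint on the ring.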
Why it works¶
- DNS TTLs are too coarse for endpoint changes. A scale-up event takes tens of seconds to propagate via DNS even with aggressive TTLs. EDS pushes updates within a watch cycle.
- EndpointSlices carry richer metadata than the flat IP list a DNS A-record can express — zone, shard, readiness flags. The custom LB can use them; default LB can't.
- Round-robin doesn't account for variable request cost. A pod stuck on a slow request keeps receiving its full share of new requests under round-robin; P2C steers traffic away from it by comparing active request counts.
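The difference between round-robin and P2C is easiest to see with one pod that never finishes its work. The sketch below is a toy model under stated assumptions, not a benchmark: the function names, pod names, and the "all other pods finish instantly" behaviour are invented for illustration.

```python
import itertools
import random


def p2c_pick(pods, in_flight, rng):
    """Sample two distinct pods and return the one with fewer active requests."""
    a, b = rng.sample(pods, 2)
    return a if in_flight[a] <= in_flight[b] else b


def simulate(pods, stuck_pod, n_requests, seed=0):
    """Count how many requests each policy sends to a pod that never completes.

    Every other pod is assumed to finish each request instantly, so its
    in-flight count is back to zero before the next request arrives.
    """
    rng = random.Random(seed)
    rotation = itertools.cycle(pods)  # round-robin ignores how busy a pod is
    in_flight = {p: 0 for p in pods}
    rr_hits = p2c_hits = 0
    for _ in range(n_requests):
        if next(rotation) == stuck_pod:
            rr_hits += 1
        chosen = p2c_pick(pods, in_flight, rng)
        if chosen == stuck_pod:
            in_flight[chosen] += 1  # never released: the pod is stuck
            p2c_hits += 1
    return rr_hits, p2c_hits


rr_hits, p2c_hits = simulate(["pod-a", "pod-b", "pod-c", "pod-d"], "pod-a", 100)
print(f"round-robin sent {rr_hits} requests to the stuck pod, P2C sent {p2c_hits}")
```

In this model round-robin sends the stuck pod a quarter of all traffic (25 of 100 requests), while P2C sends it at most one request: once its active count is above zero it loses every comparison against an idle pod.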
Canonical wiki disclosure: 200K QPS LLM serving (Superhuman)¶
The 2026-05-08 Databricks/Superhuman post canonicalises the pattern at production GPU inference scale:
"Superhuman's grammar correction endpoint traffic exhibits strong diurnal patterns with rapid ramps in certain periods, often exceeding 200k QPS. While the default Kubernetes round robin load balancer is sufficient at low QPS, our tests revealed that this performance degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency."
"At the core of our approach is the Endpoint Discovery Service (EDS) — a lightweight control plane that continuously monitors the Kubernetes API for changes to Services and EndpointSlices. EDS drives a custom load balancing algorithm based on the power of two choices. For each request, two candidate pods are sampled and traffic is routed to whichever has fewer active requests, preventing the hotspots that round-robin creates at high QPS."
The Superhuman ingest extends the prior 2025-10-01 EDS post (which canonicalised the same architecture for internal Databricks RPC traffic at the platform tier) with a public-internet-grade, GPU-inference-grade, 200K-QPS validation point. Same control plane, new altitude.
When to use¶
- High-QPS service-to-service or ingress traffic where hotspots from round-robin become a tail-latency problem.
- Workloads with variable per-request cost (LLM inference, retrieval, search) where load distribution must respond to in-flight work, not just request count.
- Kubernetes-native deployment where watching the API is cheaper than running a separate service registry.
- Need for endpoint metadata beyond IP list — zone affinity, shard awareness, readiness gating.
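As an illustration of what richer endpoint metadata makes possible, a consumer could narrow the candidate set before running its per-request algorithm. The following is a minimal sketch, assuming a hypothetical `Endpoint` record and a `min_local` threshold that are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint record carrying metadata beyond the IP address."""
    address: str
    zone: str
    ready: bool


def candidates(endpoints, local_zone, min_local=2):
    """Return ready endpoints, preferring the caller's zone.

    Falls back to all ready endpoints when the local zone has fewer than
    `min_local` ready pods, so a two-choice picker still has two options.
    """
    ready = [ep for ep in endpoints if ep.ready]
    local = [ep for ep in ready if ep.zone == local_zone]
    return local if len(local) >= min_local else ready
```

A flat DNS A-record list cannot express either filter, which is why the pattern depends on the EndpointSlice-derived feed rather than on name resolution.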
When not to use¶
- Low-QPS workloads where round-robin's distribution is good enough; building EDS adds complexity without payoff.
- Multi-cluster routing where the Kubernetes API of one cluster is not the source of truth; needs a federated registry.
- External traffic without an Envoy / xDS-capable consumer; routing custom LB into a third-party ingress takes additional engineering.
Operational shape¶
               Kubernetes API
                      ▲
                      │ watch
                      │
        +-------------+-------------+
        | Endpoint Discovery Service|
        | (EDS)                     |
        | - Services                |
        | - EndpointSlices          |
        | - zone / shard / ready    |
        +------+--------------------+
               │ xDS stream
    ┌──────────┼─────────────┐
    ▼                        ▼
Armeria RPC clients   Envoy ingress gateways
(in-process LB)       (xDS-driven LB)
    │                        │
    │ P2C                    │ P2C / CHLB
    ▼                        ▼
Backend pod             Backend pod
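The watch-and-project step at the top of the diagram might look roughly like the sketch below. The input dictionaries follow the JSON shape of `discovery.k8s.io/v1` EndpointSlice objects; the `project` function and the output record layout are hypothetical. The source does not say how EDS obtains shard labels, so they are left out here.

```python
def project(slices):
    """Flatten EndpointSlice objects into {service name: [endpoint records]}."""
    topology = {}
    for s in slices:
        # EndpointSlices are tied to their Service through this well-known label.
        service = s["metadata"]["labels"]["kubernetes.io/service-name"]
        for ep in s.get("endpoints") or []:
            conditions = ep.get("conditions") or {}
            # An unset `ready` condition means "unknown"; treat it as ready
            # and only exclude endpoints explicitly marked not ready.
            ready = conditions.get("ready") is not False
            pod = (ep.get("targetRef") or {}).get("name")
            for address in ep["addresses"]:
                topology.setdefault(service, []).append(
                    {"address": address, "zone": ep.get("zone"), "ready": ready, "pod": pod}
                )
    return topology
```

A real control plane would feed this from a long-lived watch and stream the result to subscribers; the function above only shows the shape of the projection.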
Sibling patterns¶
- Default kube-proxy / Service ClusterIP path — the path this pattern replaces. It stays in place for compatibility but is kept off the critical path.
- patterns/proxyless-service-mesh — the broader architecture family; EDS is the control plane that makes proxyless feasible.
- patterns/power-of-two-choices — the load-distribution algorithm typically used on top of EDS.
- Sidecar-proxy mesh (default Istio / Linkerd) — the competing architecture; EDS-driven custom LB avoids the per-pod proxy.
Failure modes¶
- EDS lag — propagation delay between a K8s API change and the LB update. Bounded by watch cycle latency.
- EDS overload — too many subscribers, too many endpoints; the control plane bottlenecks. Mitigation: shard EDS, scope subscriptions to dependencies only.
- Stale endpoint state — a pod starts shutting down before EDS notices, so requests are still routed to a dying pod. Mitigation: graceful shutdown + readiness gate + active health checks on the consumer.
- K8s API rate limits — heavy watch traffic can be throttled by the API server. Mitigation: subscribe to EndpointSlices (sharded) rather than full Endpoints, use API server caching.
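The "scope subscriptions to dependencies only" mitigation can be pictured as a feed that fans out each service's snapshot only to the consumers that asked for that service, and skips the fan-out when nothing changed. This is a sketch under assumptions: `EndpointFeed` and its methods are hypothetical names, and an in-process callback stands in for the real streaming transport.

```python
from collections import defaultdict


class EndpointFeed:
    """Hypothetical fan-out of per-service endpoint snapshots to scoped subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # service name -> callbacks
        self._latest = {}                      # service name -> last snapshot

    def subscribe(self, service, callback):
        """Register interest in one service; replay its current state if known."""
        self._subscribers[service].append(callback)
        if service in self._latest:
            callback(service, self._latest[service])

    def publish(self, service, endpoints):
        """Push a new snapshot; return how many subscribers were notified."""
        snapshot = tuple(sorted(endpoints))
        if self._latest.get(service) == snapshot:
            return 0  # unchanged: skip the fan-out entirely
        self._latest[service] = snapshot
        callbacks = self._subscribers.get(service, [])
        for callback in callbacks:
            callback(service, snapshot)
        return len(callbacks)
```

With this shape, the fan-out cost of an update grows with the number of consumers that depend on the changed service rather than with the total number of subscribers.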
Seen in¶
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical wiki instance at 200K+ QPS GPU inference altitude for Superhuman. Default K8s round-robin disclosed as inadequate at this QPS; EDS + P2C is the production fix; shadow-tested jointly by Databricks and Superhuman.
- sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing — prior canonicalisation at the internal RPC platform altitude; same control plane, applied to Armeria + Envoy consumers across Databricks' service mesh.
Caveats¶
- The Superhuman post does not disclose the EDS scaling shape (how many subscribers, how many endpoints, watch fanout) at the 200K QPS altitude.
- P2C-on-active-requests is the algorithm; alternative metrics (latency EWMA, weighted by expected cost) are not benchmarked.
- The pattern is Databricks-internal infrastructure; not packaged as a reusable open-source module.
Related¶
- patterns/power-of-two-choices — the LB algorithm of choice
- concepts/client-side-load-balancing — the deployment model
- concepts/control-plane-data-plane-separation — the architectural principle EDS embodies
- concepts/xds-protocol — the streaming dynamic-config API used to talk to Envoy consumers
- concepts/hotspot — the failure mode this pattern prevents
- concepts/tail-latency-at-scale — the SLO it defends
- systems/databricks-endpoint-discovery-service — the EDS implementation
- systems/kubernetes — the underlying source of truth
- systems/envoy — canonical xDS consumer
- patterns/proxyless-service-mesh — the broader architectural family