Intelligent Kubernetes Load Balancing at Databricks¶
Summary¶
Databricks replaced Kubernetes' default L4 kube-proxy load balancing with an in-house proxyless, client-side L7 load-balancing system backed by a custom xDS control plane (Endpoint Discovery Service / EDS). The core problem: gRPC over HTTP/2 uses long-lived connections, so kube-proxy's per-connection pod selection produces traffic skew, high tail latency, and over-provisioning. The Databricks RPC client (built into a shared Scala/Armeria framework) subscribes to EDS, maintains a live map of healthy endpoints with zone/shard metadata, and does Power of Two Choices load-balancing per request — with zone-affinity routing and automatic spillover on top. The same EDS speaks xDS to Envoy for ingress, so internal and external traffic share one source of truth. Result: uniform QPS, stabilized P90 latency, ~20% pod-count reduction across services. Istio / sidecar meshes and Kubernetes headless services were explicitly evaluated and rejected.
Key takeaways¶
- Layer-4 load-balancing + long-lived HTTP/2 connections = traffic skew. kube-proxy picks a pod once per TCP connection; gRPC reuses that connection, so a few pods hot-spot while others idle. The fix has to live at Layer 7, per-request, not at the kernel's IPVS/eBPF tables. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- DNS is the wrong control plane for dynamic service topology. CoreDNS caches and has no channel for endpoint metadata (zone, shard, readiness); headless services leak pod IPs but still can't carry weights or topology labels. Databricks removed DNS from the critical path and substituted a dedicated xDS control plane. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- xDS is the integration contract, not just a sidecar-mesh thing. The same EDS that streams endpoint updates to Armeria RPC clients also programs ClusterLoadAssignment on Envoy ingress gateways — so internal clients and the public edge route off one source of truth. (systems/envoy / concepts/xds-protocol). (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Power of Two Choices is enough for most services. The team explicitly says P2C — pick two random backends, send to the one with fewer active connections / lower load — "strikes a strong balance between performance and implementation simplicity, consistently leading to uniform traffic distribution." More sophisticated algorithms (CPU-skewed, metrics-based) were tried and rolled back. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Metrics-based routing is a trap. CPU/resource-utilization signals from monitoring have different SLOs than request-serving, and CPU is a trailing indicator. Databricks abandoned metric-driven load balancing and now routes off direct health signals observed by the client itself. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Proxyless > sidecar mesh, when you have one language. Istio / Ambient Mesh were rejected because (a) sidecar ops cost scales per-pod at Databricks' scale, (b) proxies add CPU/memory/latency, (c) request-aware strategies are hard through an external proxy. Key enabler: Databricks is "predominantly Scala" on a monorepo + fast CI, so a shared client library ships once. "Sidecar meshes win when you're polyglot; proxyless wins when you're mono-lingual and can ship a library everywhere." (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Client-side LB surfaces a cold-start problem. Before: new pods joined the pool but sat behind long-lived connections, so ramp-up was implicit. After: new pods start taking traffic immediately and crash / error before warming up. Fix = slow-start ramp-up + biasing traffic away from pods with high observed error rates + a dedicated warmup framework. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Zone-affinity needs a spillover policy, or an under-capacity zone breaks it. The router prefers in-zone pods (saving cross-AZ latency and data-transfer cost) but spills to other zones when local capacity is insufficient or unhealthy. Without the spillover leg, zone-affinity degrades to a local brownout. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Client-library integration gives broad coverage, not complete coverage. Non-Scala services and infrastructure LB paths (ingress-only traffic, non-Armeria clients) are outside the scope of client-side LB — an acknowledged gap, and the polyglot argument for sidecar meshes.
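The P2C decision described in the takeaways above reduces to a few lines. A minimal sketch — function and parameter names are illustrative, not Databricks' actual Armeria-library API:

```python
import random

def pick_backend(endpoints, inflight, rng=random):
    """Power of Two Choices: sample two distinct backends at random,
    then route to the one with fewer in-flight requests.

    `endpoints` is a list of backend identifiers; `inflight` maps each
    backend to its current in-flight request count (a client-side
    observed signal, per the post's move away from external metrics).
    """
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if inflight[a] <= inflight[b] else b
```

Because only two backends are compared per request, the client never needs a global view of load, yet repeated sampling drives the distribution toward uniform — the simplicity/performance balance the post cites.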
Architecture¶
Control plane (EDS):
- Watches Kubernetes API for Services and EndpointSlices.
- Maintains a live view of backend pods, with metadata: zone, readiness, shard labels.
- Serves xDS to:
1. Armeria RPC clients in Scala services (internal traffic).
2. Envoy ingress gateways via EDS / ClusterLoadAssignment (external traffic).
Data plane:
- No sidecar. Each service process embeds the Armeria client library.
- Client subscribes to EDS on connection setup for the services it depends on.
- Client maintains a dynamic healthy-endpoint list + metadata; updates on control-plane push.
- Bypasses CoreDNS and kube-proxy entirely.
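The client-maintained endpoint view can be sketched as a state-of-the-world subscriber; all names and fields here are hypothetical stand-ins for the Armeria/EDS integration, which the post does not detail at code level:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    address: str   # pod IP:port
    zone: str      # availability-zone label from EndpointSlice metadata
    healthy: bool  # readiness as reported by the control plane

class EndpointWatcher:
    """Client-side view of one service's backends, replaced wholesale on
    each control-plane push (EDS-style state-of-the-world update)."""

    def __init__(self):
        self._endpoints = []

    def on_push(self, endpoints):
        # EDS delivers the full assignment; keep only ready endpoints,
        # so routing never consults DNS or kube-proxy.
        self._endpoints = [e for e in endpoints if e.healthy]

    def healthy(self, zone=None):
        if zone is None:
            return list(self._endpoints)
        return [e for e in self._endpoints if e.zone == zone]
```

The full-replacement push model is what bounds staleness by the EDS stream rather than by DNS TTLs, as noted in the caveats.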
Request routing:
- Per-request (not per-connection) decision.
- Default: Power of Two Choices (P2C) — pick 2 backends at random, route to the one with fewer in-flight requests / lower load.
- Zone-affinity mode: prefer same-zone backends, spill to remote zones on local capacity/health pressure.
- Pluggable strategies per service.
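Combining zone affinity with P2C spillover might look like the following sketch. `min_local` is a stand-in for whatever capacity/health threshold triggers cross-zone spillover — the post does not specify the actual condition:

```python
import random

def route(endpoints, inflight, local_zone, min_local=1, rng=random):
    """Prefer same-zone backends; spill to the full pool when the local
    zone lacks capacity. Within the chosen pool, apply P2C on in-flight
    counts. `endpoints` are (address, zone) pairs; `inflight` maps each
    endpoint to its in-flight request count."""
    local = [e for e in endpoints if e[1] == local_zone]
    # Hypothetical spillover condition: route cross-zone when fewer than
    # min_local + 1 healthy in-zone backends remain.
    pool = local if len(local) > min_local else endpoints
    if len(pool) == 1:
        return pool[0]
    a, b = rng.sample(pool, 2)
    return a if inflight[a] <= inflight[b] else b
```

Without the spillover leg (`pool = endpoints` fallback), an under-capacity zone would keep absorbing all local traffic — the "local brownout" failure mode called out in the key takeaways.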
Observed impact:
- Uniform server-side QPS across backend pods (from visibly skewed → flat).
- P90 latency stabilized; previously observed long-tail behavior on gRPC workloads reduced.
- ~20% pod-count reduction across several services (fewer pods needed because load is actually balanced).
Operational numbers¶
- Scale: "hundreds of stateless services" per Kubernetes cluster, communicating over gRPC.
- Languages: "predominantly Scala" monorepo.
- Capacity win: ~20% reduction in pod count across several services after rollout.
- Fleet: "thousands of Kubernetes clusters across multiple regions" (future cross-cluster direction).
Systems / concepts / patterns extracted¶
- systems/databricks-endpoint-discovery-service — Databricks' custom xDS control plane watching Kubernetes services/EndpointSlices, feeding both Armeria clients and Envoy gateways.
- systems/envoy — open-source L7 proxy; EDS client for ingress-layer routing at Databricks; also the data-plane proxy in Istio/sidecar-mesh comparisons.
- systems/kubernetes — underlying orchestrator; the default LB model (CoreDNS + kube-proxy + ClusterIP) is what this post replaces.
- systems/kube-proxy — Kubernetes' default L4 service LB (iptables/IPVS/eBPF); insufficient for long-lived HTTP/2.
- systems/coredns — Kubernetes cluster DNS; removed from the critical path in Databricks' design.
- systems/grpc — HTTP/2-over-TCP RPC; the persistent-connection workload that breaks L4 LB.
- systems/armeria — Databricks' shared Scala RPC framework; host of the embedded client-side LB library.
- systems/istio — sidecar-based service mesh; explicitly considered and rejected for this problem.
- concepts/client-side-load-balancing — general pattern of pushing LB decisions into the caller rather than a proxy or kernel hop.
- concepts/xds-protocol — Envoy's dynamic-config API; used here beyond sidecars as a general service-discovery contract.
- concepts/layer-7-load-balancing — per-request LB at the application protocol layer, vs. per-connection LB at L4.
- concepts/control-plane-data-plane-separation — EDS (decide) vs. client/Envoy (deliver).
- concepts/tail-latency-at-scale — the observed problem: fanned-out gRPC traffic inherits the worst hot-spotted pod's latency.
- patterns/power-of-two-choices — P2C load-balancing algorithm; Databricks' default.
- patterns/zone-affinity-routing — zone-local preference with capacity-driven spillover.
- patterns/proxyless-service-mesh — mesh capabilities (discovery, LB, health-aware routing) delivered via shared library, not sidecars.
- patterns/slow-start-ramp-up — throttle a new pod's share of traffic until it has warmed up.
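The slow-start pattern above can be sketched as a linear weight ramp applied to newly ready pods; the window and floor values are assumptions for illustration, not numbers from the post:

```python
def slow_start_weight(age_s, window_s=60.0, floor=0.1):
    """Traffic weight for a pod that became ready `age_s` seconds ago.

    Ramps linearly from `floor` to 1.0 over `window_s` seconds, so a
    cold pod takes a small share of requests while JITs warm, caches
    fill, and connections are established. Both parameters are
    hypothetical defaults.
    """
    if age_s >= window_s:
        return 1.0
    return max(floor, age_s / window_s)
```

In a client-side LB, this weight would scale a pod's probability of being sampled (e.g. in P2C candidate selection), restoring the implicit ramp-up that long-lived connections used to provide under kube-proxy.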
Caveats¶
- Language coverage. Client-library integration only helps callers written in a language with the library. Non-Scala paths, and any edge-to-service traffic that doesn't flow through Armeria, still rely on infrastructure LB. Explicit acknowledged gap.
- Staleness is bounded by EDS, not by DNS. Correctness of routing now depends on the EDS streaming pipeline being fresh; an EDS bug blasts all clients simultaneously (vs. DNS caching which at least provides a slow-decay fallback).
- P2C is the common-case result, not a universal claim. Databricks explicitly notes zone-aware and more advanced strategies required "careful tuning and deeper context" — dedicated follow-up post deferred.
- Metrics-based routing did not work in their environment — but the post doesn't claim that's universal. The failure mode cited is SLO/timing mismatch between monitoring and serving, which is specific to their stack.
- Not a cross-cluster design yet. Current scope is within a single Kubernetes cluster. Multi-region / multi-cluster EDS is an explicit future direction; flat L3 and/or service-mesh solutions are being evaluated.
- Tier 3 source (Databricks). Content is clearly architectural (infra design, production impact numbers, alternatives considered with reasons) so included despite tier.
Raw¶
- Raw: raw/databricks/2025-10-01-intelligent-kubernetes-load-balancing-at-databricks-1f0bebbb.md
- URL: https://www.databricks.com/blog/intelligent-kubernetes-load-balancing-databricks
- HN: 130 points / https://news.ycombinator.com/item?id=45434417