Intelligent Kubernetes Load Balancing at Databricks¶
Summary¶
Databricks replaced Kubernetes' default L4 kube-proxy load balancing with an in-house proxyless, client-side L7 load-balancing system backed by a custom xDS control plane (Endpoint Discovery Service / EDS). The core problem: gRPC over HTTP/2 uses long-lived connections, so kube-proxy's per-connection pod selection produces traffic skew, high tail latency, and over-provisioning. The Databricks RPC client (built into a shared Scala/Armeria framework) subscribes to EDS, maintains a live map of healthy endpoints with zone/shard metadata, and does Power of Two Choices load-balancing per request — with zone-affinity routing and automatic spillover on top. The same EDS speaks xDS to Envoy for ingress, so internal and external traffic share one source of truth. Result: uniform QPS, stabilized P90 latency, ~20% pod-count reduction across services. Istio / sidecar meshes and Kubernetes headless services were explicitly evaluated and rejected.
Key takeaways¶
- Layer-4 load-balancing + long-lived HTTP/2 connections = traffic skew. kube-proxy picks a pod once per TCP connection; gRPC reuses that connection, so a few pods hot-spot while others idle. The fix has to live at Layer 7, per-request, not at the kernel's IPVS/eBPF tables. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- DNS is the wrong control plane for dynamic service topology. CoreDNS caches and has no channel for endpoint metadata (zone, shard, readiness); headless services leak pod IPs but still can't carry weights or topology labels. Databricks removed DNS from the critical path and substituted a dedicated xDS control plane. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- xDS is the integration contract, not just a sidecar-mesh thing. The same EDS that streams endpoint updates to Armeria RPC clients also programs ClusterLoadAssignment on Envoy ingress gateways — so internal clients and the public edge route off one source of truth. (systems/envoy / concepts/xds-protocol). (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Power of Two Choices is enough for most services. The team explicitly says P2C — pick two random backends, send to the one with fewer active connections / lower load — "strikes a strong balance between performance and implementation simplicity, consistently leading to uniform traffic distribution." More sophisticated algorithms (CPU-skewed, metrics-based) were tried and rolled back. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Metrics-based routing is a trap. CPU/resource-utilization signals from monitoring have different SLOs than request-serving, and CPU is a trailing indicator. Databricks abandoned metric-driven load balancing and now routes off direct health signals observed by the client itself. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Proxyless > sidecar mesh, when you have one language. Istio / Ambient Mesh were rejected because (a) sidecar ops cost scales per-pod at Databricks' scale, (b) proxies add CPU/memory/latency, (c) request-aware strategies are hard through an external proxy. Key enabler: Databricks is "predominantly Scala" on a monorepo + fast CI, so a shared client library ships once. "Sidecar meshes win when you're polyglot; proxyless wins when you're mono-lingual and can ship a library everywhere." (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Client-side LB surfaces a cold-start problem. Before: new pods joined the pool but sat behind long-lived connections, so ramp-up was implicit. After: new pods start taking traffic immediately and crash / error before warming up. Fix = slow-start ramp-up + biasing traffic away from pods with high observed error rates + a dedicated warmup framework. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Zone-affinity needs a spillover policy, or an under-capacity zone breaks it. The router prefers in-zone pods (saving cross-AZ latency and data-transfer cost) but spills to other zones when local capacity is insufficient or unhealthy. Without the spillover leg, zone-affinity degrades to a local brownout. (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)
- Client-library integration gives broad coverage, not complete coverage. Non-Scala services and infrastructure LB paths (ingress-only traffic, non-Armeria clients) are outside the scope of client-side LB — an acknowledged gap, and the polyglot argument for sidecar meshes.
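The P2C decision described in the takeaways above reduces to a few lines. A minimal sketch — function and parameter names are illustrative, not Databricks' actual Armeria-library API:

```python
import random

def pick_backend(endpoints, inflight, rng=random):
    """Power of Two Choices: sample two distinct backends at random,
    then route to the one with fewer in-flight requests.

    `endpoints` is a list of backend identifiers; `inflight` maps each
    backend to its current in-flight request count (a client-side
    observed signal, per the post's move away from external metrics).
    """
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if inflight[a] <= inflight[b] else b
```

Because only two backends are compared per request, the client never needs a global view of load, yet repeated sampling drives the distribution toward uniform — the simplicity/performance balance the post cites.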
Architecture¶
Control plane (EDS):
- Watches Kubernetes API for Services and EndpointSlices.
- Maintains a live view of backend pods, with metadata: zone, readiness, shard labels.
- Serves xDS to:
1. Armeria RPC clients in Scala services (internal traffic).
2. Envoy ingress gateways via EDS / ClusterLoadAssignment (external traffic).
Data plane:
- No sidecar. Each service process embeds the Armeria client library.
- Client subscribes to EDS on connection setup for the services it depends on.
- Client maintains a dynamic healthy-endpoint list + metadata; updates on control-plane push.
- Bypasses CoreDNS and kube-proxy entirely.
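The client-maintained endpoint view can be sketched as a state-of-the-world subscriber; all names and fields here are hypothetical stand-ins for the Armeria/EDS integration, which the post does not detail at code level:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    address: str   # pod IP:port
    zone: str      # availability-zone label from EndpointSlice metadata
    healthy: bool  # readiness as reported by the control plane

class EndpointWatcher:
    """Client-side view of one service's backends, replaced wholesale on
    each control-plane push (EDS-style state-of-the-world update)."""

    def __init__(self):
        self._endpoints = []

    def on_push(self, endpoints):
        # EDS delivers the full assignment; keep only ready endpoints,
        # so routing never consults DNS or kube-proxy.
        self._endpoints = [e for e in endpoints if e.healthy]

    def healthy(self, zone=None):
        if zone is None:
            return list(self._endpoints)
        return [e for e in self._endpoints if e.zone == zone]
```

The full-replacement push model is what bounds staleness by the EDS stream rather than by DNS TTLs, as noted in the caveats.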
Request routing:
- Per-request (not per-connection) decision.
- Default: Power of Two Choices (P2C) — pick 2 backends at random, route to the one with fewer in-flight requests / lower load.
- Zone-affinity mode: prefer same-zone backends, spill to remote zones on local capacity/health pressure.
- Pluggable strategies per service.
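Combining zone affinity with P2C spillover might look like the following sketch. `min_local` is a stand-in for whatever capacity/health threshold triggers cross-zone spillover — the post does not specify the actual condition:

```python
import random

def route(endpoints, inflight, local_zone, min_local=1, rng=random):
    """Prefer same-zone backends; spill to the full pool when the local
    zone lacks capacity. Within the chosen pool, apply P2C on in-flight
    counts. `endpoints` are (address, zone) pairs; `inflight` maps each
    endpoint to its in-flight request count."""
    local = [e for e in endpoints if e[1] == local_zone]
    # Hypothetical spillover condition: route cross-zone when fewer than
    # min_local + 1 healthy in-zone backends remain.
    pool = local if len(local) > min_local else endpoints
    if len(pool) == 1:
        return pool[0]
    a, b = rng.sample(pool, 2)
    return a if inflight[a] <= inflight[b] else b
```

Without the spillover leg (`pool = endpoints` fallback), an under-capacity zone would keep absorbing all local traffic — the "local brownout" failure mode called out in the key takeaways.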
Observed impact:
- Uniform server-side QPS across backend pods (from visibly skewed → flat).
- P90 latency stabilized; previously observed long-tail behavior on gRPC workloads reduced.
- ~20% pod-count reduction across several services (fewer pods needed because load is actually balanced).
Operational numbers¶
- Scale: "hundreds of stateless services" per Kubernetes cluster, communicating over gRPC.
- Languages: "predominantly Scala" monorepo.
- Capacity win: ~20% reduction in pod count across several services after rollout.
- Fleet: "thousands of Kubernetes clusters across multiple regions" (future cross-cluster direction).
Systems / concepts / patterns extracted¶
- systems/databricks-endpoint-discovery-service — Databricks' custom xDS control plane watching Kubernetes services/EndpointSlices, feeding both Armeria clients and Envoy gateways.
- systems/envoy — open-source L7 proxy; EDS client for ingress-layer routing at Databricks; also the data-plane proxy in Istio/sidecar-mesh comparisons.
- systems/kubernetes — underlying orchestrator; the default LB model (CoreDNS + kube-proxy + ClusterIP) is what this post replaces.
- systems/kube-proxy — Kubernetes' default L4 service LB (iptables/IPVS/eBPF); insufficient for long-lived HTTP/2.
- systems/coredns — Kubernetes cluster DNS; removed from the critical path in Databricks' design.
- systems/grpc — HTTP/2-over-TCP RPC; the persistent-connection workload that breaks L4 LB.
- systems/armeria — Databricks' shared Scala RPC framework; host of the embedded client-side LB library.
- systems/istio — sidecar-based service mesh; explicitly considered and rejected for this problem.
- concepts/client-side-load-balancing — general pattern of pushing LB decisions into the caller rather than a proxy or kernel hop.
- concepts/xds-protocol — Envoy's dynamic-config API; used here beyond sidecars as a general service-discovery contract.
- concepts/layer-7-load-balancing — per-request LB at the application protocol layer, vs. per-connection LB at L4.
- concepts/control-plane-data-plane-separation — EDS (decide) vs. client/Envoy (deliver).
- concepts/tail-latency-at-scale — the observed problem: fanned-out gRPC traffic inherits the worst hot-spotted pod's latency.
- patterns/power-of-two-choices — P2C load-balancing algorithm; Databricks' default.
- patterns/zone-affinity-routing — zone-local preference with capacity-driven spillover.
- patterns/proxyless-service-mesh — mesh capabilities (discovery, LB, health-aware routing) delivered via shared library, not sidecars.
- patterns/slow-start-ramp-up — throttle a new pod's share of traffic until it has warmed up.
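The slow-start pattern above can be sketched as a linear weight ramp applied to newly ready pods; the window and floor values are assumptions for illustration, not numbers from the post:

```python
def slow_start_weight(age_s, window_s=60.0, floor=0.1):
    """Traffic weight for a pod that became ready `age_s` seconds ago.

    Ramps linearly from `floor` to 1.0 over `window_s` seconds, so a
    cold pod takes a small share of requests while JITs warm, caches
    fill, and connections are established. Both parameters are
    hypothetical defaults.
    """
    if age_s >= window_s:
        return 1.0
    return max(floor, age_s / window_s)
```

In a client-side LB, this weight would scale a pod's probability of being sampled (e.g. in P2C candidate selection), restoring the implicit ramp-up that long-lived connections used to provide under kube-proxy.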
Caveats¶
- Language coverage. Client-library integration only helps callers written in a language with the library. Non-Scala paths, and any edge-to-service traffic that doesn't flow through Armeria, still rely on infrastructure LB. Explicit acknowledged gap.
- Staleness is bounded by EDS, not by DNS. Correctness of routing now depends on the EDS streaming pipeline being fresh; an EDS bug blasts all clients simultaneously (vs. DNS caching which at least provides a slow-decay fallback).
- P2C is the common-case result, not a universal claim. Databricks explicitly notes zone-aware and more advanced strategies required "careful tuning and deeper context" — dedicated follow-up post deferred.
- Metrics-based routing did not work in their environment — but the post doesn't claim that's universal. The failure mode cited is SLO/timing mismatch between monitoring and serving, which is specific to their stack.
- Not a cross-cluster design yet. Current scope is within a single Kubernetes cluster. Multi-region / multi-cluster EDS is an explicit future direction; flat L3 and/or service-mesh solutions are being evaluated.
- Tier 3 source (Databricks). Content is clearly architectural (infra design, production impact numbers, alternatives considered with reasons) so included despite tier.
Raw¶
- Raw: raw/databricks/2025-10-01-intelligent-kubernetes-load-balancing-at-databricks-1f0bebbb.md
- URL: https://www.databricks.com/blog/intelligent-kubernetes-load-balancing-databricks
- HN: 130 points / https://news.ycombinator.com/item?id=45434417