CONCEPT
kube-proxy iptables probabilistic distribution¶
Definition¶
When kube-proxy runs in its iptables mode (the historical default), it distributes traffic across Service backends using probabilistic jump rules in the NAT chain, not a strict round-robin scheduler; the rule fires on a connection's first packet and conntrack pins the chosen backend for the rest of the connection. Each backend pod gets an iptables rule with a `statistic --mode random --probability p` clause. For N backends, the k-th rule (1-indexed) uses p = 1/(N−k+1), so in expectation each pod receives an equal 1/N share, but the observed distribution is statistical, not deterministic.
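For a Service with four endpoints, the generated NAT rules look roughly like this (chain suffixes are illustrative placeholders; real kube-proxy uses hashed `KUBE-SVC-`/`KUBE-SEP-` names):

```
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-POD1
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-POD2
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-POD3
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD4
```

Rules are tested in order: the first matches 1/4 of connections, the second 1/3 of the remainder, the third 1/2 of what is left, and the last takes the rest, so each endpoint's expected share is exactly 1/4.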
With a small number of backends and bursty traffic, this produces visibly non-uniform load — some pods get more traffic than others, a divergence that does not average out quickly.
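A quick Monte Carlo makes the effect concrete. This is a sketch of the rule cascade, not kube-proxy's actual code; the probabilities mirror the rules kube-proxy installs for four endpoints:

```python
import random
from collections import Counter

def pick_backend(n, rng):
    """Walk a kube-proxy-style rule chain: rule k (0-indexed) matches
    with probability 1/(n - k); the final rule is an unconditional jump."""
    for k in range(n - 1):
        if rng.random() < 1.0 / (n - k):
            return k
    return n - 1

rng = random.Random(1)

# A short burst: 40 connections across 4 backends is often visibly lopsided.
burst = Counter(pick_backend(4, rng) for _ in range(40))
print("burst of 40:", dict(sorted(burst.items())))

# A long run: 100,000 connections converges to ~25% per backend.
long_run = Counter(pick_backend(4, rng) for _ in range(100_000))
print("long-run shares:",
      {k: round(v / 100_000, 3) for k, v in sorted(long_run.items())})
```

The short burst typically deviates well away from the ideal 10/10/10/10 split, which is exactly the regime long-lived connection-pool traffic can sit in.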
Evidence in the wild¶
Observed by Zalando on their PgBouncer-on-Kubernetes test cluster with four pods behind a Service:
```
NAME                         CPU(cores)   MEMORY(bytes)
pool-test-7d8bfbc47f-6bbhr   977m         5Mi
pool-test-7d8bfbc47f-8jtnp   995m         6Mi
pool-test-7d8bfbc47f-ghvpn   585m         6Mi   ← ~59% of peers
pool-test-7d8bfbc47f-s945p   993m         6Mi
```
Three pods burn ~1 CPU core apiece; the fourth runs at ~59% of that, a roughly 1.7× underutilisation relative to its peers (585m vs. a peer average of ~988m). This skew comes purely from the iptables probability draw, not from any workload or scheduling difference.
Mitigations¶
- Switch kube-proxy to IPVS mode, which uses the Linux IPVS module with true scheduling algorithms (round-robin, least-connection, weighted round-robin).
- Switch kube-proxy to nftables mode (newer, similar semantics).
- Bypass kube-proxy entirely with an eBPF-based CNI (Cilium, Calico with eBPF) — these implement Service load balancing at lower cost and with more deterministic behaviour.
- Scale horizontally beyond 4 pods — with more pods, the relative deviation from the mean share shrinks (law of large numbers) and outliers matter less.
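The first two mitigations amount to a one-field change in the kube-proxy configuration (`KubeProxyConfiguration`, `kubeproxy.config.k8s.io/v1alpha1`); the scheduler choice shown is an assumption — pick whichever IPVS algorithm fits the workload:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"       # or "nftables" on recent Kubernetes releases
ipvs:
  scheduler: "rr"  # round-robin; alternatives include "lc" (least-connection), "wrr"
```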
Why it matters¶
For latency-insensitive stateless workloads the skew is usually tolerable. For CPU-intensive, latency-sensitive services (connection poolers, media encoders, network appliances) that are sized to specific per-pod CPU budgets, a silent ~1.7× imbalance wastes the underloaded pod's capacity while pushing its peers closer to saturation — the opposite of what per-pod sizing and autoscaling promise.
Seen in¶
- sources/2020-06-23-zalando-pgbouncer-on-kubernetes-minimal-latency — canonical first-person account; Kukushkin surfaces the effect while scaling PgBouncer pods. The 977/995/585/993 m datapoint is the reference illustration.