CONCEPT
kube-proxy iptables probabilistic distribution¶
Definition¶
When kube-proxy runs in its iptables mode (the historical default), it distributes traffic across Service backends using probabilistic jump rules in the NAT chain, not a strict round-robin scheduler; the rule fires on a connection's first packet and conntrack pins the chosen backend for the rest of the connection. Each backend pod gets an iptables rule with a `statistic --mode random --probability p` clause. For N backends, the k-th rule (1-indexed) uses p = 1/(N−k+1), so in expectation each pod receives an equal 1/N share, but the observed distribution is statistical, not deterministic.
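For a Service with four endpoints, the generated NAT rules look roughly like this (chain suffixes are illustrative placeholders; real kube-proxy uses hashed `KUBE-SVC-`/`KUBE-SEP-` names):

```
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-POD1
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-POD2
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-POD3
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD4
```

Rules are tested in order: the first matches 1/4 of connections, the second 1/3 of the remainder, the third 1/2 of what is left, and the last takes the rest, so each endpoint's expected share is exactly 1/4.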
With a small number of backends and bursty traffic, this produces visibly non-uniform load — some pods get more traffic than others, a divergence that does not average out quickly.
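A quick Monte Carlo makes the effect concrete. This is a sketch of the rule cascade, not kube-proxy's actual code; the probabilities mirror the rules kube-proxy installs for four endpoints:

```python
import random
from collections import Counter

def pick_backend(n, rng):
    """Walk a kube-proxy-style rule chain: rule k (0-indexed) matches
    with probability 1/(n - k); the final rule is an unconditional jump."""
    for k in range(n - 1):
        if rng.random() < 1.0 / (n - k):
            return k
    return n - 1

rng = random.Random(1)

# A short burst: 40 connections across 4 backends is often visibly lopsided.
burst = Counter(pick_backend(4, rng) for _ in range(40))
print("burst of 40:", dict(sorted(burst.items())))

# A long run: 100,000 connections converges to ~25% per backend.
long_run = Counter(pick_backend(4, rng) for _ in range(100_000))
print("long-run shares:",
      {k: round(v / 100_000, 3) for k, v in sorted(long_run.items())})
```

The short burst typically deviates well away from the ideal 10/10/10/10 split, which is exactly the regime long-lived connection-pool traffic can sit in.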
Evidence in the wild¶
Observed by Zalando on their PgBouncer-on-Kubernetes test cluster with four pods behind a Service:
```
NAME                         CPU(cores)   MEMORY(bytes)
pool-test-7d8bfbc47f-6bbhr   977m         5Mi
pool-test-7d8bfbc47f-8jtnp   995m         6Mi
pool-test-7d8bfbc47f-ghvpn   585m         6Mi   ← ~59% of peers
pool-test-7d8bfbc47f-s945p   993m         6Mi
```
Three pods burn ~1 CPU core apiece; the fourth runs at ~59% of that, a roughly 1.7× underutilisation relative to its peers (585m vs. a peer average of ~988m). This skew comes purely from the iptables probability draw, not from any workload or scheduling difference.
Mitigations¶
- Switch kube-proxy to IPVS mode, which uses the Linux IPVS module with true scheduling algorithms (round-robin, least-connection, weighted round-robin).
- Switch kube-proxy to nftables mode (newer, similar semantics).
- Bypass kube-proxy entirely with an eBPF-based CNI (Cilium, Calico with eBPF) — these implement Service load balancing at lower cost and with more deterministic behaviour.
- Scale horizontally beyond 4 pods — with more pods, the relative deviation from the mean share shrinks (law of large numbers) and outliers matter less.
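The first two mitigations amount to a one-field change in the kube-proxy configuration (`KubeProxyConfiguration`, `kubeproxy.config.k8s.io/v1alpha1`); the scheduler choice shown is an assumption — pick whichever IPVS algorithm fits the workload:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"       # or "nftables" on recent Kubernetes releases
ipvs:
  scheduler: "rr"  # round-robin; alternatives include "lc" (least-connection), "wrr"
```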
Why it matters¶
For latency-insensitive stateless workloads the skew is usually tolerable. For CPU-intensive, latency-sensitive services (connection poolers, media encoders, network appliances) that are sized to specific per-pod CPU budgets, a silent ~1.7× imbalance wastes the underloaded pod's capacity while pushing its peers closer to saturation — the opposite of what per-pod sizing and autoscaling promise.
Seen in¶
- sources/2020-06-23-zalando-pgbouncer-on-kubernetes-minimal-latency — canonical first-person account; Kukushkin surfaces the effect while scaling PgBouncer pods. The 977/995/585/993 m datapoint is the reference illustration.