
ZALANDO 2020-06-23


Zalando — PgBouncer on Kubernetes and how to achieve minimal latency

Summary

Alexander Kukushkin (Zalando, 2020-06-23) explores the latency implications of running PgBouncer on Kubernetes as part of the Zalando Postgres Operator 1.5 connection-pooling feature. Two surprising findings emerge. First, a Kubernetes Service in the default iptables kube-proxy mode distributes load non-uniformly — one of four PgBouncer pods received roughly half the traffic of its peers, because iptables picks a backend via per-connection probability rules rather than enforcing strict round-robin. Second, hyperthreading significantly inflates PgBouncer latency when two SO_REUSEPORT instances land on sibling hyperthreads of the same physical core: softirq NET_RX/NET_TX handler latencies climb, echoing the Linux kernel scaling documentation's warning that "for interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system." The operator-level conclusion: if you need tight latency, pin PgBouncer to a real core via the Kubernetes CPU Manager static policy; for everyone else, accept scatter across AZs as the price of availability. Zalando's operator ships the pragmatic default — a single pooler Deployment exposed via a Service, distributed across availability zones — with an escape hatch for single-AZ affinity when latency variance cannot be tolerated.

Key takeaways

  • Why connection poolers at all — Postgres uses a process-per-connection client/server model; too many connections cause processes to fight for CPU, with context switches and CPU migrations. Additionally, GetSnapshotData in the transaction system has O(connections) complexity, so the cost scales with the number of open connections regardless of activity. The pooler options are: inside the database (proposed patch, unmerged), as a separate component (PgBouncer, Pgpool-II, Odyssey, pgagroal), or on the application side. Zalando picked the separate-component path because the application-side option is out of the operator's control and an internal pooler is "a major feature one needs to develop yet."

  • PgBouncer selected over Pgpool-II, Odyssey, pgagroal — PgBouncer is "probably the most popular and the oldest"; Pgpool-II "can actually do much more than just connection pooling (e.g. it can do load balancing), but it means it's a bit more heavyweight"; Odyssey and pgagroal are newer and try to be more performance-optimized. "Current implementation allow us to switch to any other solutions if they conform to a basic common standard."

  • Kubernetes Service load distribution is probabilistic, not uniform — observed in production:

    NAME                         CPU(cores)   MEMORY(bytes)
    pool-test-7d8bfbc47f-6bbhr   977m         5Mi
    pool-test-7d8bfbc47f-8jtnp   995m         6Mi
    pool-test-7d8bfbc47f-ghvpn   585m         6Mi    ← half-loaded
    pool-test-7d8bfbc47f-s945p   993m         6Mi
    
    "This could happen if kube-proxy works in iptables mode and calculates probabilities to land on a pod instead of strict round-robin." See concepts/kube-proxy-iptables-probability.
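The rule structure kube-proxy emits can be sketched in Python — rule i in the chain matches with probability 1/(n−i), so each backend is equally likely *per new connection*, but with a handful of long-lived pooler connections the realized split is lumpy (the connection counts below are made up for illustration):

```python
import random

def pick_backend(n_backends, rng):
    """Mimic kube-proxy's iptables statistic rules: rule i fires with
    probability 1/(n-i); overall each backend gets probability 1/n."""
    for i in range(n_backends - 1):
        if rng.random() < 1.0 / (n_backends - i):
            return i
    return n_backends - 1

rng = random.Random(7)
counts = [0] * 4
# a modest number of long-lived pooler connections, as in the article
for _ in range(40):
    counts[pick_backend(4, rng)] += 1
print(counts)  # expectation is [10, 10, 10, 10]; a realized draw is uneven
```

Uniform in expectation is not uniform in a single draw — which is exactly the 977m/995m/585m/993m picture above.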

  • Benchmark methodology: network namespace + veth + netem — Kukushkin builds a reproducible low-noise harness on his laptop:

    ip link add veth0 type veth peer name veth1
    ip netns add db
    ip link set veth1 netns db
    ip addr add 10.0.0.10/24 dev veth0
    ip netns exec db ip addr add 10.0.0.1/24 dev veth1
    tc qdisc add dev veth0 root netem delay 1ms 0.1ms distribution normal
    
    Adds a 1 ms ± 0.1 ms normally distributed delay to approximate observed Kubernetes cluster latency. pgbench runs a trivial ; query (the smallest valid SQL) — "the idea is to not load the database itself too much and see how PgBouncer instance will handle many connections." 1000 connections are dispatched via 8 threads. CPUs are isolated via cpuset, Intel turbo boost is disabled, and the performance governor is set. See concepts/network-namespace-benchmarking.
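The summary does not spell out the pgbench flags; a plausible invocation matching the description (connection and thread counts from the article; the duration and progress interval are assumptions):

```shell
# the smallest valid SQL statement as the workload
echo ';' > empty.sql
# 1000 client connections over 8 threads, against the pooler in the netns
pgbench -h 10.0.0.10 -p 6432 -n -f empty.sql -c 1000 -j 8 -T 60 -P 5
```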

  • Three-way CPU-placement experiment on a 2-physical-core / 2-HT-per-core laptop:

    1. One PgBouncer on an isolated real core — lowest latency.
    2. Two PgBouncers on isolated hyperthreads of the same physical core — latency "almost two times higher (with somewhat minimal increase in throughput)".
    3. Two PgBouncers on isolated separate real cores (with potential noise from other components on the other HT) — latency "somewhere in between (with the throughput best of the three)".
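Outside Kubernetes, the three placements can be reproduced by hand; a sketch, assuming logical CPUs 2 and 3 are siblings (pairings vary by machine — the article isolates CPUs via cpuset, taskset is shown here as a simpler stand-in):

```shell
# discover which logical CPUs share a physical core (Linux sysfs)
cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list   # e.g. "2,3"

# scenario 2: two PgBouncers pinned to sibling hyperthreads of one core
taskset -c 2 pgbouncer pgbouncer.ini &
taskset -c 3 pgbouncer pgbouncer.ini &
```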

  • Root cause: softirq NET_RX / NET_TX latencies degrade on shared hyperthreads. Kukushkin probes the irq:softirq_entry / irq:softirq_exit tracepoints with Brendan Gregg's perf script:

    perf record -e irq:softirq_entry,irq:softirq_exit \
        -a -C 2 --filter 'vec == 2 || vec == 3'
    
    The 99th-percentile softirq latency for vec == 2 / 3 (NET_RX / NET_TX) is higher when both PgBouncers share a physical core. Kernel docs confirm the mechanism: "For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system." See concepts/hyperthread-softirq-contention.
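The recorded tracepoints still need pairing into per-softirq latencies; a post-processing sketch over `perf script`-style text (the sample lines below are synthetic, and the real field layout varies slightly by perf version):

```python
import re
from collections import defaultdict

# synthetic `perf script` output: one NET_RX (vec=3) and one NET_TX (vec=2)
SAMPLE = """\
 swapper     0 [002]  100.000010: irq:softirq_entry: vec=3 [action=NET_RX]
 swapper     0 [002]  100.000055: irq:softirq_exit: vec=3 [action=NET_RX]
 swapper     0 [002]  100.001200: irq:softirq_entry: vec=2 [action=NET_TX]
 swapper     0 [002]  100.001230: irq:softirq_exit: vec=2 [action=NET_TX]
"""

EVENT = re.compile(r'\[(\d+)\]\s+([\d.]+): irq:softirq_(entry|exit): vec=(\d+)')

def softirq_latencies_us(lines):
    open_entry = {}          # (cpu, vec) -> entry timestamp (seconds)
    lat = defaultdict(list)  # vec -> latencies in microseconds
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        cpu, ts, kind, vec = m.group(1), float(m.group(2)), m.group(3), int(m.group(4))
        if kind == 'entry':
            open_entry[(cpu, vec)] = ts
        elif (cpu, vec) in open_entry:
            lat[vec].append((ts - open_entry.pop((cpu, vec))) * 1e6)
    return lat

lat = softirq_latencies_us(SAMPLE.splitlines())
print({v: [round(x, 1) for x in xs] for v, xs in lat.items()})
```

With enough samples per vec, a 99th percentile over each list gives the figure the article compares across CPU placements.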

  • Operator-level mitigation: CPU Manager static policy — "it could be beneficial to configure CPU manager in the cluster, so that this would not be an issue." The Kubernetes CPU Manager static policy allows exclusive-CPU pinning via cpuset, preventing a pooler pod from landing on a sibling hyperthread of an already-busy core. See concepts/cpu-manager-static-policy.
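A minimal sketch of the two pieces involved (the field names are Kubernetes' own; the resource values are illustrative):

```yaml
# kubelet configuration: switch the CPU Manager to the static policy
cpuManagerPolicy: static
---
# the pooler container must sit in the Guaranteed QoS class with an
# integer CPU request to be granted exclusive cores
resources:
  requests:
    cpu: "1"
    memory: "100Mi"
  limits:
    cpu: "1"
    memory: "100Mi"
```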

  • SO_REUSEPORT as the PgBouncer scaling primitive — the experiment ran two PgBouncer instances bound with SO_REUSEPORT, "essentially a way to get PgBouncer to use more CPU cores." Multiple processes bind to the same port; the kernel distributes accepts among them. See concepts/so-reuseport-pgbouncer-scaling.
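The kernel-level mechanism can be demonstrated in a few lines of Python on Linux (port 16432 is an arbitrary choice here; PgBouncer does the equivalent in C):

```python
import socket

def reuseport_listener(port):
    # each PgBouncer instance binds its own listening socket to the same
    # port with SO_REUSEPORT set; the kernel then spreads incoming
    # connections across all such listeners
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

a = reuseport_listener(16432)
b = reuseport_listener(16432)  # would raise EADDRINUSE without SO_REUSEPORT
```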

  • Zalando Operator's deployment shape (the pragmatic default):

    • Single connection-pooler Deployment per Postgres cluster, exposed via a new Service.
    • Pooler pods distributed across availability zones.
    • Pooler pods are CPU-intensive with low memory (<100 MB simple case) — "it makes sense to create as many as needed to prevent resource saturation."
    • Trade-off acknowledged: pods scattered across nodes / AZs → latency variability.
    • Escape hatch for latency-sensitive workloads: manually create a single "big" pooler instance with affinity to the same node as the database, configure CPU Manager, use a secondary smaller pooler for HA. See patterns/big-pooler-affinity-plus-small-pooler-ha.
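The escape hatch might look roughly like this as a Deployment fragment (the spilo labels match Zalando's Postgres pods, but treat the exact selector as an assumption):

```yaml
# schedule the "big" pooler onto the node that runs the Postgres master
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            application: spilo
            spilo-role: master
        topologyKey: kubernetes.io/hostname
```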

Systems extracted

  • PgBouncer — the selected pooler; canonical behavior probed on Kubernetes here.
  • PostgreSQL — the process-per-connection server whose connection cost motivates pooling.
  • Kubernetes — the substrate; Service load-balancing and CPU Manager feature here.
  • kube-proxy — iptables-mode load distribution is non-uniform.
  • Zalando Postgres Operator — the Kubernetes operator that owns the pooler deployment topology.
  • Pgpool-II — alternative pooler, rejected as too heavyweight.
  • Odyssey — Yandex's newer pooler, noted.
  • pgagroal — another newer pooler, noted.

Concepts extracted

  • kube-proxy-iptables-probability
  • network-namespace-benchmarking
  • hyperthread-softirq-contention
  • cpu-manager-static-policy
  • so-reuseport-pgbouncer-scaling

Patterns extracted

  • big-pooler-affinity-plus-small-pooler-ha

Operational numbers

  • 1000 client connections dispatched via 8 threads in the pgbench load.
  • 1 ms ± 0.1 ms normal-distributed netem delay on veth0 to approximate Kubernetes network latency.
  • 2 physical cores × 2 hyperthreads laptop benchmark topology.
  • <100 MB memory per PgBouncer pod in simple cases — "CPU intensive work with minimal amount of memory."
  • Pod CPU observation (iptables non-uniformity): 977 m / 995 m / 585 m / 993 m cores across four pods.
  • Latency ratio: two HT-colocated poolers ≈ 2× the latency of one pooler on an isolated physical core.

Caveats

  • Benchmark is on a laptop with 2 physical cores, not production-scale hardware. The shape of the HT-softirq effect is what generalises; the magnitudes depend on kernel version, NIC driver, IRQ affinity, and actual network card (a veth pair with netem is not a real NIC).
  • pgbench ; (empty-statement load) is deliberately database-light — this isolates the pooler as the bottleneck but doesn't measure pooler behavior under realistic query load. For heterogeneous workloads the author suggests oltpbench or benchmarksql.
  • Conclusions are about Zalando's operator opinionation; other Postgres operators (CrunchyData, StackGres, CloudNativePG) make different topology choices.
  • 2020 article; kube-proxy has since grown IPVS mode (more evenly balanced than iptables) and Cilium/eBPF-based replacements change the probabilistic-load-balancing picture. Claims here should be dated to the iptables-era default.

Source

  • Alexander Kukushkin, "PgBouncer on Kubernetes and how to achieve minimal latency", Zalando Engineering Blog, 2020-06-23.