Zalando — PgBouncer on Kubernetes and how to achieve minimal latency¶
Summary¶
Alexander Kukushkin (Zalando, 2020-06-23) explores the latency
implications of running PgBouncer on
Kubernetes as part of the Zalando
Postgres Operator 1.5
connection-pooling feature. Two surprising findings emerge. First,
Kubernetes Service in default iptables
kube-proxy mode
distributes load non-uniformly — one out of four PgBouncer pods
received roughly half the traffic of its peers because iptables
picks a backend with fixed per-connection probabilities rather
than enforcing strict round-robin. Second, hyperthreading
significantly inflates PgBouncer latency when two so_reuseport
instances land on sibling hyperthreads of the same physical core:
softirq NET_RX/NET_TX handler latencies climb, echoing the Linux
kernel scaling doc's warning that "for interrupt handling, HT
has shown no benefit in initial tests, so limit the number of
queues to the number of CPU cores in the system." The
operator-level conclusion: if you need tight latency, pin
PgBouncer to a real core via the Kubernetes
CPU Manager static policy;
for everyone else, accept scatter across AZs as the price of
availability. Zalando's Operator ships the pragmatic default —
single pooler Deployment exposed via a Service, distributed
across availability zones — with an escape hatch for
single-AZ affinity when latency variance cannot be tolerated.
Key takeaways¶
- Why connection poolers at all — Postgres uses a process-per-connection client/server model; too many connections cause CPU contention, context switches, and CPU migrations. Additionally, `GetSnapshotData` in the transaction system has O(connections) complexity, so the cost scales with the number of open connections regardless of activity. The pooler options are: inside the database (proposed patch, unmerged), as a separate component (PgBouncer, Pgpool-II, Odyssey, pgagroal), or on the application side. Zalando picked the separate-component path because the application-side option is out of the operator's control and an internal pooler is "a major feature one needs to develop yet."
- PgBouncer selected over Pgpool-II, Odyssey, pgagroal — PgBouncer is "probably the most popular and the oldest"; Pgpool-II "can actually do much more than just connection pooling (e.g. it can do load balancing), but it means it's a bit more heavyweight"; Odyssey and pgagroal are newer and try to be more performance-optimized. "Current implementation allow us to switch to any other solutions if they conform to a basic common standard."
- Kubernetes Service load distribution is probabilistic, not uniform — observed in production: "This could happen if kube-proxy works in iptables mode and calculates probabilities to land on a pod instead of strict round-robin." See concepts/kube-proxy-iptables-probability.

  ```
  NAME                         CPU(cores)   MEMORY(bytes)
  pool-test-7d8bfbc47f-6bbhr   977m         5Mi
  pool-test-7d8bfbc47f-8jtnp   995m         6Mi
  pool-test-7d8bfbc47f-ghvpn   585m         6Mi   ← half-loaded
  pool-test-7d8bfbc47f-s945p   993m         6Mi
  ```

- Benchmark methodology: network namespace + veth + netem — Kukushkin builds a reproducible low-noise harness on his laptop, adding a 1 ms ± 0.1 ms delay to approximate observed Kubernetes cluster latency:

  ```shell
  ip link add veth0 type veth peer name veth1
  ip netns add db
  ip link set veth1 netns db
  ip addr add 10.0.0.10/24 dev veth0
  ip netns exec db ip addr add 10.0.0.1/24 dev veth1
  tc qdisc add dev veth0 root netem delay 1ms 0.1ms distribution normal
  ```

  Load comes from `pgbench` with a trivial `;` query (the smallest valid SQL) — "the idea is to not load the database itself too much and see how PgBouncer instance will handle many connections" — 1000 connections dispatched via 8 threads. CPUs isolated via `cpuset`, Intel turbo disabled, `performance` governor set. See concepts/network-namespace-benchmarking.
- Three-way CPU-placement experiment on a 2-physical-core / 2-HT-per-core laptop:
- One PgBouncer on an isolated real core — lowest latency.
- Two PgBouncers on isolated hyperthreads of the same physical core — latency "almost two times higher (with somewhat minimal increase in throughput)".
- Two PgBouncers on isolated separate real cores (with potential noise from other components on the other HT) — latency "somewhere in between (with the throughput best of the three)".
- Root cause: softirq NET_RX / NET_TX latencies degrade on shared hyperthreads — Kukushkin probes the `irq:softirq_entry` / `irq:softirq_exit` tracepoints with Brendan Gregg's perf script. The 99th-percentile softirq latency for `vec == 2 / 3` (NET_TX / NET_RX) is higher when both PgBouncers share a physical core. Kernel docs confirm the mechanism: "For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system." See concepts/hyperthread-softirq-contention.
- Operator-level mitigation: CPU Manager static policy — "it could be beneficial to configure CPU manager in the cluster, so that this would not be an issue." The Kubernetes CPU Manager static policy allows exclusive-CPU pinning via `cpuset`, preventing a pool pod from landing on a sibling hyperthread of an already-busy core. See concepts/cpu-manager-static-policy.
- `so_reuseport` as the PgBouncer scaling primitive — the experiment ran two PgBouncer instances with `so_reuseport`, "essentially a way to get PgBouncer to use more CPU cores." Multiple processes bind to the same port; the kernel distributes accepted connections among them. See concepts/so-reuseport-pgbouncer-scaling.
- Zalando Operator's deployment shape (the pragmatic default):
- Single connection-pooler Deployment per Postgres cluster, exposed via a new Service.
- Pooler pods distributed across availability zones.
- Pooler pods are CPU-intensive with low memory (<100 MB simple case) — "it makes sense to create as many as needed to prevent resource saturation."
- Trade-off acknowledged: pods scattered across nodes / AZs → latency variability.
- Escape hatch for latency-sensitive workloads: manually create a single "big" pooler instance with affinity to the same node as the database, configure CPU Manager, use a secondary smaller pooler for HA. See patterns/big-pooler-affinity-plus-small-pooler-ha.
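The probabilistic selection behind the non-uniform pod loads can be sketched in a few lines. This is an illustrative model, not kube-proxy's code: for n backends, kube-proxy emits one iptables `statistic mode random` rule per backend, where rule i matches with probability 1/(n−i) and unmatched traffic falls through to the next rule.

```python
import random

def pick_backend(rng, n=4):
    # Model of kube-proxy's iptables chain: rule i matches with
    # probability 1/(n - i); the last rule always matches.
    for i in range(n - 1):
        if rng.random() < 1.0 / (n - i):
            return i
    return n - 1

rng = random.Random(42)
hits = [0] * 4
for _ in range(100_000):
    hits[pick_backend(rng)] += 1
shares = [h / 100_000 for h in hits]
# Each backend gets ~25% of *new connections* in expectation, but every
# draw is independent — with only a handful of long-lived PgBouncer
# connections the realized split can easily be lopsided, which is what
# the 977m / 995m / 585m / 993m observation reflects.
```

The chain is uniform in expectation (1/4, then 3/4·1/3, then 3/4·2/3·1/2, then the remainder), so the skew comes from small-sample variance over long-lived connections, not from biased probabilities.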
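Before reaching for perf tracepoints, a low-friction way to see which CPUs absorb NET_RX / NET_TX softirqs is to diff `/proc/softirqs` around a load run. A minimal sketch, assuming Linux (the parser is mine, not from the article):

```python
def softirq_counts(path="/proc/softirqs"):
    # Parse /proc/softirqs into {vector_name: {cpu: count}}.
    with open(path) as f:
        cpus = f.readline().split()              # ["CPU0", "CPU1", ...]
        table = {}
        for line in f:
            name, *vals = line.split()
            table[name.rstrip(":")] = dict(zip(cpus, map(int, vals)))
    return table

# Sample before and after a pgbench run; the CPUs whose NET_RX count grows
# are handling receive softirqs. If such a CPU is the hyperthread sibling
# of a PgBouncer's CPU, both contend for the same physical core.
before = softirq_counts()
```

This only shows counts, not latency; the per-vector latency percentiles in the experiment still need the `irq:softirq_entry` / `irq:softirq_exit` tracepoints.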
Systems extracted¶
- PgBouncer — the selected pooler; canonical behavior probed on Kubernetes here.
- PostgreSQL — the process-per-connection server whose connection cost motivates pooling.
- Kubernetes — the substrate; both Service load balancing and the CPU Manager feature here.
- kube-proxy — iptables-mode load distribution is non-uniform.
- Zalando Postgres Operator — the Kubernetes operator that owns the pooler deployment topology.
- Pgpool-II — alternative pooler, rejected as too heavyweight.
- Odyssey — Yandex's newer pooler, noted.
- pgagroal — another newer pooler, noted.
Concepts extracted¶
- concepts/kube-proxy-iptables-probability — Kubernetes Service in iptables mode picks a backend with fixed per-connection probabilities rather than strict round-robin. Canonical datum: one of four PgBouncer pods received ~half the traffic of peers.
- concepts/so-reuseport-pgbouncer-scaling — PgBouncer's `so_reuseport` option lets multiple processes share a port so the pooler can use multiple CPU cores.
- concepts/hyperthread-softirq-contention — softirq NET_RX / NET_TX handler latency is higher when two latency-sensitive processes share a physical core via hyperthreads. Kernel-doc-backed; Linux recommends queue count ≤ physical cores.
- concepts/cpu-manager-static-policy — Kubernetes CPU Manager feature that exclusively pins pods to cpusets, avoiding sibling-hyperthread placement problems.
- concepts/network-namespace-benchmarking — reproducible network-stack experiments via `ip netns` + `veth` + `tc netem` on a single host.
- concepts/getsnapshotdata-o-n — Postgres's transaction-visibility function is O(connections), making the connection count a first-class cost dimension.
- concepts/process-per-connection-postgres — already on wiki; this source re-canonicalises as the root motivation for pooling.
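The `so_reuseport` mechanism is plain Linux SO_REUSEPORT underneath. A minimal sketch (assuming Linux ≥ 3.9; this is not PgBouncer's actual code) of two listeners sharing one port:

```python
import socket

def reuseport_listener(host, port):
    # With SO_REUSEPORT set before bind, several sockets owned by the same
    # UID may bind the same (host, port); the kernel then spreads incoming
    # connections across their accept queues.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen()
    return s

# First listener grabs an ephemeral port; the second binds the same port,
# which would raise EADDRINUSE without SO_REUSEPORT.
a = reuseport_listener("127.0.0.1", 0)
b = reuseport_listener("127.0.0.1", a.getsockname()[1])
```

Each listener is a separate process in the PgBouncer case, so each gets its own CPU; the kernel's accept-distribution is what makes the multi-process pooler look like a single endpoint.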
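The getsnapshotdata-o-n point is easy to miss: the scan is over backend slots, not active transactions. A toy model (my illustration, not Postgres source) of why idle connections still cost:

```python
def get_snapshot(proc_array):
    # Toy stand-in for Postgres's GetSnapshotData: every backend slot is
    # scanned to collect in-progress transaction ids, so the cost is
    # O(len(proc_array)) even when almost all slots are idle (None).
    return sorted(xid for xid in proc_array if xid is not None)

# 1000 connections with only 10 active still mean a 1000-slot scan on
# every snapshot — hence pooling to cap the slot count.
procs = [None] * 990 + list(range(10))
snapshot = get_snapshot(procs)
```

This is why the summary treats connection count, not query load, as the first-class cost dimension.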
Patterns extracted¶
- patterns/connection-pooler-as-separate-deployment — the canonical shape Zalando picked: one pooler Deployment per Postgres cluster, Service-fronted, AZ-spread.
- patterns/fixed-cpu-pinning-for-latency-sensitive-pool — use CPU Manager static policy + cpuset to pin the pooler off shared hyperthreads.
- patterns/so-reuseport-multi-process-single-port — general Linux pattern; PgBouncer is the canonical instance here.
- patterns/big-pooler-affinity-plus-small-pooler-ha — when latency variability is unacceptable, manually create a single "big" pooler with node-affinity to the database plus a small secondary for HA.
Operational numbers¶
- 1000 client connections dispatched via 8 threads in the pgbench load.
- 1 ms ± 0.1 ms normal-distributed netem delay on `veth0` to approximate Kubernetes network latency.
- 2 physical cores × 2 hyperthreads laptop benchmark topology.
- <100 MB memory per PgBouncer pod in simple cases — "CPU intensive work with minimal amount of memory."
- Pod CPU observation (iptables non-uniformity): 977m / 995m / 585m / 993m across four pods.
- Latency ratio: two HT-colocated poolers ≈ 2× the latency of one pooler on an isolated physical core.
Caveats¶
- Benchmark is on a laptop with 2 physical cores, not production-scale hardware. The shape of the HT-softirq effect is what generalises; the magnitudes depend on kernel version, NIC driver, IRQ affinity, and actual network card (a veth pair with netem is not a real NIC).
- The `pgbench` `;` (empty-statement) load is deliberately database-light — this isolates the pooler as the bottleneck but doesn't measure pooler behavior under realistic query load. For heterogeneous workloads the author suggests oltpbench or benchmarksql.
- Conclusions are about Zalando's operator opinionation; other Postgres operators (CrunchyData, StackGres, CloudNativePG) make different topology choices.
- 2020 article; kube-proxy has since grown IPVS mode (more evenly balanced than iptables) and Cilium/eBPF-based replacements change the probabilistic-load-balancing picture. Claims here should be dated to the iptables-era default.
Source¶
- Original: https://engineering.zalando.com/posts/2020/06/postgresql-connection-poolers.html
- Raw markdown:
raw/zalando/2020-06-23-pgbouncer-on-kubernetes-and-how-to-achieve-minimal-latency-e3b6cdd2.md
Related¶
- systems/pgbouncer · systems/postgresql · systems/kubernetes · systems/kube-proxy · systems/zalando-postgres-operator · systems/pgpool-ii · systems/odyssey · systems/pgagroal
- concepts/kube-proxy-iptables-probability · concepts/so-reuseport-pgbouncer-scaling · concepts/hyperthread-softirq-contention · concepts/cpu-manager-static-policy · concepts/network-namespace-benchmarking · concepts/getsnapshotdata-o-n · concepts/process-per-connection-postgres
- patterns/connection-pooler-as-separate-deployment · patterns/fixed-cpu-pinning-for-latency-sensitive-pool · patterns/so-reuseport-multi-process-single-port · patterns/big-pooler-affinity-plus-small-pooler-ha
- companies/zalando