CONCEPT Cited by 1 source
Hyperthread softirq contention
Definition
When two latency-sensitive processes are scheduled on
sibling hyperthreads of the same physical CPU core, the
kernel's softirq handlers — particularly NET_RX
(vector 3) and NET_TX (vector 2) — run with measurably
higher per-invocation latency than when the two processes
sit on different physical cores. This translates directly into
higher application-level p99 latency for network-heavy
workloads.
The Linux kernel's networking scaling documentation (Documentation/networking/scaling.rst) makes the recommendation explicit: "For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system."
Mechanism
- Hyperthreads on a single physical core share execution units, L1/L2 caches, branch predictors, and TLB.
- Softirq handlers are short, spiky, cache-sensitive — they touch the ring buffer, traverse socket bookkeeping, run protocol-stack code paths.
- When a user-space process on the sibling thread is actively running (especially if it also touches network state), softirq handlers contend for the shared microarchitectural resources.
- Net effect: the softirq handler takes longer to complete, packet delivery to user space is delayed, and application p99 latency grows.
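Which logical CPUs are siblings can be read from sysfs: each CPU's /sys/devices/system/cpu/cpuN/topology/thread_siblings_list names the logical CPUs sharing its physical core. A minimal sketch of grouping logical CPUs by physical core from those lists — the sysfs paths are real, but the sample topology below is made up for illustration:

```python
def parse_cpu_list(s):
    """Parse a sysfs CPU list like '0,2' or '0-1' into a sorted tuple."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return tuple(sorted(cpus))

def sibling_groups(siblings_by_cpu):
    """Group logical CPUs that share a physical core.

    siblings_by_cpu maps a logical CPU id to the contents of its
    topology/thread_siblings_list file.
    """
    return sorted({parse_cpu_list(v) for v in siblings_by_cpu.values()})

# Hypothetical 2-core machine with HT: CPUs 0/2 and 1/3 are sibling pairs.
topo = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}
print(sibling_groups(topo))  # [(0, 2), (1, 3)]
```

Two processes pinned to CPUs from the same tuple land on sibling hyperthreads and hit exactly the contention described above.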
Evidence in the wild
Zalando's PgBouncer experiment:
| CPU placement | Observed latency |
|---|---|
| One PgBouncer on isolated physical core | Lowest |
| Two PgBouncers on sibling HTs of one physical core | ~2× higher than baseline |
| Two PgBouncers on two separate physical cores | Middle (with modest noise from other HT) |
Per-softirq latency measurement via
irq:softirq_entry / irq:softirq_exit tracepoints
confirmed higher 99th-percentile softirq latency in the
shared-physical-core case — the root cause behind the
application-level latency degradation.
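The tracepoint measurement pairs each irq:softirq_entry with the matching irq:softirq_exit on the same CPU and softirq vector, then examines the tail of the deltas. A simplified sketch of that pairing logic over pre-parsed events — the tuple layout and sample data are illustrative, not perf's actual output format:

```python
import math

def softirq_latencies(events):
    """Match entry/exit events per (cpu, vec); return latencies in microseconds.

    Each event is (timestamp_us, cpu, vec, kind) with kind "entry" or "exit".
    """
    open_entries = {}  # (cpu, vec) -> entry timestamp
    latencies = []
    for ts, cpu, vec, kind in sorted(events):
        key = (cpu, vec)
        if kind == "entry":
            open_entries[key] = ts
        elif key in open_entries:
            latencies.append(ts - open_entries.pop(key))
    return latencies

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# Two NET_RX (vector 3) invocations on CPU 0: 50 us and 30 us long.
events = [(0, 0, 3, "entry"), (50, 0, 3, "exit"),
          (100, 0, 3, "entry"), (130, 0, 3, "exit")]
print(p99(softirq_latencies(events)))  # 50
```

Run across a few minutes of traffic, comparing p99 between the shared-core and separate-core placements is what isolates the softirq cost from application noise.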
Mitigations
- Pin latency-sensitive processes to one logical CPU per physical core, never to sibling hyperthreads — use taskset or a cgroup cpuset; on Kubernetes, the CPU Manager static policy handles this automatically.
- Disable hyperthreading on the host — the brute-force option, trading throughput for latency consistency.
- Align NIC queue count with physical-core count per the kernel doc recommendation; RSS/RPS should not create more queues than physical cores if interrupt latency matters.
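The first and third mitigations both reduce to the same computation: pick one logical CPU per physical core. A sketch of deriving the pin set (and, as a byproduct, the recommended queue count) from sibling groups like those sysfs reports — the function name and sample topology are hypothetical:

```python
def physical_core_cpus(sibling_groups):
    """Pick one logical CPU per physical core (the lowest-numbered sibling).

    The result can feed `taskset -c` or a cgroup cpuset, and its length is
    the queue count the kernel scaling doc recommends for the NIC.
    """
    return sorted(min(group) for group in sibling_groups)

# Hypothetical 4-core/8-thread host where CPU n and n+4 are siblings.
groups = [(0, 4), (1, 5), (2, 6), (3, 7)]
cpus = physical_core_cpus(groups)
print(",".join(map(str, cpus)))  # 0,1,2,3 — e.g. taskset -c 0,1,2,3 pgbouncer ...
print(len(cpus))                 # 4 — NIC queue count per the kernel doc
```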
Why it matters
For database connection poolers, VoIP / media gateways, high-frequency-trading gateways, and low-latency service meshes, the 2× p99 bump from landing on the wrong hyperthread can blow an SLO. The effect is invisible in average metrics — average throughput looks fine; only the tail of the latency distribution reveals the problem.
Seen in
- sources/2020-06-23-zalando-pgbouncer-on-kubernetes-minimal-latency
— first-person reproduction with
perf record -e irq:softirq_entry,irq:softirq_exit and Brendan Gregg's latency extraction script. The article contains both the application-level (pgbench) and kernel-level (perf) evidence.
Related
- concepts/cpu-manager-static-policy — the Kubernetes-level mitigation.
- concepts/so-reuseport-pgbouncer-scaling — the mechanism by which two PgBouncer processes end up on the same host, making the HT placement question relevant.