CONCEPT Cited by 1 source
OMP_NUM_THREADS container misconfiguration¶
Definition¶
The OMP_NUM_THREADS container misconfiguration is a multi-tenant
Kubernetes / container-runtime pathology in which libraries that size
their thread pools from the OMP_NUM_THREADS environment variable
(PyTorch, NumPy, OpenBLAS, and many other native math kernels) fall
back, when the variable is unset, to the host's vCPU count rather than
the container's CPU quota. The result is thread oversubscription,
cgroup CPU throttling, and degraded performance under load.
Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"In containerized environments, OMP_NUM_THREADS (which controls the number of OpenMP threads used by Torch for CPU operations) defaults to the number of vCPUs on the host machine. In multitenant setups, this is a poor default: a host might have 192 vCPUs, but a container only has access to 12. The result is far more running threads than available cores. This drives CPU usage past the container's limit and triggers throttling."
The structural mechanism¶
Several layers participate in this misconfiguration:
- OpenMP library default behavior: When OMP_NUM_THREADS is unset, OpenMP queries the system for the number of "available" processors via sysconf(_SC_NPROCESSORS_ONLN) or similar. This call reflects the CPUs the host exposes, not the container's CPU quota.
- PyTorch picks this up: PyTorch's intra-op thread pool (the value that torch.set_num_threads() overrides) defaults to the OpenMP-reported value if no override is set. So do NumPy's BLAS-backed operations.
- Container has a CPU quota via cgroup cpu.cfs_quota_us / cpu.cfs_period_us (cgroup v1; cpu.max on cgroup v2), set from Kubernetes resources.limits.cpu. The container is allowed only N CPU-cores' worth of CPU time per period.
- Threads exceed cores by 16× (192 host vCPUs / 12 container vCPUs in the disclosed Databricks case). The kernel time-slices them within the container's quota.
- CPU throttling fires: when the threads exhaust the quota for the current period, the kernel pauses them via cgroup throttling, visible in cpu.stat as nr_throttled / throttled_time (throttled_usec on cgroup v2).
- Little useful work gets done: threads context-switch between each other, fighting for the same cores, with cache-line invalidation and scheduler overhead eating the available compute.
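The arithmetic behind the last three steps can be made concrete with a deliberately idealized model. The sketch below is an illustration, not a description of the scheduler's actual accounting: it assumes every thread is CPU-bound, that the host has a free core for each of them, and that the quota period is the common 100 ms default. The function name throttle_profile is hypothetical.

```python
def throttle_profile(runnable_threads, quota_cores, period_ms=100.0):
    """Idealized CFS bandwidth model: return (run_ms, throttled_ms) per period.

    Assumes every thread is CPU-bound and the host has a free core for each,
    so the container consumes `runnable_threads` ms of CPU time per
    wall-clock millisecond until its budget for the period is gone.
    """
    budget_ms = quota_cores * period_ms  # CPU time allowed per period
    run_ms = min(period_ms, budget_ms / runnable_threads)
    return run_ms, period_ms - run_ms


# Databricks example: 192 OpenMP threads against a 12-core quota.
run_ms, throttled_ms = throttle_profile(192, 12)
print(f"runs {run_ms:.2f} ms, throttled {throttled_ms:.2f} ms per 100 ms period")

# Thread count matched to the quota: no throttling in this model.
print(throttle_profile(12, 12))
```

Under these assumptions the 192-thread container burns its whole budget in the first 6.25 ms of each period and sits throttled for the remaining 93.75 ms, which is why request latency suffers even though the host has idle cores.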
The pathology is invisible to most monitoring:
- Host CPU% is not high (the host has spare cores).
- Container CPU% can look near-100% but doesn't hint at the cause.
- Per-thread CPU% is low because each thread spends most of its time throttled or waiting to be scheduled.
- Application latency is high but with no obvious smoking gun.
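One signal that does expose the problem is the cgroup's own throttling counters. The following is a minimal sketch of reading them, assuming the cpu.stat text has already been obtained from the container's cgroup directory; the helper names and the sample numbers are made up for illustration.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_period_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Illustrative sample only (cgroup v2 field names); not measured data.
SAMPLE_CPU_STAT = """\
usage_usec 1200000000
nr_periods 10000
nr_throttled 9400
throttled_usec 880000000
"""

stats = parse_cpu_stat(SAMPLE_CPU_STAT)
print(f"{throttled_period_ratio(stats):.0%} of periods throttled")
```

A ratio near 1.0 alongside idle host cores is consistent with the oversubscription described here, although other causes of throttling produce the same counters, so it is a hint rather than a diagnosis.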
Why it's specifically a multi-tenant problem¶
This bug only manifests as a production-impacting problem under two conditions that compose:
- Multi-tenant container scheduling: the container's CPU quota is much smaller than the host's CPU count. On a single-tenant host this is rarely true.
- CPU-intensive workload: text-only LLM inference where forward-pass dominates barely cares; multimodal preprocessing (where each image traverses several BLAS / OpenMP-backed paths per request) gets killed by it.
Together these two conditions describe multi-tenant LLM serving on Kubernetes with multimodal traffic — the exact regime Databricks reports.
Why "moving to threads/processes didn't solve it"¶
The Databricks post notes that moving blocking operations into separate threads and processes did not fix multimodal request pile-up. The reason is precisely this misconfiguration: every additional process inside the same container brings its own OpenMP thread pool, sized to the host, and all of those pools draw on the same CPU quota, making the throttling worse, not better. Process-level parallelism cannot compensate for thread-level oversubscription: until the per-process OMP_NUM_THREADS is fixed, adding processes does not help.
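The multiplication is easy to see with numbers. This sketch uses the 192 / 12 figures from the post; the worker count of 4 and both helper names are hypothetical, and splitting the quota evenly across processes is one possible policy rather than something the post prescribes.

```python
def total_compute_threads(processes, threads_per_process):
    """Each process gets its own OpenMP pool, so the pools multiply."""
    return processes * threads_per_process


def threads_per_process_cap(quota_cores, processes):
    """Split the container quota evenly across processes, at least 1 thread each."""
    return max(1, quota_cores // processes)


host_vcpus, quota_cores, workers = 192, 12, 4  # workers=4 is an assumed value

# OMP_NUM_THREADS unset: every worker sizes its pool to the host.
print(total_compute_threads(workers, host_vcpus))

# OMP_NUM_THREADS set per worker so the pools together fit the quota.
cap = threads_per_process_cap(quota_cores, workers)
print(cap, total_compute_threads(workers, cap))
```

With the variable unset, four workers yield 768 compute threads contending for 12 cores' worth of quota; with a per-worker cap of 3 the total matches the quota.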
The fix¶
"By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs."
The mechanical fix is to set OMP_NUM_THREADS to a value matching
the container's CPU quota:
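# In the Dockerfile or container-launch configuration.
# Match the container's resources.limits.cpu (12 in the Databricks example).
# Dockerfile comments must sit on their own line; a trailing "#" is parsed as an argument.
ENV OMP_NUM_THREADS=12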
Or, programmatically, at process start:
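import os
# Resolve the container CPU quota from the cgroup files.
# read_cgroup_cpu_quota() is a placeholder; its implementation varies by cgroup version.
quota = read_cgroup_cpu_quota()
# Set the variable before importing torch: the OpenMP runtime reads it when it initializes.
os.environ["OMP_NUM_THREADS"] = str(quota)
import torch
# Also cap PyTorch's intra-op thread pool explicitly.
torch.set_num_threads(quota)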
Combined with the PIL→Torchvision swap, this change delivered >3× RPS on the same hardware in the Databricks deployment.
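The read_cgroup_cpu_quota() helper in the programmatic snippet is left undefined by the source. One possible implementation is sketched below, under several assumptions: the cgroup filesystem is mounted at /sys/fs/cgroup, the cgroup v1 CPU controller lives under a cpu subdirectory, and fractional quotas are rounded up. The rounding choice in particular is a guess, since the post does not say what value to use.

```python
import math
import os
from pathlib import Path


def cores_from_cpu_max(text):
    """cgroup v2: cpu.max holds '<quota> <period>' or 'max <period>'."""
    quota, period = text.split()
    if quota == "max":
        return None  # no limit configured
    return max(1, math.ceil(int(quota) / int(period)))


def cores_from_cfs(quota_us, period_us):
    """cgroup v1: cpu.cfs_quota_us of -1 means no limit."""
    if quota_us <= 0 or period_us <= 0:
        return None
    return max(1, math.ceil(quota_us / period_us))


def read_cgroup_cpu_quota(root="/sys/fs/cgroup"):
    """Return a thread cap from the cgroup quota, else the visible CPU count."""
    root = Path(root)
    cores = None
    try:
        if (root / "cpu.max").is_file():  # cgroup v2 layout
            cores = cores_from_cpu_max((root / "cpu.max").read_text())
        elif (root / "cpu" / "cpu.cfs_quota_us").is_file():  # assumed v1 layout
            cores = cores_from_cfs(
                int((root / "cpu" / "cpu.cfs_quota_us").read_text()),
                int((root / "cpu" / "cpu.cfs_period_us").read_text()),
            )
    except (OSError, ValueError):
        cores = None  # unreadable file or unexpected format
    return cores or os.cpu_count() or 1


# A 12-core limit expressed as quota/period in microseconds.
print(cores_from_cpu_max("1200000 100000"))
```

Keeping the parsing in pure functions makes the logic testable without a container; only read_cgroup_cpu_quota() touches the filesystem, and it falls back to os.cpu_count() when no limit is found.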
Siblings — other env-vars in the same family¶
OMP_NUM_THREADS is the most-cited example, but the same pattern
applies to other libraries:
| Variable | Library | Default behaviour |
|---|---|---|
| OMP_NUM_THREADS | OpenMP / PyTorch / NumPy / many others | host vCPUs |
| OPENBLAS_NUM_THREADS | OpenBLAS | host vCPUs |
| MKL_NUM_THREADS | Intel MKL | host vCPUs |
| BLIS_NUM_THREADS | BLIS BLAS | host vCPUs |
| NUMEXPR_NUM_THREADS | NumExpr | host vCPUs |
| TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS | TensorFlow | host vCPUs |
| RAYON_NUM_THREADS | Rust rayon (used by some Rust-backed Python extensions) | host CPUs |
A robust container build sets all of them at once. The structural fix is container-aware defaults in the underlying libraries — some have made progress (e.g. PyTorch detects cgroup limits in newer versions for some operations), but full coverage is incomplete and manual override remains necessary as of 2026.
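Setting the whole family at once can also be done at process start. The sketch below assumes it runs before NumPy, PyTorch, or TensorFlow are imported, since these runtimes typically read the variables when they initialize; cap_thread_env is a hypothetical helper, and giving every variable the same value is a simplification (for example, TensorFlow's inter-op and intra-op pools may deserve different sizes).

```python
import os

# Variables from the table above.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "BLIS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "TF_NUM_INTEROP_THREADS",
    "TF_NUM_INTRAOP_THREADS",
    "RAYON_NUM_THREADS",
)


def cap_thread_env(n_threads, env=None, overwrite=False):
    """Set every thread-count variable to n_threads.

    Values that are already set are kept unless overwrite=True, so an
    explicit operator override still wins.
    """
    env = os.environ if env is None else env
    for name in THREAD_ENV_VARS:
        if overwrite or name not in env:
            env[name] = str(n_threads)
    return env


# Demonstrated on a plain dict so the real environment is left untouched.
demo = cap_thread_env(12, env={"MKL_NUM_THREADS": "4"})
print(demo["OMP_NUM_THREADS"], demo["MKL_NUM_THREADS"])
```

In a real service the call would be cap_thread_env(quota) at the very top of the entry point, with quota taken from the container's CPU limit.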
Generalised lesson — "container-host CPU asymmetry"¶
This is one instance of a broader class of failures: software written on the assumption that the OS reports correct CPU/memory/etc. for the running unit-of-execution, when in fact the OS reports host-level numbers and the unit is constrained at a layer above (cgroup, container, VM). Other manifestations:
- runtime.GOMAXPROCS() in Go programs (similar issue; the automaxprocs library is the canonical fix).
- JVM heap-size auto-tuning historically read host RAM, not the container limit (now usually fixed via -XX:MaxRAMPercentage or the newer JVM container-aware mode).
- Memory-pool allocation in NumPy / NumExpr (similar).
The common-cause framing: "runtime introspection of resources must ask the right authority" — not the host kernel, but the container runtime (cgroup) when running in a container.
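For CPU counts, asking the right authority can be sketched as taking the tightest of the limits that apply to the process. The function below is an illustration under that assumption; effective_cpu_count is a hypothetical name, the quota is passed in rather than read from the cgroup, and flooring a fractional quota is a judgment call.

```python
import os


def effective_cpu_count(host_cpus, affinity_cpus=None, quota_cores=None):
    """Return the tightest applicable CPU limit, never less than 1.

    host_cpus:     what the kernel reports for the whole machine
    affinity_cpus: size of the process's CPU affinity mask, if known
    quota_cores:   cgroup CPU quota expressed in cores, if one is set
    """
    candidates = [host_cpus]
    if affinity_cpus:
        candidates.append(affinity_cpus)
    if quota_cores:
        candidates.append(quota_cores)
    return max(1, int(min(candidates)))


host = os.cpu_count() or 1
# sched_getaffinity is not available on every platform, so guard the call.
affinity = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None
print(effective_cpu_count(host, affinity))

# The Databricks figures: the quota, not the host, decides.
print(effective_cpu_count(192, 192, 12))
```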
Composition¶
- Above in the failure chain: multimodal CPU bottleneck surfaces this misconfiguration because multimodal preprocessing is the workload that maximally exposes it.
- Adjacent: concepts/noisy-neighbor — host-vs-container CPU asymmetry is a noisy-neighbor amplifier in multi-tenant deployments.
- Adjacent: concepts/tenant-isolation. Fixing OMP_NUM_THREADS per container is part of correct tenant isolation in Kubernetes-based ML serving.
Open questions¶
- Why do major Python ML libraries still default this way in 2026? The container-aware-defaults problem has been known for years; mitigation is library-by-library and still incomplete.
- What's the right value to set? Match container quota exactly, or quota - 1 for system overhead, or some fraction? Not specified in the post.
- Detection in the wild: profiling-friendly tools to flag this specific misconfiguration vs other CPU-throttling causes — none named.
Seen in¶
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale: first canonical wiki disclosure of the OMP_NUM_THREADS host-vs-container default as a load-bearing multi-tenant LLM-serving pathology. Worked example: 192 host vCPUs / 12 container vCPUs.
Related¶
- concepts/multimodal-cpu-bottleneck — the failure mode that surfaces this misconfiguration.
- concepts/cpu-bound-serving-small-fast-model — adjacent CPU-bound regime.
- concepts/noisy-neighbor — host-vs-container CPU asymmetry amplifies noisy-neighbor in multi-tenant clusters.
- concepts/tenant-isolation — correct per-container thread caps are a piece of tenant isolation.
- systems/kubernetes — the substrate where the cgroup quotas live.
- systems/pytorch — the library most affected (per the post).
- patterns/torchvision-over-pil-image-processing — the companion fix shipped together.
- patterns/multiprocessing-runtime-for-cpu-bound-serving — the adjacent CPU-bound-fix pattern that does not help if OMP_NUM_THREADS is wrong.
- systems/databricks-model-serving — the platform context.