
CONCEPT Cited by 1 source

OMP_NUM_THREADS container misconfiguration

Definition

The OMP_NUM_THREADS container misconfiguration is a multi-tenant Kubernetes / container-runtime pathology where libraries that read the OMP_NUM_THREADS environment variable (PyTorch, NumPy, OpenBLAS, and many other native math kernels) default to the host's vCPU count rather than the container's CPU quota, causing thread oversubscription, cgroup CPU throttling, and degraded performance under load.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"In containerized environments, OMP_NUM_THREADS (which controls the number of OpenMP threads used by Torch for CPU operations) defaults to the number of vCPUs on the host machine. In multitenant setups, this is a poor default: a host might have 192 vCPUs, but a container only has access to 12. The result is far more running threads than available cores. This drives CPU usage past the container's limit and triggers throttling."

The structural mechanism

Several layers participate in this misconfiguration:

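  1. OpenMP library default behavior: When OMP_NUM_THREADS is unset, the OpenMP runtime sizes its thread pool from the processors the OS reports as available, via sysconf(_SC_NPROCESSORS_ONLN) or a similar call. These calls typically return the host's vCPU count, because a CPU quota limits CPU time without hiding any CPUs from the process.
  2. PyTorch picks this up: PyTorch's intra-op thread count (the value reported by torch.get_num_threads()) defaults to the OpenMP-reported value unless it is overridden, for example with torch.set_num_threads(). NumPy's BLAS-backed operations inherit a host-sized default in the same way.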
  3. Container has CPU quota via cgroup cpu.cfs_quota_us / cpu.cfs_period_us (Kubernetes resources.limits.cpu). The container is allowed only N CPU-cores' worth of work per period.
  4. Threads exceed cores by 16× (192 host vCPUs / 12 container vCPUs in the disclosed Databricks case). Because the threads can run on many host cores at once, together they consume the container's CPU-time budget long before the period ends.
  5. CPU throttling fires: once the threads have used up the quota for the current period, the kernel stops scheduling them until the next period begins. This is visible in cpu.stat as nr_throttled and throttled_time (throttled_usec on cgroup v2).
  6. Little useful work gets done: the threads contend for the same small CPU budget, and context switches, cache-line invalidation, and scheduler overhead eat into the compute that remains.
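
The arithmetic behind steps 3 to 5 can be sketched as follows. This is a deliberately simplified model, not a description of the scheduler: it assumes every thread is runnable, that the host has a free core for each of them, and that the CFS period is the common default of 100 ms. The function name is illustrative.

```python
def throttle_profile(threads: int, quota_cpus: float, period_us: int = 100_000):
    """Return (busy_us, throttled_us) per CFS period in a simplified model.

    Assumes every thread is runnable and gets its own host core, so the
    container burns CPU time `threads` times faster than wall-clock time.
    """
    budget_us = quota_cpus * period_us               # CPU time allowed per period
    busy_us = min(period_us, budget_us / threads)    # wall time until the budget is spent
    return busy_us, period_us - busy_us


busy, throttled = throttle_profile(threads=192, quota_cpus=12)
print(f"running {busy / 1000:.2f} ms, throttled {throttled / 1000:.2f} ms per period")
```

Under these assumptions, the 192-thread, 12-CPU case spends 6.25 ms of each period running and 93.75 ms throttled, whereas 12 threads would never exhaust the budget.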

The pathology is invisible to most monitoring:

  • Host CPU% is not high (the host has spare cores).
  • Container CPU% can look near-100% but doesn't hint at the cause.
  • Per-thread CPU% is low because the threads spend much of each period throttled rather than running.
  • Application latency is high but with no obvious smoking gun.
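
Since the usual dashboards hide the cause, one place to look is the cgroup's own cpu.stat counters. The sketch below parses the text of such a file and reports the share of periods that were throttled. The helper names and the sample values are illustrative, and the file location (commonly /sys/fs/cgroup/cpu.stat on cgroup v2) is an assumption about how the container's cgroup filesystem is mounted.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict[str, int]) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Illustrative sample in cgroup v2 layout (the values are made up).
sample = """usage_usec 5400000
nr_periods 1000
nr_throttled 930
throttled_usec 87000000
"""
print(throttled_fraction(parse_cpu_stat(sample)))
```

A fraction close to 1 while host CPU% stays low is consistent with the oversubscription pattern described here, although other causes of throttling would produce the same counters.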

Why it's specifically a multi-tenant problem

This bug only manifests as a production-impacting problem under two conditions that compose:

  1. Multi-tenant container scheduling: the container's CPU quota is much smaller than the host's CPU count. On a single-tenant host this is rarely true.
  2. CPU-intensive workload: text-only LLM inference where forward-pass dominates barely cares; multimodal preprocessing (where each image traverses several BLAS / OpenMP-backed paths per request) gets killed by it.

Together these two conditions describe multi-tenant LLM serving on Kubernetes with multimodal traffic — the exact regime Databricks reports.

Why "moving to threads/processes didn't solve it"

The Databricks post notes that moving blocking operations into separate threads and processes did not fix multimodal request pile-up. The reason is precisely this misconfiguration: every process spawned inside the same container brings its own host-sized OpenMP thread pool, and all of those pools draw on the same CPU quota, making the throttling worse, not better. Process-level parallelism does not cure thread-level oversubscription; until the per-process OMP_NUM_THREADS is fixed, adding processes only multiplies the thread count.
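
A small sketch makes the multiplication concrete. The worker count of 4 is a hypothetical figure, not taken from the Databricks post, and splitting the quota evenly across processes is one possible policy rather than a recommendation from the source.

```python
def total_compute_threads(processes: int, threads_per_process: int) -> int:
    """Threads competing for the container quota across all worker processes."""
    return processes * threads_per_process


def threads_per_process_budget(quota_cpus: int, processes: int) -> int:
    """Split the container quota evenly across processes, at least 1 thread each."""
    return max(1, quota_cpus // processes)


# Hypothetical example: 4 worker processes in the 12-CPU container on a 192-vCPU host.
print(total_compute_threads(4, 192))       # OMP_NUM_THREADS unset: 768 threads
print(threads_per_process_budget(12, 4))   # capped: 3 threads per process
```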

The fix

"By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs."

The mechanical fix is to set OMP_NUM_THREADS to a value matching the container's CPU quota:

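# In the Dockerfile or container-launch configuration.
# Match the container's CPU limit (Kubernetes resources.limits.cpu).
# Dockerfile comments must sit on their own line, not after an instruction.
ENV OMP_NUM_THREADS=12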

Or, programmatically, at process start:

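import os

# Resolve the container CPU quota from the cgroup files.
# read_cgroup_cpu_quota() is a placeholder; its implementation varies by cgroup version.
quota = read_cgroup_cpu_quota()

# Set the variable before importing torch: OpenMP runtimes typically read it once, at initialization.
os.environ["OMP_NUM_THREADS"] = str(quota)

import torch
torch.set_num_threads(quota)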
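
The read_cgroup_cpu_quota() helper in the snippet above is left undefined. One possible sketch is shown below. It assumes the cgroup filesystem is mounted at /sys/fs/cgroup with the usual file names (cpu.max on cgroup v2, cpu/cpu.cfs_quota_us and cpu/cpu.cfs_period_us on v1), rounds fractional quotas up, and falls back to os.cpu_count() when no quota is found. None of these choices comes from the Databricks post.

```python
import math
import os
from pathlib import Path


def read_cgroup_cpu_quota(root: str = "/sys/fs/cgroup") -> int:
    """Return the CPU quota as a whole number of CPUs, at least 1.

    Falls back to os.cpu_count() when no quota is set or no file is found.
    """
    base = Path(root)
    try:
        # cgroup v2: one file holding "<quota> <period>" or "max <period>".
        quota, period = (base / "cpu.max").read_text().split()
        if quota != "max":
            return max(1, math.ceil(int(quota) / int(period)))
    except (OSError, ValueError):
        try:
            # cgroup v1: two files; a quota of -1 means "unlimited".
            quota_us = int((base / "cpu/cpu.cfs_quota_us").read_text())
            period_us = int((base / "cpu/cpu.cfs_period_us").read_text())
            if quota_us > 0 and period_us > 0:
                return max(1, math.ceil(quota_us / period_us))
        except (OSError, ValueError):
            pass
    return os.cpu_count() or 1
```

Rounding up keeps a fractional limit such as 500m from resolving to zero threads; rounding down is an equally defensible choice when throttling must be avoided at all costs.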

Together with the PIL→Torchvision swap, setting OMP_NUM_THREADS correctly delivered more than 3× the RPS on the same hardware in the Databricks deployment.

Siblings — other env-vars in the same family

OMP_NUM_THREADS is the most-cited example, but the same pattern applies to other libraries:

| Variable | Library | Default behaviour |
|---|---|---|
| OMP_NUM_THREADS | OpenMP / PyTorch / NumPy / many others | host vCPUs |
| OPENBLAS_NUM_THREADS | OpenBLAS | host vCPUs |
| MKL_NUM_THREADS | Intel MKL | host vCPUs |
| BLIS_NUM_THREADS | BLIS BLAS | host vCPUs |
| NUMEXPR_NUM_THREADS | NumExpr | host vCPUs |
| TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS | TensorFlow | host vCPUs |
| RAYON_NUM_THREADS | Rust rayon (used by some Rust-backed Python packages) | host CPUs |

A robust container build sets all of them at once. The structural fix is container-aware defaults in the underlying libraries — some have made progress (e.g. PyTorch detects cgroup limits in newer versions for some operations), but full coverage is incomplete and manual override remains necessary as of 2026.
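
For builds where the values cannot be baked into the image, the same variables can be set at process start. The sketch below is one way to do that, assuming it runs before NumPy, PyTorch, or TensorFlow are imported. The function name and the override flag are illustrative, and giving every variable the same value is a simplification (TensorFlow's inter-op and intra-op pools, for instance, may warrant different numbers).

```python
import os

# Variable names taken from the table above.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "BLIS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "TF_NUM_INTEROP_THREADS",
    "TF_NUM_INTRAOP_THREADS",
    "RAYON_NUM_THREADS",
)


def cap_thread_env(n_threads: int, env=os.environ, override: bool = False) -> dict:
    """Set every thread-count variable to n_threads and return what was applied.

    Existing values are kept unless override is True, so an operator's
    explicit setting wins over this default.
    """
    applied = {}
    for name in THREAD_ENV_VARS:
        if override or name not in env:
            env[name] = str(n_threads)
            applied[name] = env[name]
    return applied
```

Calling cap_thread_env(12) as the first statement of the entry point, ahead of any numerical imports, mirrors the Dockerfile approach while still respecting values already present in the environment.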

Generalised lesson — "container-host CPU asymmetry"

This is one instance of a broader class of failures: software written on the assumption that the OS reports correct CPU/memory/etc. for the running unit-of-execution, when in fact the OS reports host-level numbers and the unit is constrained at a layer above (cgroup, container, VM). Other manifestations:

  • The default GOMAXPROCS in Go programs (similar issue; the automaxprocs library is the canonical fix, and Go 1.25 and later take cgroup CPU limits into account by default).
  • JVM heap-size auto-tuning historically read host RAM, not container limit (now usually fixed via -XX:MaxRAMPercentage or newer JVM container-aware mode).
  • Memory-pool allocation in NumPy / NumExpr (similar).

The common-cause framing: "runtime introspection of resources must ask the right authority" — not the host kernel, but the container runtime (cgroup) when running in a container.

Composition

  • Above in the failure chain: multimodal CPU bottleneck surfaces this misconfiguration because multimodal preprocessing is the workload that maximally exposes it.
  • Adjacent: concepts/noisy-neighbor — host-vs-container CPU asymmetry is a noisy-neighbor amplifier in multi-tenant deployments.
  • Adjacent: concepts/tenant-isolation — fixing OMP_NUM_THREADS per-container is part of correct tenant isolation in Kubernetes-based ML serving.

Open questions

  • Why do major Python ML libraries still default this way in 2026? The container-aware-defaults problem has been known for years; mitigation is library-by-library and still incomplete.
  • What's the right value to set? Match container quota exactly, or quota - 1 for system overhead, or some fraction? Not specified in the post.
  • Detection in the wild: profiling-friendly tools to flag this specific misconfiguration vs other CPU-throttling causes — none named.

