PATTERN Cited by 1 source
Torchvision over PIL image processing¶
Pattern¶
For multimodal LLM serving, explicitly select the Torchvision-based image processor at model-load time when both PIL- and Torchvision-backed processors are available. The Torchvision path is ~10× faster on resize+normalise; the PIL default for some Hugging Face models is slow enough to block the inference engine's event loop.
Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"Among all CPU operations for images, image processing (resizing and normalization) is 10x slower than other operations like base64 decoding. Some Hugging Face models default to the PIL-based image processor, while others use the faster Torchvision-based processor."
"By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs. After the fix shipped, requests completed per second jumped > 3x with the same replicas and load."
When to use it¶
- Serving Hugging Face Transformers models with image inputs (vision-language models, multimodal LLMs).
- Production inference engines (vLLM and similar) where the front-end runs on Python's async event loop and a slow per-request preprocessing step blocks the loop.
- Workloads where image preprocessing happens on CPU per-request before the GPU forward pass — the standard configuration today.
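The event-loop concern in the list above can be made concrete with a small, standard-library-only sketch. It is not vLLM code: `max_heartbeat_gap` is a hypothetical helper, and `time.sleep` stands in for a slow synchronous resize+normalise call. A background task tries to tick every millisecond; the longest gap between ticks shows how long every other coroutine was starved.

```python
import asyncio
import time


async def max_heartbeat_gap(blocking_step) -> float:
    """Run `blocking_step` synchronously on the event loop and return the
    longest pause (in seconds) observed by a 1 ms heartbeat task."""
    gaps = []
    stop = asyncio.Event()

    async def heartbeat():
        last = time.perf_counter()
        while not stop.is_set():
            await asyncio.sleep(0.001)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.01)   # let the heartbeat record a few normal ticks
    blocking_step()             # synchronous call: nothing else can run meanwhile
    await asyncio.sleep(0.01)   # yield so the heartbeat can record the stall
    stop.set()
    await task
    return max(gaps)


if __name__ == "__main__":
    # Stand-in for a slow per-request preprocessing call (assumption: 50 ms).
    gap = asyncio.run(max_heartbeat_gap(lambda: time.sleep(0.05)))
    print(f"longest heartbeat gap: {gap * 1000:.1f} ms")
```

The reported gap should be at least roughly the duration of the blocking step, which is the time during which the front-end could not accept or schedule any other request.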
When NOT to use it¶
- Models where a custom GPU-side preprocessing pipeline (e.g. NVIDIA DALI, custom CUDA preprocessing) replaces the Hugging Face image processor entirely.
- Workloads where image inputs are pre-processed offline and cached — the per-request preprocessing cost is amortised away.
- Models that don't expose a Torchvision-backed alternative; some niche models only have PIL implementations.
Mechanics¶
The Hugging Face Transformers library has a base ImageProcessor
class with multiple backends. Modern image processors typically have
two implementations:
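```python
from transformers import AutoImageProcessor

# model_name: the Hugging Face model ID or local path being served.

# Default (may be PIL-backed for some models)
processor = AutoImageProcessor.from_pretrained(model_name)

# Force Torchvision-backed (faster)
processor = AutoImageProcessor.from_pretrained(
    model_name, use_fast=True
)
```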
The use_fast=True flag (or equivalent on the specific
image-processor class) selects the Torchvision-backed implementation.
As of 2026, this defaults to False for some models, so the override
must be set explicitly, per model, per deployment.
The Torchvision-backed processor uses
torchvision.transforms.v2, which dispatches to optimised native
code for resize, normalise, crop, and similar operations. PIL goes
through Python-level wrapping of libjpeg-turbo etc. with more
per-call overhead. The 10× cost difference comes from avoided
Python overhead and from operating on torch tensors directly
without the PIL Image bridge.
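The cost difference described above is worth measuring on the target hardware rather than assuming. The following is a minimal benchmarking sketch, not the Hugging Face processor code path: `mean_seconds` and `build_candidates` are hypothetical helpers, the input size, output size, and mean/std values are illustrative assumptions, and it assumes recent NumPy, Pillow, PyTorch, and a torchvision release whose `transforms.v2.functional` module provides `resize`, `to_dtype`, and `normalize`.

```python
import time


def mean_seconds(fn, repeats: int = 20) -> float:
    """Average wall-clock seconds per call, after one warm-up call."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def build_candidates(height: int = 1080, width: int = 1920, size: int = 224):
    """Build simplified stand-ins for the two preprocessing paths."""
    # Third-party imports are kept local so the timing helper stays stdlib-only.
    import numpy as np
    import torch
    from PIL import Image
    from torchvision.transforms.v2 import functional as F

    rng = np.random.default_rng(0)
    pixels = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    pil_image = Image.fromarray(pixels)
    tensor_image = torch.from_numpy(pixels).permute(2, 0, 1).contiguous()

    # Illustrative normalisation constants (assumption, not from the source).
    mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
    np_mean = np.array(mean, dtype=np.float32)
    np_std = np.array(std, dtype=np.float32)

    def pil_path():
        resized = pil_image.resize((size, size), Image.Resampling.BILINEAR)
        arr = np.asarray(resized, dtype=np.float32) / 255.0
        return (arr - np_mean) / np_std

    def torchvision_path():
        resized = F.resize(tensor_image, [size, size], antialias=True)
        scaled = F.to_dtype(resized, torch.float32, scale=True)
        return F.normalize(scaled, mean=mean, std=std)

    return {"pil": pil_path, "torchvision": torchvision_path}


if __name__ == "__main__":
    for name, fn in build_candidates().items():
        print(f"{name}: {mean_seconds(fn) * 1000:.2f} ms per image")
```

The ratio this prints depends on image size, thread settings, and library versions, so it should be read as a local measurement, not as a confirmation of any particular speed-up figure.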
Composition with OMP_NUM_THREADS fix¶
The 2026-05-27 disclosure pairs two independent fixes that together delivered >3× RPS:
- This pattern: switch to Torchvision processor.
- Fix OMP_NUM_THREADS to match container CPU quota.
Both are required because:
- Torchvision dispatches to native Torch operations that respect OMP_NUM_THREADS for parallelism.
- If OMP_NUM_THREADS is wrong, even a fast Torchvision path runs oversubscribed threads inside the container's CPU quota, triggering CPU throttling.
- If Torchvision isn't selected, the PIL path bypasses the OMP thread pool entirely but is slow per-request.
The combined fix delivers fast preprocessing and correct thread sizing, leaving CPU-side preprocessing fast enough to feed the GPU without blocking.
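One way to size OMP_NUM_THREADS to the container quota is to read the cgroup v2 `cpu.max` file. The source does not say how Databricks derived the value, so this is a sketch under assumptions: the container uses cgroup v2 mounted at `/sys/fs/cgroup`, and `threads_from_cpu_max` and `apply_thread_limit` are hypothetical helper names.

```python
import os
from pathlib import Path


def threads_from_cpu_max(cpu_max_text: str, fallback: int) -> int:
    """Convert cgroup v2 `cpu.max` content ("<quota> <period>" or
    "max <period>") into a whole number of threads, never below 1."""
    quota, _, period = cpu_max_text.strip().partition(" ")
    if quota == "max" or not period:
        return fallback  # no quota configured
    return max(1, int(quota) // int(period))


def container_thread_count(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Thread count bounded by both the host CPU count and the CPU quota."""
    host_cpus = os.cpu_count() or 1
    path = Path(cpu_max_path)
    if not path.exists():
        return host_cpus
    return min(host_cpus, threads_from_cpu_max(path.read_text(), host_cpus))


def apply_thread_limit() -> int:
    """Set OMP_NUM_THREADS; call this before importing torch so the
    OpenMP runtime picks the value up when it initialises."""
    threads = container_thread_count()
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads
```

Rounding the quota down is a deliberate choice here: a fractional quota such as 1.5 CPUs becomes one thread, which avoids oversubscription at the cost of some headroom. Rounding up is a reasonable alternative if throttling is not observed.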
Why the default matters operationally¶
The structural lesson: library defaults can hide order-of-magnitude performance bugs. The Hugging Face library provides a fast processor, but it's opt-in for some models. A team deploying the model without a profiling pass would never know the slow path is running until traffic load surfaces it.
Operational discipline:
- Profile every multimodal model on a representative request pattern before production rollout.
- Check explicitly which image-processor backend is selected via the model's processor configuration (processor.__class__ or similar).
- Set use_fast=True defensively at every model-load site, even if the default is already fast; the default may change in a model update.
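The backend check and the defensive override can be combined in one load-site wrapper. This is a sketch, not an official API: `uses_fast_backend` and `load_image_processor` are hypothetical helpers, and the check relies on the assumption that Torchvision-backed classes follow the `...ImageProcessorFast` naming convention, which should be verified against the installed Transformers version.

```python
import logging

logger = logging.getLogger(__name__)


def uses_fast_backend(processor) -> bool:
    """Heuristic check based on the class name ending in "Fast"
    (assumed naming convention, not a documented guarantee)."""
    return type(processor).__name__.endswith("Fast")


def load_image_processor(model_name: str, loader=None):
    """Load an image processor with use_fast=True and log which backend
    was actually selected. `loader` is injectable for testing."""
    if loader is None:
        # Imported lazily so the helper above can be used without Transformers.
        from transformers import AutoImageProcessor
        loader = AutoImageProcessor.from_pretrained

    processor = loader(model_name, use_fast=True)
    backend = type(processor).__name__
    if uses_fast_backend(processor):
        logger.info("image processor for %s: %s", model_name, backend)
    else:
        logger.warning(
            "image processor for %s is %s; the fast path was not selected",
            model_name, backend,
        )
    return processor
```

Logging the class name at every load site leaves a record in the deployment logs, so a silent change of backend after a model or library update shows up without a separate profiling run.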
Related slow-default cases¶
This is one instance of a broader class:
| Library / Default | Issue | Fix |
|---|---|---|
| HF Transformers PIL processor | 10× slower than Torchvision | use_fast=True |
| HF tokenizers (slow Python path) | 10-100× slower than Rust path | use_fast=True on tokeniser load |
| Pandas read_csv engine | Python engine is slower than the C engine | engine="c" (already the default, but worth verifying: options such as regex separators fall back to the Python engine) |
| NumPy default integer type | platform-dependent: int32 on 64-bit Windows before NumPy 2.0, int64 on Linux | Explicit dtype= |
A robust deployment audit checks all of these explicitly.
Risks and mitigations¶
- Torchvision processor produces subtly different output than PIL — different rounding / interpolation default. Mitigation: validate model accuracy on a held-out eval set before production rollout.
- use_fast=True not supported on the specific model — newer or bespoke models may not have the Torchvision path. Mitigation: profile to confirm; fall back to PIL if necessary and accept the cost or move preprocessing to a separate process pool.
- Future Torchvision API changes break the override. Mitigation: pin Torchvision version in the deployment manifest.
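Before running a full evaluation for the first risk, a quick numeric comparison of the two processors' outputs on sample images can show how large the divergence is. The sketch below is a screening step under assumptions, not a substitute for the held-out eval: `max_abs_diff` is a hypothetical helper, and it accepts anything exposing `flatten()` and `tolist()` (NumPy arrays, torch tensors) or a plain sequence of floats.

```python
def _flatten(values):
    """Return a flat Python list from an array-like or a plain sequence."""
    if hasattr(values, "flatten") and hasattr(values, "tolist"):
        return values.flatten().tolist()
    return list(values)


def max_abs_diff(a, b) -> float:
    """Largest element-wise absolute difference between two outputs."""
    flat_a, flat_b = _flatten(a), _flatten(b)
    if len(flat_a) != len(flat_b):
        raise ValueError(
            f"shape mismatch: {len(flat_a)} vs {len(flat_b)} elements"
        )
    return max((abs(x - y) for x, y in zip(flat_a, flat_b)), default=0.0)


# Intended usage (requires transformers; `model_name` and `image` are
# supplied by the caller):
#
#   slow = AutoImageProcessor.from_pretrained(model_name, use_fast=False)
#   fast = AutoImageProcessor.from_pretrained(model_name, use_fast=True)
#   diff = max_abs_diff(
#       slow(images=image, return_tensors="pt")["pixel_values"],
#       fast(images=image, return_tensors="pt")["pixel_values"],
#   )
```

A small non-zero difference is expected given the different interpolation and rounding behaviour; what counts as acceptable is model-specific and is settled by the accuracy evaluation, not by this number alone.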
Open questions¶
- Which specific Hugging Face models default to PIL vs Torchvision — the post does not enumerate.
- Eval-quality impact of the processor swap — Databricks does not disclose whether the swap was verified to be model-quality neutral.
- GPU-side preprocessing as a future direction (NVIDIA DALI, custom CUDA kernels) — not addressed in the post.
Seen in¶
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale — canonical wiki disclosure of Torchvision-over-PIL as a load-bearing performance fix for multimodal LLM serving on Databricks Model Serving. >3× RPS jump on same hardware when paired with the OMP_NUM_THREADS fix.
Related¶
- concepts/multimodal-cpu-bottleneck — the failure mode this pattern fixes.
- concepts/omp-num-threads-container-misconfiguration — the companion fix shipped together.
- concepts/cpu-bound-serving-small-fast-model — adjacent regime.
- patterns/multiprocessing-runtime-for-cpu-bound-serving — the adjacent CPU-bound-fix pattern that does not solve this case alone.
- systems/pytorch / Torchvision — the substrate.
- systems/databricks-model-serving — the deployment context.
- systems/vllm — the open-source engine class this applies to.