
PATTERN Cited by 1 source

Torchvision over PIL image processing

Pattern

For multimodal LLM serving, explicitly select the Torchvision-based image processor at model-load time when both PIL- and Torchvision-backed processors are available. Resize and normalise are the most expensive CPU-side image operations, about 10× the cost of steps such as base64 decoding. The PIL-based default that some Hugging Face models ship with is slow enough to block the inference engine's event loop.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Among all CPU operations for images, image processing (resizing and normalization) is 10x slower than other operations like base64 decoding. Some Hugging Face models default to the PIL-based image processor, while others use the faster Torchvision-based processor."

"By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs. After the fix shipped, requests completed per second jumped > 3x with the same replicas and load."

When to use it

  • Serving Hugging Face Transformers models with image inputs (vision-language models, multimodal LLMs).
  • Production inference engines (vLLM and similar) where the front-end runs on Python's async event loop and a slow per-request preprocessing step blocks the loop.
  • Workloads where image preprocessing happens on CPU per-request before the GPU forward pass — the standard configuration today.
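A minimal asyncio-only sketch shows why a slow synchronous preprocessing step hurts an async front-end. A hypothetical `slow_preprocess` stand-in uses `time.sleep` in place of real resize+normalise work. It is called either inline in a coroutine or through `asyncio.to_thread`. Meanwhile a heartbeat task records the longest stretch during which the loop could not run anything else. The function names and timings are illustrative assumptions, not vLLM internals.

```python
import asyncio
import time


def slow_preprocess(seconds: float) -> str:
    """Hypothetical stand-in for a slow, synchronous image-processor call."""
    time.sleep(seconds)  # blocks the calling thread for the whole duration
    return "pixel_values"


async def max_heartbeat_gap(offload: bool, work_s: float = 0.3, tick_s: float = 0.01) -> float:
    """Run one preprocessing call next to a heartbeat task; return the longest gap between ticks."""
    gaps: list[float] = []
    stop = asyncio.Event()

    async def heartbeat() -> None:
        last = time.perf_counter()
        while not stop.is_set():
            await asyncio.sleep(tick_s)
            now = time.perf_counter()
            gaps.append(now - last)  # a large gap means the loop was blocked
            last = now

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(tick_s * 3)  # let the heartbeat start ticking
    if offload:
        await asyncio.to_thread(slow_preprocess, work_s)  # loop keeps serving other tasks
    else:
        slow_preprocess(work_s)  # runs on the loop thread and stalls every other task
    await asyncio.sleep(tick_s * 3)  # give the heartbeat a chance to record the stall
    stop.set()
    await hb
    return max(gaps)


if __name__ == "__main__":
    print("inline:", asyncio.run(max_heartbeat_gap(offload=False)))
    print("offloaded:", asyncio.run(max_heartbeat_gap(offload=True)))
```

Inline, the longest gap is roughly the full preprocessing time. When offloaded, it stays near the tick interval. A thread offload only helps when the work releases the GIL, as `time.sleep` does. Pure-Python CPU work would need a process pool instead, so making preprocessing itself fast remains the primary fix.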

When NOT to use it

  • Models where a custom GPU-side preprocessing pipeline (e.g. NVIDIA DALI, custom CUDA preprocessing) replaces the Hugging Face image processor entirely.
  • Workloads where image inputs are pre-processed offline and cached — the per-request preprocessing cost is amortised away.
  • Models that don't expose a Torchvision-backed alternative; some niche models only have PIL implementations.

Mechanics

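For many models, Hugging Face Transformers ships image processors in two implementations. The "slow" one is built on PIL and NumPy. The "fast" one is built on Torchvision, and its class names end in Fast (e.g. ViTImageProcessorFast). AutoImageProcessor chooses between them at load time: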

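```python
from transformers import AutoImageProcessor

# Illustrative checkpoint; substitute the model you are serving.
model_name = "Qwen/Qwen2-VL-7B-Instruct"

# Default (may be PIL-backed for some models)
processor = AutoImageProcessor.from_pretrained(model_name)

# Force Torchvision-backed (faster)
processor = AutoImageProcessor.from_pretrained(
    model_name, use_fast=True
)
```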

The use_fast=True flag (or the equivalent on the specific image-processor class) selects the Torchvision-backed implementation. As of 2026 this still defaults to False for some models, so the override must be set explicitly, per model and per deployment.

The Torchvision-backed processor uses torchvision.transforms.v2 functional ops, which run resize, normalise, crop, and similar operations in optimised native code directly on torch tensors. The PIL-backed path converts each image between PIL Image objects and NumPy arrays and performs rescale and normalise in NumPy. That adds per-call conversion and Python overhead. The speedup comes from avoiding that overhead and keeping the image as a torch tensor throughout.

Composition with OMP_NUM_THREADS fix

The 2026-05-27 disclosure pairs two independent fixes that together delivered >3× RPS:

  1. This pattern: switch to Torchvision processor.
  2. Fix OMP_NUM_THREADS to match container CPU quota.

Both are required because:

  • Torchvision dispatches to native Torch operations that respect OMP_NUM_THREADS for parallelism.
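  • If OMP_NUM_THREADS is unset or too high, the Torch thread pool can size itself to the host's core count rather than the container's CPU quota. Even the fast Torchvision path then runs oversubscribed threads inside the quota, triggering CPU throttling.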
  • If Torchvision isn't selected, the PIL path bypasses the OMP thread pool entirely but is slow per-request.

The combined fix delivers fast preprocessing and correct thread sizing, leaving CPU-side preprocessing fast enough to feed the GPU without blocking.
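A minimal sketch of sizing OMP_NUM_THREADS from the container's CPU quota follows. It assumes cgroup v2, where `/sys/fs/cgroup/cpu.max` holds `"<quota> <period>"` or `"max <period>"`. cgroup v1 hosts expose `cpu.cfs_quota_us` and `cpu.cfs_period_us` instead and are not handled here. The helper names are hypothetical. The assignment has to run before `torch` is imported, because PyTorch reads OMP_NUM_THREADS when it initialises its intra-op thread pool. `torch.set_num_threads` can adjust the pool afterwards.

```python
import os
from pathlib import Path


def threads_from_cpu_max(cpu_max: str, fallback: int) -> int:
    """Convert a cgroup v2 cpu.max line into a whole-CPU thread count.

    "200000 100000" means a 2-CPU quota; "max 100000" means no quota, so use the fallback.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return fallback
    # Round fractional quotas down, but never below one thread or above the host count.
    return max(1, min(fallback, int(quota) // int(period)))


def container_thread_budget(cgroup_root: str = "/sys/fs/cgroup") -> int:
    """Best-effort CPU budget for this process; assumes a cgroup v2 mount at cgroup_root."""
    host_cpus = os.cpu_count() or 1
    cpu_max = Path(cgroup_root) / "cpu.max"
    if cpu_max.is_file():
        try:
            return threads_from_cpu_max(cpu_max.read_text(), fallback=host_cpus)
        except (OSError, ValueError):
            return host_cpus  # unreadable or unexpected format
    return host_cpus  # cgroup v1 or no cgroup info: use the visible CPU count


# Run this before `import torch`; an OMP_NUM_THREADS that is already set is left untouched.
os.environ.setdefault("OMP_NUM_THREADS", str(container_thread_budget()))
```

For example, a Kubernetes pod with a 4-CPU limit sees `cpu.max` as `400000 100000`, which yields OMP_NUM_THREADS=4.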

Why the default matters operationally

The structural lesson: library defaults can hide order-of-magnitude performance bugs. The Hugging Face library provides a fast processor, but it's opt-in for some models. A team deploying the model without a profile pass would never know the slow path is running until traffic load surfaces it.
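A profile pass need not be elaborate. This minimal stdlib sketch times a per-request preprocessing callable over representative inputs and reports tail latency. The helper name is hypothetical. In practice, `preprocess` would wrap something like `processor(images=image, return_tensors="pt")`. You would run it once with the default processor and once with `use_fast=True`.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, Sequence


def profile_preprocess(
    preprocess: Callable[[Any], Any], requests: Sequence[Any], warmup: int = 3
) -> Dict[str, float]:
    """Time preprocess() once per request and return p50/p99/max latency in milliseconds."""
    if not requests:
        raise ValueError("need at least one representative request")
    for request in requests[:warmup]:  # keep lazy init and cache warm-up out of the numbers
        preprocess(request)
    samples = []
    for request in requests:
        start = time.perf_counter()
        preprocess(request)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99 = samples[max(0, math.ceil(0.99 * len(samples)) - 1)]  # nearest-rank percentile
    return {"p50_ms": statistics.median(samples), "p99_ms": p99, "max_ms": samples[-1]}
```

Comparing p50 and p99 between the two processors on the same inputs shows which path is slow before production traffic does.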

Operational discipline:

  • Profile every multimodal model on a representative request pattern before production rollout.
  • Check explicitly which image-processor backend is selected via the model's processor configuration (processor.__class__ or similar).
  • Set use_fast=True defensively at every model-load site, even if the default is already fast — the default may change in a model update.
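The last two checks can be folded into one load helper. This minimal sketch assumes the Transformers naming convention that Torchvision-backed image processors end in "Fast" (e.g. ViTImageProcessorFast). Both helper names are hypothetical. The suffix check is a heuristic, not an official API.

```python
from typing import Any


def is_fast_image_processor(processor: Any) -> bool:
    """Heuristic: Torchvision-backed Transformers image processors conventionally end in "Fast"."""
    return type(processor).__name__.endswith("Fast")


def load_fast_image_processor(model_name: str) -> Any:
    """Load with use_fast=True and fail loudly if a slow (PIL-backed) processor comes back anyway."""
    # Imported lazily so the heuristic above has no hard dependency on transformers.
    from transformers import AutoImageProcessor

    processor = AutoImageProcessor.from_pretrained(model_name, use_fast=True)
    if not is_fast_image_processor(processor):
        raise RuntimeError(
            f"{model_name}: loaded {type(processor).__name__}, not a Torchvision-backed processor"
        )
    return processor
```

Failing at load time turns a silent order-of-magnitude slowdown into a deployment error that someone has to acknowledge.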

This is one instance of a broader class:

| Library / Default | Issue | Fix |
|---|---|---|
| HF Transformers PIL processor | Much slower than the Torchvision processor on resize+normalise | use_fast=True |
| HF tokenizers (slow Python path) | 10-100× slower than the Rust path | use_fast=True on tokeniser load |
| Pandas read_csv default engine | C engine is faster than the Python engine | engine="c" (already the default, but some options fall back to the Python engine; worth verifying) |
| NumPy default integer type | Platform-dependent before NumPy 2.0: int32 on Windows, int64 on Linux | Explicit dtype= |

A robust deployment audit checks all of these explicitly.

Risks and mitigations

  • Torchvision processor produces subtly different output than PIL — different rounding / interpolation default. Mitigation: validate model accuracy on a held-out eval set before production rollout.
  • use_fast=True not supported on the specific model. Some niche or bespoke models have no Torchvision path, and Transformers then falls back to the slow processor with a warning rather than failing. Mitigation: check the loaded processor class to confirm. If only PIL is available, accept the cost or move preprocessing to a separate process pool.
  • Future Transformers or Torchvision API changes break the override. Mitigation: pin both Transformers and Torchvision versions in the deployment manifest.
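For the first mitigation, an accuracy gate can be quite small. This minimal sketch runs the held-out set through the model once with each processor and blocks rollout if the fast path loses more than a tolerance. `predict_slow` and `predict_fast` are hypothetical end-to-end callables. The 0.5-point default tolerance is an illustrative assumption, not a figure from the source.

```python
from typing import Callable, Hashable, Sequence, Tuple

Example = Tuple[object, Hashable]  # (raw input, expected label)


def accuracy(predict: Callable[[object], Hashable], examples: Sequence[Example]) -> float:
    """Fraction of held-out examples where the prediction matches the label."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


def processor_swap_is_neutral(
    predict_slow: Callable[[object], Hashable],
    predict_fast: Callable[[object], Hashable],
    examples: Sequence[Example],
    max_drop: float = 0.005,
) -> bool:
    """Allow rollout only if the fast-processor pipeline loses at most max_drop accuracy."""
    return accuracy(predict_fast, examples) >= accuracy(predict_slow, examples) - max_drop
```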

Open questions

  • Which specific Hugging Face models default to PIL vs Torchvision — the post does not enumerate.
  • Eval-quality impact of the processor swap — Databricks does not disclose whether the swap was verified to be model-quality neutral.
  • GPU-side preprocessing as a future direction (NVIDIA DALI, custom CUDA kernels) — not addressed in the post.
