Hugging Face Inference¶
Hugging Face Inference (HF Inference / transformers-based
inference) is the family of Hugging Face serving options built on
the transformers library: in-process pipeline() helpers, the
hosted Hugging Face Inference API endpoint, and the Text Generation
Inference (TGI) / Text Embeddings Inference (TEI) server products.
It is widely used as the default first-pass inference path when
teams stand up a transformer-serving service, because the same
model files downloaded from huggingface.co work immediately.
Properties relevant to system design (as the baseline)¶
- Classical (B, S_max) batching in default configurations — sequences in a batch are padded to the longest, so inference time tracks B × S_max rather than Σ token_count.
- Per-request serving — the common deployment path is one request in / one response out; batching support varies by server flavour and is generally not the aggressive continuous / variable-length batching that vLLM/SGLang default to.
- Highly flexible, less optimised — trades raw throughput for immediate model-family compatibility.
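The (B, S_max) cost model above can be illustrated with a small sketch. This is pure Python with made-up token lengths, not a real profiling run; it just shows why one long sequence in a padded batch inflates compute:

```python
# Classical padded batching: every sequence is padded to the longest
# member of its batch, so compute scales with B * S_max rather than
# with the number of real tokens.
def padded_batch_cost(token_lengths):
    """Tokens actually processed when the batch is padded to its longest member."""
    return len(token_lengths) * max(token_lengths)

def useful_token_count(token_lengths):
    """Tokens carrying real content (what padding-free engines process)."""
    return sum(token_lengths)

# Hypothetical batch: one long query dominates nine short ones.
lengths = [12, 15, 9, 480, 11, 14, 10, 13, 8, 16]

padded = padded_batch_cost(lengths)   # 10 * 480 = 4800
useful = useful_token_count(lengths)  # 588
print(f"padded: {padded} tokens, useful: {useful}")
```

Here barely an eighth of the compute does real work, which is the structural gap that padding removal and continuous batching close.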
Role on the wiki¶
On the wiki, "Hugging Face Inference" appears as the baseline against which more-optimised engines are measured for production transformer serving. vLLM and SGLang both improve on it structurally via padding removal, continuous batching, and memory-aware KV-cache management.
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — named as the old pipeline for query embedding serving: "no batching + Hugging Face Inference" — against which the new vLLM + token-count-batched pipeline produced 50% GPU-inference-latency reduction with 3× fewer GPUs on voyage-3-large. Engine switch alone (HF → vLLM) accounts for up to ~20 ms GPU-inference-time reduction per model; batching layered on top drives the 8× throughput gain. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
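The token-count-based batching named in the entry above can be sketched as follows. This is an illustration of the general idea (pack requests until a per-batch token budget is hit, so each batch does a similar amount of real work), not the MongoDB/Voyage implementation; the budget and lengths are invented:

```python
# Token-count-based batching sketch: instead of a fixed batch size B,
# greedily pack requests until the batch's total token count would
# exceed a budget, then start a new batch.
def token_count_batches(token_lengths, token_budget):
    batches, current, current_tokens = [], [], 0
    for length in token_lengths:
        if current and current_tokens + length > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

lengths = [120, 40, 300, 15, 15, 200, 90]
print(token_count_batches(lengths, token_budget=400))
# → [[120, 40], [300, 15, 15], [200, 90]]
```

Because every batch represents roughly the same token workload, tail latency stops being dominated by whichever batch happened to contain the longest request.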
Stub — no deeper HF-Inference architectural post yet ingested.