Hugging Face Inference¶
Hugging Face Inference (HF Inference / transformers-based
inference) is the family of Hugging Face serving options built on
the transformers library: in-process pipeline() helpers, the
hosted Hugging Face Inference API endpoint, and the Text Generation
Inference (TGI) / Text Embeddings Inference (TEI) server products.
It is widely used as the default first-pass inference path when
teams stand up a transformer-serving service, because the same
model files downloaded from huggingface.co work immediately.
Properties relevant to system design (as the baseline)¶
- Classical (B, S_max) batching in default configurations — sequences in a batch are padded to the longest, so inference time tracks B × S_max rather than Σ token_count.
- Per-request serving — the common deployment path is one request in / one response out; batching support varies by server flavour and is generally not the aggressive continuous / variable-length batching that vLLM/SGLang default to.
- Highly flexible, less optimised — trades raw throughput for immediate model-family compatibility.
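The (B, S_max) cost model above can be illustrated with a small sketch. This is pure Python with made-up token lengths, not a real profiling run; it just shows why one long sequence in a padded batch inflates compute:

```python
# Classical padded batching: every sequence is padded to the longest
# member of its batch, so compute scales with B * S_max rather than
# with the number of real tokens.
def padded_batch_cost(token_lengths):
    """Tokens actually processed when the batch is padded to its longest member."""
    return len(token_lengths) * max(token_lengths)

def useful_token_count(token_lengths):
    """Tokens carrying real content (what padding-free engines process)."""
    return sum(token_lengths)

# Hypothetical batch: one long query dominates nine short ones.
lengths = [12, 15, 9, 480, 11, 14, 10, 13, 8, 16]

padded = padded_batch_cost(lengths)   # 10 * 480 = 4800
useful = useful_token_count(lengths)  # 588
print(f"padded: {padded} tokens, useful: {useful}")
```

Here barely an eighth of the compute does real work, which is the structural gap that padding removal and continuous batching close.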
Role on the wiki¶
On the wiki, "Hugging Face Inference" appears as the baseline against which more-optimised engines are measured for production transformer serving. vLLM and SGLang both improve on it structurally via padding removal, continuous batching, and memory-aware KV-cache management.
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — named as the old pipeline for query embedding serving: "no batching + Hugging Face Inference" — against which the new vLLM + token-count-batched pipeline produced 50% GPU-inference-latency reduction with 3× fewer GPUs on voyage-3-large. Engine switch alone (HF → vLLM) accounts for up to ~20 ms GPU-inference-time reduction per model; batching layered on top drives the 8× throughput gain. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
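The token-count-based batching named in the entry above can be sketched as follows. This is an illustration of the general idea (pack requests until a per-batch token budget is hit, so each batch does a similar amount of real work), not the MongoDB/Voyage implementation; the budget and lengths are invented:

```python
# Token-count-based batching sketch: instead of a fixed batch size B,
# greedily pack requests until the batch's total token count would
# exceed a budget, then start a new batch.
def token_count_batches(token_lengths, token_budget):
    batches, current, current_tokens = [], [], 0
    for length in token_lengths:
        if current and current_tokens + length > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

lengths = [120, 40, 300, 15, 15, 200, 90]
print(token_count_batches(lengths, token_budget=400))
# → [[120, 40], [300, 15, 15], [200, 90]]
```

Because every batch represents roughly the same token workload, tail latency stops being dominated by whichever batch happened to contain the longest request.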
Stub — no deeper HF-Inference architectural post yet ingested.