Infire (Cloudflare inference engine)¶
Overview¶
Infire is Cloudflare's proprietary inference engine, written in Rust, announced at Birthday Week 2025 ("Cloudflare's most efficient AI inference engine") and extended on 2026-04-16 with multi-GPU support for running extra-large language models such as Kimi K2.5 on Workers AI. "Cloudflare has a proprietary inference engine, Infire, that makes machine learning models faster. Infire is an inference engine written in Rust, designed to support Cloudflare's unique challenges with inference given our distributed global network."
Role¶
Infire is the model-execution tier of Workers AI. Competitors in the design space include vLLM, TensorRT-LLM, SGLang, and Hugging Face TGI. Cloudflare's stated rationale for owning the engine is up to 20% higher tokens-per-second throughput on unconstrained systems, plus the ability to run the latest models on lower-end hardware where it was previously infeasible.
Multi-GPU parallelism (2026-04 addition)¶
Infire supports three parallelism axes, typically combined:
- Pipeline parallelism — different transformer layers live on different GPUs; requests flow through the pipeline. "Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing."
- Tensor parallelism — individual weight matrices split across GPUs, each GPU owning a slice; activations are all-reduced across GPUs per layer. "Infire optimizes for reducing cross-GPU communication, making it as fast as possible."
- Expert parallelism — for MoE (mixture-of-experts) models, different experts live on different GPUs; activations routed to whichever GPU hosts the chosen expert.
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Lower memory overhead than baseline¶
Infire already had lower memory overhead than vLLM at the time of the 2026-04-16 post; further work tightened activation and internal-state memory:
| Model | Hardware | VRAM free for KV cache | Token capacity |
|---|---|---|---|
| Llama 4 Scout | 2× H200 | >56 GiB | >1.2M tokens |
| Kimi K2.5 (>1T params, ~560 GB weights) | 8× H100 (not H200) | >30 GiB | — |
"In both cases you would have trouble even booting vLLM in the first place." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Boot time¶
"Even for the largest models, such as Kimi K2.5, Infire can begin serving requests in under 20 seconds. The load times are only bounded by the drive speed." Cold-boot-under-20s matters for autoscaling: a new replica can absorb a traffic burst before the burst ends. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Throughput headline¶
"Investing in our proprietary inference engine enables us to maximize our hardware by getting up to 20% higher tokens per second throughput on unconstrained systems, and also enabling us to use lower-end hardware to run the latest models, where it was previously completely infeasible." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Related primitives in the Workers AI serving stack¶
- PD disaggregation — Infire is the engine running on both the prefill and decode tiers; the load balancer and KV-transfer substrate sit above it.
- KV cache + Mooncake Transfer Engine — Infire instances share KV cache cluster-wide via RDMA; not explicitly stated whether the Mooncake integration is inside Infire or a sibling service.
- Speculative decoding + EAGLE-3 — another latency optimisation active during decode on Kimi K2.5.
- x-session-affinity — the client-signal routing layer above Infire that keeps same-session requests warm on the same KV cache.
Design posture¶
A Rust-first serving stack mirrors Cloudflare's broader Rust posture (systems/pingora, systems/trie-hard). The Rust choice implies:
- Memory-safe systems programming — avoids the C++ UB surface that haunts long-lived serving daemons.
- Manual memory control without GC jitter in the hot path.
- First-class interop with CUDA / ROCm / TPU via thin FFI layers (not detailed in the post).
Caveats¶
- Not open source at time of writing — unlike systems/pingora or systems/trie-hard, Infire is a proprietary engine.
- vLLM comparison is one-sided. The Infire-vs-vLLM framing and the 20% throughput figure come from Cloudflare; no third-party benchmark methodology.
- "20% higher tok/s on unconstrained systems" lacks baseline specification (vs vLLM? vs earlier Infire? which model/hardware triple?) and workload specification.
- Boot-time "bounded by drive speed" un-benchmarked. No storage-substrate details (local NVMe? networked? per-node cache pre-warm?).
- Multi-GPU parallelism described at the "what", not the "how" — no detail on all-reduce algorithm, scheduling, or overlap of communication with compute.
Complementary: Unweight weight compression¶
Where Infire's discipline is activation / internal-state memory, Unweight (2026-04-17) attacks the weight side of the same VRAM budget: lossless Huffman coding on BF16 exponent bytes yields ~22% model-size reduction on Llama-3.1-8B (~3 GB VRAM saved per instance) via fused reconstruction directly into tensor cores. Savings from the two systems are additive — every byte Unweight saves on MLP weights is a byte of KV-cache or serving-capacity headroom Infire can use. Both are Cloudflare-proprietary; Unweight ships open-source kernels (systems/unweight-kernels). (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
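The exponent-byte idea above is easy to demonstrate: a BF16 value is just the top two bytes of an FP32, and its high byte (sign plus most of the exponent) takes few distinct values across trained weights, which is what makes it Huffman-friendly. The sketch below shows only the lossless byte split and round trip on stand-in values; Unweight's actual Huffman tables and fused GPU reconstruction are not reproduced here.

```python
# Sketch of the weight-side idea: split BF16 words into a high-byte stream
# (sign + 7 exponent bits, highly skewed and thus compressible) and a
# low-byte stream, then rebuild losslessly. Stand-in values, not real
# weights; this is not Unweight's implementation.
import struct

def to_bf16_bytes(x: float) -> bytes:
    """Truncate an FP32 to BF16: keep the top 2 big-endian bytes."""
    return struct.pack(">f", x)[:2]

def split_streams(values):
    hi = bytes(to_bf16_bytes(v)[0] for v in values)  # sign + 7 exponent bits
    lo = bytes(to_bf16_bytes(v)[1] for v in values)  # 1 exp bit + mantissa
    return hi, lo

def merge_streams(hi, lo):
    """Rebuild each BF16 value (as an FP32 with zeroed low mantissa bytes)."""
    return [struct.unpack(">f", bytes([h, l, 0, 0]))[0] for h, l in zip(hi, lo)]

weights = [0.02, -0.01, 0.5, -0.25]     # stand-in "weights"
hi, lo = split_streams(weights)
rebuilt = merge_streams(hi, lo)
# Round trip matches the BF16 truncation of the originals exactly.
assert [to_bf16_bytes(v) for v in weights] == [to_bf16_bytes(v) for v in rebuilt]
print(sorted(set(hi)))   # few distinct high bytes -> short Huffman codes
```

The compression win comes from the skew of the high-byte stream: entropy-code that stream, store the low bytes raw, and decompression is a table lookup per weight, which is what makes fused on-GPU reconstruction feasible.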
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — multi-GPU extension + memory-overhead + boot-time + throughput numbers. Original 2025-Birthday-Week announcement is linked but pre-dates this wiki.
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — cross-reference: weight-side memory-footprint reduction via lossless Huffman compression on BF16 exponents, complementary to Infire's activation-memory discipline.
Related¶
- systems/workers-ai — parent serving platform.
- systems/unweight — complementary weight-side memory lever; Unweight attacks the weights, Infire attacks the activations.
- systems/kimi-k2-5 — canonical extra-large model Infire now serves.
- systems/vllm — the reference baseline Cloudflare measures against.
- systems/sglang — sibling inference engine; Mooncake integration in Workers AI uses SGLang HiCache as an option alongside LMCache.
- systems/mooncake-transfer-engine — the KV-cache transfer substrate between Infire instances.
- concepts/tensor-parallelism / concepts/pipeline-parallelism / concepts/expert-parallelism — parallelism axes Infire now supports.
- concepts/kv-cache — the structural reason multi-GPU memory budget dominates design.
- concepts/multi-gpu-serving — broader context.
- companies/cloudflare