TensorRT-LLM
Definition
TensorRT-LLM is NVIDIA's open-source library for high-performance inference of large language models on NVIDIA GPUs. It compiles Transformer models (encoder, decoder, encoder-decoder) into optimised GPU-runnable engines, with primitives tailored to autoregressive decoding workloads — KV-cache management, in-flight batching, beam search, speculative decoding, FP8 / INT4 quantisation, and tensor / pipeline parallelism.
In production deployments TensorRT-LLM is typically paired with the NVIDIA Triton Inference Server: Triton hosts the request-handling layer, while TensorRT-LLM compiles and runs the actual model.
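As a rough illustration of this split, a client talks only to Triton over HTTP and never to the engine directly. The sketch below builds (without sending) a request for Triton's HTTP generate endpoint. The model name `ensemble` and the `text_input` / `max_tokens` fields are assumptions borrowed from the example model repository of the TensorRT-LLM Triton backend; a real deployment defines its own model name and input tensors. The host, port, and prompt are hypothetical.

```python
import json
from urllib import request


def build_generate_request(base_url, model_name, prompt, max_tokens):
    """Build, but do not send, a POST request for Triton's generate endpoint."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/generate"
    # Field names are assumptions taken from the backend's example ensemble model.
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, model name, and prompt.
req = build_generate_request("http://localhost:8000", "ensemble", "organic oat milk", 16)
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)` against a running Triton instance; it is left out here so the sketch runs without a server.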
Why it shows up on the wiki
TensorRT-LLM is disclosed as the inference-engine layer of Instacart's generative ads retrieval GPU serving stack:
"To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."
The combination of TensorRT-LLM, Triton, and a Go-native shell is canonicalised as the patterns/gpu-serving-stack-tensorrt-llm-triton pattern.
Why TensorRT-LLM specifically (vs. plain TensorRT or vLLM)
TensorRT-LLM is purpose-built for autoregressive decoding at production scale:
- In-flight batching: new requests can join the batch mid-decode, avoiding the head-of-line blocking that plain dynamic batching suffers when one long sequence pins the whole batch.
- KV-cache management: a paged (block-allocated) KV cache with eviction policies and multi-tenant isolation, so that the cache doesn't dominate GPU memory.
- Beam search support: a first-class primitive (this matters for Instacart's CG, which uses beam search at serve time rather than greedy decoding).
- Quantisation paths: pre-built workflows for FP8, INT4, and SmoothQuant.
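The head-of-line blocking that in-flight batching avoids can be seen in a toy simulation. This is a simplified sketch, not TensorRT-LLM's scheduler: it assumes every active request produces one token per decode step, that a batch has a fixed number of slots, and that the decode lengths are hypothetical and known in advance.

```python
from collections import deque


def static_batching(lengths, slots):
    """Finish step of each request when a batch must drain before the next starts."""
    finish = []
    batch_start = 0
    for i in range(0, len(lengths), slots):
        group = lengths[i:i + slots]
        # Each request is done after its own number of decode steps...
        finish.extend(batch_start + n for n in group)
        # ...but the freed slots stay idle until the longest request ends.
        batch_start += max(group)
    return finish


def inflight_batching(lengths, slots):
    """Finish step of each request when freed slots are refilled immediately."""
    queue = deque(enumerate(lengths))
    active = {}  # request index -> decode steps still needed
    finish = [0] * len(lengths)
    clock = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < slots:
            idx, n = queue.popleft()
            active[idx] = n
        clock += 1  # one decode step produces one token per active request
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                finish[idx] = clock
                del active[idx]
    return finish


lengths = [8, 2, 2, 2]  # hypothetical decode lengths, each >= 1
print(static_batching(lengths, slots=2))    # [8, 2, 10, 10]
print(inflight_batching(lengths, slots=2))  # [8, 2, 4, 6]
```

With two slots, the one long request delays the last two short requests until step 10 under drain-then-refill batching, whereas refilling freed slots lets them finish at steps 4 and 6. Mean completion time drops from 7.5 to 5.0 steps in this toy case.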
For Instacart's ads-retrieval workload, an autoregressive decoder with beam search, these primitives are the difference between a viable and a non-viable serving cost. Without them, the latency "penalties that previously restricted our catalog coverage" would re-emerge in a different form, as per-request KV-cache memory or per-request decoding latency.
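A back-of-the-envelope estimate shows why per-request KV-cache memory matters under beam search. The sketch below uses the standard per-token KV size (keys plus values, for every layer and KV head) and rounds allocations up to whole cache blocks. Every number in it is hypothetical: the model shape, the block size, and the request lengths are not Instacart's, and the assumption that prompt blocks are shared across beams while each beam keeps its own generated tokens is a simplification made for this sketch.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache for one token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def paged_kv_bytes(prompt_tokens, new_tokens, beam_width, block_tokens, per_token_bytes):
    """KV-cache bytes for one request, rounded up to whole blocks.

    Assumption for this sketch: prompt blocks are shared by all beams,
    and each beam holds its own blocks for generated tokens.
    """
    prompt_blocks = math.ceil(prompt_tokens / block_tokens)
    generated_blocks = beam_width * math.ceil(new_tokens / block_tokens)
    return (prompt_blocks + generated_blocks) * block_tokens * per_token_bytes


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

# Hypothetical request: 100 prompt tokens, 20 new tokens, beam width 4, 64-token blocks.
for name, per_token in (("fp16", fp16), ("fp8", fp8)):
    total = paged_kv_bytes(100, 20, 4, 64, per_token)
    print(name, total / 2**20, "MiB")  # fp16 48.0 MiB, fp8 24.0 MiB
```

Under these assumptions a single request holds 48 MiB of cache at 16-bit precision and half that if the cache is stored in 8 bits, so the number of requests that fit on one GPU scales directly with beam width, block size, and cache precision.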
Caveats
- This is a stub page capturing TensorRT-LLM as a referenced inference engine. Its full architecture has not been deeply ingested on the wiki.
- Specific compilation flags, quantisation choices, and parallelism strategy used by Instacart are not disclosed.
Seen in
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — substrate for Instacart's generative ads retrieval GPU stack.
Related
- systems/nvidia-triton-inference-server — Triton hosts TensorRT-LLM compiled engines in production.
- systems/instacart-generative-ads-retrieval — wiki-disclosed production user.
- systems/instacart-griffin-2 — Instacart's ML platform that hosts the TensorRT-LLM + Triton stack.
- systems/torch-compile — alternative compilation path used by Meta's SilverTorch (in-graph index tensor paradigm) — different shape, same concern (move from research-PyTorch to optimised-serving-engine).
- patterns/gpu-serving-stack-tensorrt-llm-triton — canonical pattern this system anchors.