
TensorRT-LLM

Definition

TensorRT-LLM is NVIDIA's open-source library for high-performance inference of large language models on NVIDIA GPUs. It compiles Transformer models (encoder, decoder, encoder-decoder) into optimised GPU-runnable engines, with primitives tailored to autoregressive decoding workloads — KV-cache management, in-flight batching, beam search, speculative decoding, FP8 / INT4 quantisation, and tensor / pipeline parallelism.

In production deployments TensorRT-LLM is typically paired with the NVIDIA Triton Inference Server: Triton provides the request-handling layer, and TensorRT-LLM compiles and runs the model itself.
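The division of labour can be illustrated from the client side. The sketch below builds, but does not send, an HTTP request for Triton's generate endpoint. The field names `text_input`, `max_tokens`, and `beam_width` follow the example ensemble shipped with the TensorRT-LLM backend for Triton as commonly documented. The host, the model name `ensemble`, and the helper `build_generate_request` are assumptions for illustration, and nothing here describes Instacart's actual interface.

```python
import json
import urllib.request


def build_generate_request(base_url, model_name, prompt, max_tokens, beam_width=1):
    """Build (but do not send) a POST request for Triton's generate endpoint.

    Hypothetical helper: the payload field names are assumed to match the
    example ensemble model of the TensorRT-LLM backend for Triton.
    """
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/generate"
    payload = {
        "text_input": prompt,      # prompt text; tokenised server-side
        "max_tokens": max_tokens,  # upper bound on generated tokens
        "beam_width": beam_width,  # 1 means no beam search
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request(
        "http://localhost:8000", "ensemble", "organic oat milk", 16, beam_width=4
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # Sending requires a running Triton server, for example:
    # with urllib.request.urlopen(req, timeout=5) as resp:
    #     print(json.load(resp))
```

Triton owns everything visible in this request (routing, HTTP handling, model lookup), while the compiled TensorRT-LLM engine behind the named model performs the decoding.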

Why it shows up on the wiki

Instacart disclosed TensorRT-LLM as the inference-engine layer of its generative ads retrieval GPU serving stack:

"To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."

The combination of TensorRT-LLM + Triton + Go-native shell is canonicalised as the patterns/gpu-serving-stack-tensorrt-llm-triton pattern.

Why TensorRT-LLM specifically (vs. plain TensorRT or vLLM)

TensorRT-LLM is purpose-built for autoregressive decoding at production scale:

In-flight batching: new requests can join the batch mid-decode, avoiding the head-of-line blocking that plain dynamic batching suffers when one request's longer sequence pins the whole batch.
KV-cache management: paged (block-based) KV cache, eviction policies, and multi-tenant isolation, so that the cache doesn't dominate GPU memory.
Beam search support: a first-class primitive (this matters for Instacart's CG, which uses beam search at serve time rather than greedy decoding).
Quantisation paths: pre-built FP8 / INT4 / SmoothQuant workflows.
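The head-of-line effect behind in-flight batching can be seen in a toy simulation. This is a minimal sketch under simplifying assumptions, not TensorRT-LLM's scheduler: one step decodes exactly one token for every active request, prefill cost is ignored, and the function names and request tuples are hypothetical.

```python
from collections import deque


def static_batching(requests, slots):
    """requests: list of (arrival_step, tokens_to_decode), tokens >= 1.

    A batch is formed from arrived requests and runs to completion;
    no new request may join until the longest member has finished.
    Returns the finish step of each request.
    """
    pending = deque(sorted(range(len(requests)), key=lambda i: requests[i][0]))
    finish = [None] * len(requests)
    now = 0
    while pending:
        # Idle until the next request has arrived, if necessary.
        now = max(now, requests[pending[0]][0])
        batch = []
        while pending and len(batch) < slots and requests[pending[0]][0] <= now:
            batch.append(pending.popleft())
        for i in batch:
            finish[i] = now + requests[i][1]
        # The slots stay pinned until the longest sequence completes.
        now += max(requests[i][1] for i in batch)
    return finish


def inflight_batching(requests, slots):
    """Same inputs; free slots are refilled before every decode step."""
    pending = deque(sorted(range(len(requests)), key=lambda i: requests[i][0]))
    finish = [None] * len(requests)
    remaining = {}  # active request index -> tokens still to decode
    now = 0
    while pending or remaining:
        # Admit arrived requests into any free slot.
        while pending and len(remaining) < slots and requests[pending[0]][0] <= now:
            i = pending.popleft()
            remaining[i] = requests[i][1]
        now += 1  # one decode step for every active request
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] == 0:
                finish[i] = now
                del remaining[i]
    return finish


if __name__ == "__main__":
    # One long request (8 tokens) and three short ones (2 tokens), two slots.
    reqs = [(0, 8), (0, 2), (1, 2), (1, 2)]
    print("static   :", static_batching(reqs, slots=2))
    print("in-flight:", inflight_batching(reqs, slots=2))
```

In this example the two late short requests finish at steps 10 and 10 under run-to-completion batching, but at steps 4 and 6 when they can take over the slot vacated by the first short request.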

For Instacart's ads-retrieval workload, an autoregressive decoder with beam search, these primitives are the difference between a viable serving cost and a non-viable one. Without them, the latency "penalties that previously restricted our catalog coverage" would re-emerge in a different form: as per-request KV-cache memory or as per-request decoding latency.
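The per-request KV-cache cost can be made concrete with back-of-the-envelope arithmetic. All model dimensions below are hypothetical (Instacart's are not disclosed), and the layout is a simplification chosen for illustration, not a description of TensorRT-LLM internals: the paged variant assumes prompt blocks are stored once and shared by all beams, and that each beam owns its own blocks for generated tokens.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Keys and values (factor 2) are kept for every layer and every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def contiguous_reservation(per_token, max_seq_len, beam_width):
    # Worst case reserved up front for every beam, regardless of actual length.
    return per_token * max_seq_len * beam_width


def paged_allocation(per_token, prompt_len, generated_len, beam_width, block_tokens):
    # Assumption of this sketch: prompt blocks shared across beams,
    # generated-token blocks allocated per beam, rounded up to whole blocks.
    prompt_blocks = math.ceil(prompt_len / block_tokens)
    beam_blocks = beam_width * math.ceil(generated_len / block_tokens)
    return (prompt_blocks + beam_blocks) * block_tokens * per_token


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical decoder: 24 layers, 8 KV heads, head dim 128, FP16 cache.
    per_token = kv_bytes_per_token(24, 8, 128, bytes_per_elem=2)
    reserved = contiguous_reservation(per_token, max_seq_len=2048, beam_width=4)
    paged = paged_allocation(per_token, prompt_len=100, generated_len=20,
                             beam_width=4, block_tokens=64)
    print(f"per token : {per_token} bytes")
    print(f"contiguous: {reserved / MIB:.0f} MiB per request")
    print(f"paged     : {paged / MIB:.0f} MiB per request")
```

Under these assumed numbers a short request costs 36 MiB with block-based allocation against 768 MiB with worst-case reservation, and storing the cache in FP8 (`bytes_per_elem=1`) would halve either figure. The gap is what determines how many beam-search requests fit on one GPU at once.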

Caveats

This is a stub page capturing TensorRT-LLM as a referenced inference engine. Its full architecture has not been deeply ingested on the wiki.
Specific compilation flags, quantisation choices, and parallelism strategy used by Instacart are not disclosed.

Seen in

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — substrate for Instacart's generative ads retrieval GPU stack.

systems/nvidia-triton-inference-server — Triton hosts TensorRT-LLM compiled engines in production.
systems/instacart-generative-ads-retrieval — wiki-disclosed production user.
systems/instacart-griffin-2 — Instacart's ML platform that hosts the TensorRT-LLM + Triton stack.
systems/torch-compile — alternative compilation path used by Meta's SilverTorch (in-graph index tensor paradigm) — different shape, same concern (move from research-PyTorch to optimised-serving-engine).
patterns/gpu-serving-stack-tensorrt-llm-triton — canonical pattern this system anchors.