SYSTEM Cited by 1 source

TensorRT-LLM

Definition

TensorRT-LLM is NVIDIA's open-source library for high-performance inference of large language models on NVIDIA GPUs. It compiles Transformer models (encoder, decoder, encoder-decoder) into optimised GPU-runnable engines, with primitives tailored to autoregressive decoding workloads — KV-cache management, in-flight batching, beam search, speculative decoding, FP8 / INT4 quantisation, and tensor / pipeline parallelism.

In production deployments TensorRT-LLM is typically paired with the NVIDIA Triton Inference Server: Triton hosts the request-handling layer, while TensorRT-LLM compiles and runs the model itself.
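As a rough illustration of that split, the sketch below builds (but does not send) an HTTP request for Triton's generate endpoint. The host, the model name "ensemble", and the "text_input" / "max_tokens" fields follow the conventions of NVIDIA's public TensorRT-LLM backend examples and are assumptions here; the interface of any particular deployment may differ.

```python
import json
import urllib.request

def build_generate_request(host, model, prompt, max_tokens):
    """Build a POST request for Triton's HTTP generate endpoint (not sent)."""
    # Assumed URL shape: /v2/models/<model>/generate on Triton's HTTP port.
    url = f"http://{host}/v2/models/{model}/generate"
    # Assumed input names, taken from the public TensorRT-LLM backend examples.
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical host, model name, and prompt, for illustration only.
request = build_generate_request("localhost:8000", "ensemble", "organic oat milk", 16)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it to a running Triton server. The client only ever sees Triton's HTTP interface; the compiled engine runs behind it.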

Why it shows up on the wiki

Disclosed as the inference-engine layer of Instacart's generative ads retrieval GPU serving stack:

"To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."

The combination of TensorRT-LLM, Triton, and a Go-native shell is canonicalised as the patterns/gpu-serving-stack-tensorrt-llm-triton pattern.

Why TensorRT-LLM specifically (vs. plain TensorRT or vLLM)

TensorRT-LLM is purpose-built for autoregressive decoding at production scale:

  • In-flight batching: new requests can join the batch mid-decode, avoiding the head-of-line blocking that plain dynamic batching suffers from when one request's longer sequence pins the whole batch.
  • KV-cache management: paged (block-based) KV cache, eviction policies, and multi-tenant isolation, so the cache doesn't dominate GPU memory.
  • Beam search support: a first-class primitive, which matters because Instacart's CG uses beam search at serve time rather than greedy decoding.
  • Quantisation paths: pre-built workflows for FP8, INT4, and SmoothQuant.
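The head-of-line effect described in the first bullet can be seen in a toy scheduler. This is a minimal sketch, not TensorRT-LLM's scheduler: it assumes each request needs a fixed number of decode steps, ignores prefill cost, and uses made-up request lengths and batch size.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch must finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The longest request in the batch pins every slot until it is done.
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    """Total decode steps when a freed slot is refilled on the next step."""
    waiting = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before decoding.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [8, 2, 2, 2]  # hypothetical decode lengths, one long and three short
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)  # 10 8
```

In the static case the first short request waits out the long one before the next batch can start; with in-flight admission the three short requests pass through the second slot while the long request is still decoding.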

For Instacart's ads-retrieval workload (an autoregressive decoder with beam search), these primitives are the difference between a viable serving cost and a non-viable one. Without them, the latency "penalties that previously restricted our catalog coverage" would re-emerge in a different form, as per-request KV-cache memory or per-request decoding latency.
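To make the per-request KV-cache cost concrete, here is a back-of-the-envelope estimate from model shape. All dimensions below are hypothetical (Instacart's model shape is not disclosed), and multiplying by beam width treats every beam as holding its own full cache, which is a simplification.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, beam_width=1):
    """Estimate KV-cache bytes for one request."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * beam_width

MIB = 1024 ** 2
# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1024)

fp16_greedy = kv_cache_bytes(**shape)                                # 2-byte values, 1 beam
fp16_beam4 = kv_cache_bytes(**shape, beam_width=4)                   # 2-byte values, 4 beams
fp8_beam4 = kv_cache_bytes(**shape, bytes_per_value=1, beam_width=4) # 1-byte values, 4 beams
print(fp16_greedy // MIB, fp16_beam4 // MIB, fp8_beam4 // MIB)  # 128 512 256
```

Under these assumptions, beam search multiplies the per-request footprint and a 1-byte cache format halves it, which is why cache paging and quantisation bear directly on serving cost.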

Caveats

  • This is a stub page capturing TensorRT-LLM as a referenced inference engine. Its full architecture has not been deeply ingested on the wiki.
  • Specific compilation flags, quantisation choices, and parallelism strategy used by Instacart are not disclosed.
