PATTERN Cited by 1 source

GPU serving stack — TensorRT-LLM + Triton

Pattern

For ML workloads that combine autoregressive decoding, beam search, and production latency budgets (generative retrieval, LLM inference, sequence-to-sequence), build the serving stack as three layers on top of the GPU(s):

┌──────────────────────────────────────────────┐
│  Service shell (Go-native or equivalent)      │
│  ├─ feature fetch                             │
│  ├─ prompt assembly                           │
│  ├─ response post-processing                  │
│  └─ HTTP/gRPC client to Triton                │
└─────────────────────┬────────────────────────┘
┌──────────────────────────────────────────────┐
│  NVIDIA Triton Inference Server               │
│  ├─ in-flight batching                        │
│  ├─ scheduling / model versioning             │
│  └─ TensorRT-LLM backend                      │
└─────────────────────┬────────────────────────┘
┌──────────────────────────────────────────────┐
│  TensorRT-LLM compiled model                  │
│  ├─ KV-cache management                       │
│  ├─ beam-search support                       │
│  ├─ FP8 / INT4 quantisation                   │
│  └─ tensor / pipeline parallelism             │
└─────────────────────┬────────────────────────┘
                  GPU(s)

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"As autoregressive decoding with beam search is fairly compute intensive, it was not viable to serve this model the legacy serving stack that relied on Python and CPU inference. To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server. … Implemented as a Go-native service, it delivers higher throughput and lower latency compared to the legacy Python environment."

Why each layer

TensorRT-LLM

NVIDIA's library for compiling and running high-performance LLM inference. It provides the primitives that autoregressive decoding workloads need and that plain TensorRT does not offer out of the box:

  • In-flight batching — new requests join the batch mid-decode, avoiding head-of-line blocking from longer-sequence requests.
  • KV-cache management: paged KV cache with block reuse and eviction, so cache memory is allocated per block rather than per maximum sequence length.
  • Beam-search support — first-class primitive (Instacart's generative retrieval uses beam search at serve time).
  • Quantisation paths — FP8, INT4, SmoothQuant pre-built workflows.
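To make the beam-search primitive concrete, here is a minimal, framework-free sketch of the algorithm itself. It is not TensorRT-LLM's implementation or API: `beam_search`, `toy_model`, and the probability table are hypothetical names and values invented for illustration, and a real engine would run the scoring step on the GPU against the KV cache.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A beam is (tokens generated so far, cumulative log-probability).
Beam = Tuple[Tuple[str, ...], float]


def beam_search(
    next_token_probs: Callable[[Sequence[str]], Dict[str, float]],
    beam_width: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[Beam]:
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams: List[Beam] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                # Finished sequences are carried forward unchanged.
                candidates.append((tokens, score))
                continue
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + (token,), score + math.log(prob)))
        # Prune: only the best `beam_width` candidates survive this step.
        candidates.sort(key=lambda beam: beam[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens and tokens[-1] == eos for tokens, _ in beams):
            break
    return beams


def toy_model(prefix: Sequence[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution standing in for a real model."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


greedy = beam_search(toy_model, beam_width=1, max_steps=3)
wide = beam_search(toy_model, beam_width=2, max_steps=3)
```

With this toy table, a width of 1 (greedy decoding) commits to "a" because it has the highest first-step probability, while a width of 2 keeps "b" alive and ends on the higher-probability sequence "b", "x". Every extra beam is another sequence to decode per step, which is why beam search multiplies the compute cost of autoregressive decoding.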

Triton Inference Server

The serving runtime that sits above TensorRT-LLM. Provides:

  • Multi-framework support — TensorRT, TensorRT-LLM, ONNX, PyTorch, custom Python — same API.
  • Batching / scheduling — combines requests across clients into GPU-efficient batches.
  • Model versioning — A/B test new model versions, blue-green deploy without downtime.
  • gRPC / HTTP endpoints — standard request shapes for service shells to consume.
  • Ensembles — chain multiple models in one request (e.g. embedding → retrieval → re-ranking).
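As an illustration of the "standard request shapes" a service shell consumes, the sketch below builds the URL and JSON body for Triton's HTTP inference endpoint, which follows the KServe v2 inference protocol (`/v2/models/<name>/infer`, optionally with `/versions/<version>`). The tensor names `text_input`, `max_tokens`, and `beam_width`, the `[1, 1]` shapes, and the model name `ensemble` are assumptions for illustration; the real names and shapes come from the deployed model's configuration. The helper `build_infer_request` is hypothetical and only constructs the request, it does not send it.

```python
import json
from typing import Optional, Tuple


def build_infer_request(
    base_url: str,
    model_name: str,
    prompt: str,
    max_tokens: int,
    beam_width: int,
    version: Optional[str] = None,
) -> Tuple[str, bytes]:
    """Return (url, body) for a KServe-v2-style inference call."""
    path = f"/v2/models/{model_name}"
    if version is not None:
        # Pinning a version is how a shell can target one side of an A/B test.
        path += f"/versions/{version}"
    url = base_url.rstrip("/") + path + "/infer"

    # Tensor names and shapes below are assumed, not taken from the source.
    body = {
        "inputs": [
            {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": [prompt]},
            {"name": "max_tokens", "shape": [1, 1], "datatype": "INT32", "data": [max_tokens]},
            {"name": "beam_width", "shape": [1, 1], "datatype": "INT32", "data": [beam_width]},
        ]
    }
    return url, json.dumps(body).encode("utf-8")


url, payload = build_infer_request("http://localhost:8000/", "ensemble", "organic milk", 8, 4)
pinned_url, _ = build_infer_request("http://localhost:8000", "ensemble", "organic milk", 8, 4, version="2")
```

The body would be sent with an HTTP POST and a `Content-Type: application/json` header; a Go shell would produce the same JSON, or use the gRPC equivalent.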

Go-native service shell

Replaces a Python+CPU shell with a Go-native one. Wins:

  • Higher throughput: Go's goroutine-per-request model handles high-concurrency request shapes more efficiently than Python's GIL-bound threads or single-threaded async event loops.
  • Lower latency — no Python interpreter overhead in the request-handling path.
  • Better isolation from inference engine — service shell doesn't block on GPU work; it issues async calls to Triton.

See patterns/go-native-ml-serving.

What workloads this pattern fits

The pattern is load-bearing for autoregressive workloads, but generalises to other GPU-heavy ML serving:

Workload                                 | TensorRT-LLM specific?                 | Pattern still applies?
Generative retrieval (Instacart 2026-06) | Yes: beam search + autoregressive      | Yes
LLM inference (chat, summarisation)      | Yes: main use case                     | Yes
Speculative decoding                     | Yes: TensorRT-LLM has built-in support | Yes
Embedding model serving                  | No: plain TensorRT often suffices      | Depends
Image generation / diffusion             | No: different cost profile             | Substitute a different inference engine
Classical ranker (DCN, Wide&Deep)        | No: CPU may suffice                    | Substitute a CPU stack

What this replaces

Legacy Python+CPU ML serving (the "legacy serving stack" Instacart explicitly names). The legacy stack's failure modes for autoregressive workloads:

  • Python interpreter overhead in the request-handling path dominates per-request latency.
  • GIL-bound concurrency caps throughput per process.
  • CPU inference for autoregressive decoding is orders of magnitude slower than GPU; beam search amplifies this.
  • No in-flight batching — each request blocks until complete.

The pattern's structural premise: for autoregressive workloads, the legacy stack is "not viable" (Instacart's word), period. The substrate change is necessary, not optional.
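The in-flight batching gap listed above can be illustrated with a small simulation that counts decode steps. This is a toy model, not Triton's or TensorRT-LLM's scheduler: the function names, the slot count, and the request lengths are all made up, and every decode step is assumed to cost the same regardless of batch occupancy.

```python
from collections import deque
from typing import Dict, List


def static_batching(lengths: List[int], slots: int) -> Dict[int, int]:
    """Fixed batches: the next batch starts only when the longest member finishes.

    Returns request index -> decode step at which that request completes.
    """
    done: Dict[int, int] = {}
    clock = 0
    for start in range(0, len(lengths), slots):
        batch = list(enumerate(lengths[start:start + slots], start=start))
        for idx, steps in batch:
            done[idx] = clock + steps
        clock += max(steps for _, steps in batch)
    return done


def in_flight_batching(lengths: List[int], slots: int) -> Dict[int, int]:
    """Freed slots are refilled immediately, mid-decode (lengths must be >= 1)."""
    waiting = deque(enumerate(lengths))
    active: Dict[int, int] = {}  # request index -> remaining decode steps
    done: Dict[int, int] = {}
    clock = 0
    while waiting or active:
        # Admit waiting requests into any free slot before the next step.
        while waiting and len(active) < slots:
            idx, steps = waiting.popleft()
            active[idx] = steps
        clock += 1  # one decode step for every active request
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                done[idx] = clock
                del active[idx]
    return done


lengths = [8, 2, 2, 2]  # one long request followed by three short ones
static_done = static_batching(lengths, slots=2)
in_flight_done = in_flight_batching(lengths, slots=2)
```

In this toy run the long request holds its slot for eight steps either way, but with static batching the last two short requests cannot start until step 8, whereas with in-flight batching they take over the slot freed by the first short request and finish at steps 4 and 6.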

Composing with the broader pattern

This serving stack is the serving-substrate ingredient of generative-over-scoring retrieval, alongside the vocabulary-substrate change (patterns/rq-vae-codebook-as-product-vocabulary) and the inference-paradigm change (concepts/beam-search-retrieval).

generative-over-scoring-retrieval
    ├─ vocabulary substrate: rq-vae-codebook-as-product-vocabulary
    ├─ inference paradigm:    beam-search-with-retailer-partitioned-mapping
    └─ serving substrate:     gpu-serving-stack-tensorrt-llm-triton  ← this pattern
                                  └─ go-native-ml-serving

Caveats

  • Not the cheapest serving substrate. GPUs are more expensive per request than CPUs. The pattern is justified by workloads where CPU is unviable at production latency — not workloads where CPU is merely slower.
  • GPU SKU selection matters. TensorRT-LLM optimises for specific NVIDIA architectures (Ampere / Hopper / Blackwell); older GPUs may not benefit fully.
  • Triton + TensorRT-LLM has a learning curve for teams used to pure Python serving. Custom backends, ensembles, and dynamic batching configuration are all non-trivial.
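  • Cold-start considerations: TensorRT-LLM compiled engines must be deserialised and loaded onto the GPU before they can serve traffic, and the first requests after a load can be slower than steady state; serving systems need warming strategies for new model versions.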
  • Multi-tenancy is not a default. Triton does not partition capacity across tenants out of the box, so teams hosting many models on one Triton instance need explicit configuration.
  • Instacart's deployment details are not disclosed. GPU SKU, cluster topology, request concurrency, beam width, and model parameter count are all unstated in the source.
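Since the source gives no deployment numbers, any capacity estimate has to start from assumed ones. The sketch below is a back-of-the-envelope calculation of KV-cache memory, one of the inputs to GPU SKU selection. Every number in it (layer count, KV heads, head dimension, sequence length, batch size, beam width) is hypothetical and not Instacart's. The formula also assumes each beam holds its own full cache, so it is an upper bound; an engine that shares prefix blocks across beams would use less.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    beam_width: int,
    bytes_per_value: int,
) -> int:
    """Upper-bound KV-cache size, assuming no cache sharing between beams."""
    # Each token stores one key and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size * beam_width


GIB = 1024 ** 3

# Hypothetical model and traffic shape, chosen only to give round numbers.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128,
             seq_len=512, batch_size=16, beam_width=4)

fp16_gib = kv_cache_bytes(**shape, bytes_per_value=2) / GIB  # 2-byte cache values
fp8_gib = kv_cache_bytes(**shape, bytes_per_value=1) / GIB   # 1-byte cache values
```

Under these assumptions the cache needs 4 GiB with 2-byte values and 2 GiB with 1-byte values, on top of the model weights. Beam width enters as a plain multiplier, which is why it matters for sizing as much as batch size does.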
