PATTERN Cited by 1 source

GPU serving stack — TensorRT-LLM + Triton¶

Pattern¶

For ML workloads that combine autoregressive decoding, beam search, and production latency budgets (generative retrieval, LLM inference, sequence-to-sequence), build the serving stack as:

┌──────────────────────────────────────────────┐
│  Service shell (Go-native or equivalent)      │
│  ├─ feature fetch                             │
│  ├─ prompt assembly                           │
│  ├─ response post-processing                  │
│  └─ HTTP/gRPC client to Triton                │
└─────────────────────┬────────────────────────┘
                      ▼
┌──────────────────────────────────────────────┐
│  NVIDIA Triton Inference Server               │
│  ├─ in-flight batching                        │
│  ├─ scheduling / model versioning             │
│  └─ TensorRT-LLM backend                      │
└─────────────────────┬────────────────────────┘
                      ▼
┌──────────────────────────────────────────────┐
│  TensorRT-LLM compiled model                  │
│  ├─ KV-cache management                       │
│  ├─ beam-search support                       │
│  ├─ FP8 / INT4 quantisation                   │
│  └─ tensor / pipeline parallelism             │
└─────────────────────┬────────────────────────┘
                      ▼
                  GPU(s)

Quote (Source: sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart):

"As autoregressive decoding with beam search is fairly compute intensive, it was not viable to serve this model the legacy serving stack that relied on Python and CPU inference. To unblock this model serving, the team developed a brand new GPU serving stack. This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server. … Implemented as a Go-native service, it delivers higher throughput and lower latency compared to the legacy Python environment."

Why each layer¶

TensorRT-LLM¶

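NVIDIA's high-performance LLM inference library, which compiles a model into an optimised engine and supplies the decoding runtime around it. Provides the primitives autoregressive decoding workloads need that plain TensorRT lacks out of the box (other engines such as vLLM cover some of the same ground):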

In-flight batching — new requests join the batch mid-decode, avoiding head-of-line blocking from longer-sequence requests.
KV-cache management — paged KV cache with block reuse across requests and configurable eviction, so memory is allocated per block rather than per maximum sequence length.
Beam-search support — first-class primitive (Instacart's generative retrieval uses beam search at serve time).
Quantisation paths — FP8, INT4, SmoothQuant pre-built workflows.
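To make the beam-search primitive concrete, the following is a minimal, framework-free sketch in Python. It is not TensorRT-LLM's implementation: `beam_search`, `toy_model`, and the probabilities are hypothetical, and the callable stands in for one decoder forward pass that would run on the GPU in a real deployment.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]
# Hypothetical stand-in for one decoder forward pass:
# maps a token prefix to next-token probabilities.
NextTokenFn = Callable[[Sequence], Dict[str, float]]


def beam_search(
    next_token_probs: NextTokenFn,
    beam_width: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_width sequences ranked by total log-probability."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_steps):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, prob in next_token_probs(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses still open at the step limit
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Illustrative distribution where the greedy first choice is not the best."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"x": 0.5, "y": 0.5}
    if prefix[-1] == "b":
        return {"x": 0.9, "y": 0.1}
    return {"<eos>": 1.0}


greedy = beam_search(toy_model, beam_width=1, max_steps=3)
wide = beam_search(toy_model, beam_width=2, max_steps=3)
```

With `beam_width=1` the search commits to the locally best first token; with `beam_width=2` it keeps the alternative alive and finds a higher-probability sequence. The cost is the reason the serving stack matters: each step runs one forward pass per live hypothesis, so work per request grows with both beam width and sequence length.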

Triton Inference Server¶

The serving runtime that sits above TensorRT-LLM. Provides:

Multi-framework support — TensorRT, TensorRT-LLM, ONNX, PyTorch, custom Python — same API.
Batching / scheduling — combines requests across clients into GPU-efficient batches.
Model versioning — A/B test new model versions, blue-green deploy without downtime.
gRPC / HTTP endpoints — standard request shapes for service shells to consume.
Ensembles — chain multiple models in one request (e.g. embedding → retrieval → re-ranking).
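As a rough illustration of the "standard request shapes" a service shell consumes, the sketch below builds the JSON body for Triton's HTTP inference endpoint, assuming the v2 (KServe-style) inference protocol. The model name `ads_retrieval` and the tensor names `input_ids`, `beam_width`, `max_new_tokens`, and `output_ids` are hypothetical; real names and datatypes come from the deployed model's configuration. It is written in Python for brevity and sends nothing over the network.

```python
import json
from typing import Any, Dict, List, Optional


def infer_path(model_name: str, version: Optional[str] = None) -> str:
    """URL path for an inference call; the version segment is optional."""
    if version:
        return f"/v2/models/{model_name}/versions/{version}/infer"
    return f"/v2/models/{model_name}/infer"


def build_infer_request(
    token_ids: List[int], beam_width: int, max_new_tokens: int
) -> Dict[str, Any]:
    """Assemble the request body: one entry per named input tensor."""
    return {
        "inputs": [
            {   # hypothetical tensor name; batch of one prompt
                "name": "input_ids",
                "shape": [1, len(token_ids)],
                "datatype": "INT32",
                "data": token_ids,
            },
            {   # hypothetical decoding parameters passed as tensors
                "name": "beam_width",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [beam_width],
            },
            {
                "name": "max_new_tokens",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [max_new_tokens],
            },
        ],
        # Ask only for the outputs the shell will post-process.
        "outputs": [{"name": "output_ids"}],
    }


def encode_body(request: Dict[str, Any]) -> bytes:
    """Serialise the request for an HTTP POST."""
    return json.dumps(request).encode("utf-8")


path = infer_path("ads_retrieval")
body = encode_body(build_infer_request([101, 7, 42], beam_width=4, max_new_tokens=8))
```

Because the request shape is the same whichever backend serves the model, the shell can stay unchanged when the model behind a given name is recompiled or moved to a new version.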

Go-native service shell¶

Replaces a Python+CPU shell with a Go-native one. Wins:

Higher throughput — Go's goroutine-per-request model handles high-concurrency request shapes more efficiently than Python's GIL-bound threads or single-threaded async event loops.
Lower latency — no Python interpreter overhead in the request-handling path.
Better isolation from the inference engine — the service shell doesn't block on GPU work; it issues async calls to Triton.

See patterns/go-native-ml-serving.

What workloads this pattern fits¶

The pattern is load-bearing for autoregressive workloads, but generalises to other GPU-heavy ML serving:

Workload	TensorRT-LLM specific?	Pattern still applies
Generative retrieval (Instacart 2026-06)	Yes — beam search + autoregressive	✅
LLM inference (chat, summarisation)	Yes — main use case	✅
Speculative decoding	Yes — TensorRT-LLM has built-in support	✅
Embedding model serving	No — plain TensorRT often suffices	Depends
Image generation / diffusion	No — different cost profile	Substitute different inference engine
Classical ranker (DCN, Wide&Deep)	No — CPU may suffice	Substitute CPU stack

What this replaces¶

Legacy Python+CPU ML serving (the "legacy serving stack" Instacart explicitly names). The legacy stack's failure modes for autoregressive workloads:

Python interpreter overhead in the request-handling path dominates per-request latency.
GIL-bound concurrency caps throughput per process.
CPU inference for autoregressive decoding is orders of magnitude slower than GPU; beam search amplifies this.
No in-flight batching — each request blocks until complete.
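The last failure mode can be illustrated with a toy scheduler simulation in Python. It is a simplified model, not Triton's or TensorRT-LLM's scheduler: it assumes every active sequence decodes one token per step, that a static batch releases its responses only when its longest member finishes, and that each request needs at least one token. The function names and the request lengths are hypothetical.

```python
from collections import deque
from typing import Dict, List


def static_batching(lengths: List[int], slots: int) -> Dict[int, int]:
    """Finish step per request when each batch runs until its longest member ends."""
    finish: Dict[int, int] = {}
    step = 0
    for start in range(0, len(lengths), slots):
        batch = range(start, min(start + slots, len(lengths)))
        # The whole batch occupies the GPU until the longest sequence is done.
        step += max(lengths[i] for i in batch)
        for i in batch:
            finish[i] = step
    return finish


def in_flight_batching(lengths: List[int], slots: int) -> Dict[int, int]:
    """Finish step per request when freed slots are refilled between decode steps."""
    finish: Dict[int, int] = {}
    queue = deque(range(len(lengths)))
    active: Dict[int, int] = {}  # request id -> tokens still to decode
    step = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < slots:
            rid = queue.popleft()
            active[rid] = lengths[rid]
        step += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finish[rid] = step  # slot is free from the next step on
                del active[rid]
    return finish


# One long request followed by three short ones, two batch slots.
decode_lengths = [8, 2, 2, 2]
static_result = static_batching(decode_lengths, slots=2)
in_flight_result = in_flight_batching(decode_lengths, slots=2)
```

In this example the short requests finish at steps 2, 4, and 6 under in-flight batching, while static batching holds them until steps 8 and 10 behind the long request. The long request itself finishes at step 8 either way.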

The pattern's structural premise: for autoregressive workloads, the legacy stack is "not viable" (Instacart's word), period. The substrate change is necessary, not optional.

Composing with the broader pattern¶

This serving stack is the serving-substrate ingredient of generative-over-scoring retrieval, alongside the vocabulary-substrate change (patterns/rq-vae-codebook-as-product-vocabulary) and the inference-paradigm change (concepts/beam-search-retrieval).

generative-over-scoring-retrieval
    ├─ vocabulary substrate: rq-vae-codebook-as-product-vocabulary
    ├─ inference paradigm:    beam-search-with-retailer-partitioned-mapping
    └─ serving substrate:     gpu-serving-stack-tensorrt-llm-triton  ← this pattern
                                  └─ go-native-ml-serving

Caveats¶

Not the cheapest serving substrate. GPUs are more expensive per request than CPUs. The pattern is justified by workloads where CPU is unviable at production latency — not workloads where CPU is merely slower.
GPU SKU selection matters. TensorRT-LLM optimises for specific NVIDIA architectures (Ampere / Hopper / Blackwell); older GPUs may not benefit fully, and FP8 in particular needs Ada- or Hopper-class hardware or newer.
Triton + TensorRT-LLM has a learning curve for teams used to pure Python serving. Custom backends, ensembles, and dynamic-batching configuration are non-trivial.
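Cold-start considerations — TensorRT-LLM engines are built for a specific GPU architecture and must be loaded into GPU memory before they can serve, and the first requests against a freshly loaded engine can be slower than steady state; serving systems need warming strategies for new model versions.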
Multi-tenancy is not a default. Triton does not allocate capacity across tenants out of the box; teams hosting many models on one Triton instance need explicit configuration.
Specific Instacart deployment details are not disclosed: GPU SKU, cluster topology, request concurrency, beam width, and model parameter count.

Seen in¶

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — Instacart Generative Ads Retrieval GPU stack; first wiki canonical disclosure with operational outcome (10-17% mean latency reduction despite 2× candidate volume).

systems/tensorrt-llm / systems/nvidia-triton-inference-server — substrate components.
systems/instacart-generative-ads-retrieval / systems/instacart-griffin-2 — production instance + ML platform host.
patterns/generative-over-scoring-retrieval — the broader pattern this serves.
patterns/go-native-ml-serving — the service-shell ingredient.