NVIDIA Triton Inference Server¶

Definition¶

NVIDIA Triton Inference Server is an open-source ML serving runtime for production inference. Triton sits above inference backends (TensorRT, TensorRT-LLM, ONNX Runtime, PyTorch, TensorFlow, and custom Python backends) and exposes the models they run as gRPC and HTTP endpoints, adding batching, scheduling, ensembles, model versioning, and multi-model concurrency.
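To make the HTTP endpoint concrete, here is a minimal sketch of how a client might build an inference request in the KServe v2 JSON format that Triton's HTTP endpoint accepts. The model name `ads_retrieval_decoder`, the tensor name `input_ids`, the token values, and the `build_infer_request` helper are all hypothetical; port 8000 is Triton's default HTTP port. The request is built but not sent, so the sketch runs without a server.

```python
import json
from urllib.request import Request


def build_infer_request(base_url, model_name, inputs, model_version=None):
    """Build (but do not send) a KServe-v2-style inference request."""
    path = f"/v2/models/{model_name}"
    if model_version is not None:
        # Pinning a version is optional; without it the server picks one
        # according to the model's version policy.
        path += f"/versions/{model_version}"
    path += "/infer"

    body = {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
            for name, shape, datatype, data in inputs
        ]
    }
    return Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model and tensor names, for illustration only.
req = build_infer_request(
    "http://localhost:8000",
    "ads_retrieval_decoder",
    [("input_ids", [1, 4], "INT32", [101, 2054, 2003, 102])],
    model_version="1",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server. A production client would more likely use gRPC or an official Triton client library, neither of which this sketch covers.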

Why it shows up on the wiki¶

Instacart discloses Triton as the serving-runtime layer of its generative ads retrieval GPU stack:

"This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."

The full Instacart serving stack:

┌────────────────────────────────────────────────────┐
│  Go-native service shell (Griffin 2.0)              │
│  ├─ feature fetch / prompt assembly                 │
│  ├─ HTTP/gRPC client to Triton                      │
│  └─ retailer-partitioned index lookup post-decode   │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│  NVIDIA Triton Inference Server                     │
│  ├─ batching / scheduling / model versioning        │
│  └─ TensorRT-LLM backend (compiled decoder)         │
└─────────────────────────┬──────────────────────────┘
                          ▼
                  GPU (NVIDIA, SKU undisclosed)

The Go-native service shell handles incoming requests, feature fetching, prompt assembly, and the post-decode index lookup. Triton hosts the compiled TensorRT-LLM decoder, and the GPU runs the beam-search autoregressive decoding itself.
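The division of labor described above can be sketched as a small pipeline. This is an illustrative sketch only: the real shell is written in Go, and every name here (`AdsRequest`, `fetch_features`, `assemble_prompt`, `retrieve_ads`, `fake_decode`, `INDEX`), the prompt format, and the assumption that decoded strings map to ad IDs are hypothetical, since the source does not disclose those details. The Triton call is replaced by a stub that returns fixed beam-search candidates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AdsRequest:
    user_id: str
    retailer_id: str
    query: str


def fetch_features(req: AdsRequest) -> Dict[str, str]:
    # Hypothetical feature fetch; a real shell would call a feature store.
    return {"recent_category": "dairy alternatives"}


def assemble_prompt(req: AdsRequest, features: Dict[str, str]) -> str:
    # Hypothetical prompt format; the real one is not disclosed.
    return f"query: {req.query} | context: {features['recent_category']}"


def retrieve_ads(
    req: AdsRequest,
    decode: Callable[[str], List[str]],
    index: Dict[str, Dict[str, str]],
) -> List[str]:
    """Service-shell flow: features, prompt, decode, then index lookup."""
    features = fetch_features(req)
    prompt = assemble_prompt(req, features)
    # In the real stack this step is a call to Triton, which runs the
    # TensorRT-LLM decoder on the GPU.
    candidates = decode(prompt)
    # Post-decode lookup restricted to the requesting retailer's partition.
    partition = index.get(req.retailer_id, {})
    return [partition[c] for c in candidates if c in partition]


def fake_decode(prompt: str) -> List[str]:
    # Stand-in for the Triton call: fixed candidates instead of beam search.
    return ["oat milk", "almond milk", "soy milk"]


# Assumed index shape: retailer ID -> decoded string -> ad ID.
INDEX = {
    "retailer_a": {"oat milk": "ad_1", "almond milk": "ad_2"},
    "retailer_b": {"soy milk": "ad_9"},
}

result = retrieve_ads(AdsRequest("u1", "retailer_a", "milk"), fake_decode, INDEX)
print(result)
```

Passing the decoder in as a callable keeps the shell logic independent of the serving runtime, which mirrors the layering in the diagram: the shell only needs a client to Triton, not knowledge of the engine behind it.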

Caveats¶

This is a stub page that records Triton as a referenced ML serving runtime. The wiki has not yet ingested its full architecture in depth.
The specific Triton features Instacart uses (ensembles, business logic scripting, response cache) are not disclosed.
Multi-tenancy, capacity-allocation, and GPU-SKU details for Instacart's deployment are not disclosed.

Seen in¶

sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — runtime layer for Instacart's generative ads retrieval GPU stack.

Related¶

systems/tensorrt-llm — primary inference engine Triton hosts for LLM workloads.
systems/instacart-generative-ads-retrieval — production user.
systems/instacart-griffin-2 — Instacart's ML platform host.
patterns/gpu-serving-stack-tensorrt-llm-triton — the canonical pattern.
