NVIDIA Triton Inference Server
Definition
NVIDIA Triton Inference Server is an open-source ML serving runtime for production inference. Triton sits above inference backends (TensorRT, TensorRT-LLM, ONNX Runtime, PyTorch, TensorFlow, and custom Python backends) and exposes the models they run as gRPC / HTTP endpoints, adding batching, scheduling, ensembles, model versioning, and multi-model concurrency.
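To make the "HTTP endpoint" part of the definition concrete, here is a minimal sketch of how a client could build a request for Triton's HTTP API, which follows the KServe v2 inference protocol. The model name `ads_decoder`, the tensor name `text_input`, and the example prompts are assumptions made for illustration; real tensor names and shapes come from the served model's configuration, and nothing here reflects a disclosed Instacart setup.

```python
import json
from typing import List, Optional


def infer_url(base_url: str, model: str, version: Optional[str] = None) -> str:
    """Build the KServe v2 inference URL served by Triton's HTTP endpoint."""
    path = f"/v2/models/{model}"
    if version is not None:
        # Pin a specific model version; without it the server chooses
        # a version according to the model's version policy.
        path += f"/versions/{version}"
    return base_url.rstrip("/") + path + "/infer"


def build_infer_body(prompts: List[str], input_name: str = "text_input") -> str:
    """Serialize a batch of prompts as one BYTES tensor of shape [batch, 1]."""
    payload = {
        "inputs": [
            {
                "name": input_name,  # assumed name; defined by the model config
                "shape": [len(prompts), 1],
                "datatype": "BYTES",
                "data": prompts,
            }
        ]
    }
    return json.dumps(payload)


# Hypothetical model name and prompts; 8000 is Triton's default HTTP port.
url = infer_url("http://localhost:8000", "ads_decoder", version="1")
body = build_infer_body(["organic oat milk", "gluten free pasta"])
```

The resulting `url` and `body` would be sent as an HTTP POST with a JSON content type. Whether requests from different clients are merged into larger batches is decided server-side by the model's batching configuration, not by this payload.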
Why it shows up on the wiki
Triton is disclosed as the serving-runtime layer of Instacart's generative ads retrieval GPU stack:
"This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."
The full Instacart serving stack:
┌────────────────────────────────────────────────────┐
│  Go-native service shell (Griffin 2.0)             │
│  ├─ feature fetch / prompt assembly                │
│  ├─ HTTP/gRPC client to Triton                     │
│  └─ retailer-partitioned index lookup post-decode  │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│  NVIDIA Triton Inference Server                    │
│  ├─ batching / scheduling / model versioning       │
│  └─ TensorRT-LLM backend (compiled decoder)        │
└─────────────────────────┬──────────────────────────┘
                          ▼
            GPU (NVIDIA, SKU undisclosed)
The Go-native service shell handles incoming requests, feature fetching, prompt assembly, and the post-decode index lookup; Triton hosts the TensorRT-LLM compiled decoder; and the GPU executes the beam-search autoregressive decoding.
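The division of labour described above can be sketched as a small pipeline. This is an illustration only, written in Python for brevity even though the actual shell is Go-native. Every name here (`assemble_prompt`, `retrieve_ads`, `fake_decode`, the prompt format, the index layout) is hypothetical, and it assumes the decoder returns candidate strings that are then looked up in a per-retailer index. The source does not disclose these details.

```python
from typing import Callable, Dict, List

# Hypothetical index layout: retailer id -> generated text -> product ids.
Index = Dict[str, Dict[str, List[str]]]


def assemble_prompt(query: str, features: Dict[str, str]) -> str:
    """Shell-side step: fold fetched features into the prompt (invented format)."""
    context = ", ".join(f"{k}={v}" for k, v in sorted(features.items()))
    return f"[{context}] {query}"


def retrieve_ads(
    query: str,
    retailer_id: str,
    features: Dict[str, str],
    decode: Callable[[str], List[str]],
    index: Index,
) -> List[str]:
    """Prompt assembly -> decode (stand-in for the Triton call) -> index lookup."""
    prompt = assemble_prompt(query, features)
    candidates = decode(prompt)  # in production this would be the call to Triton
    partition = index.get(retailer_id, {})  # only this retailer's partition
    results: List[str] = []
    for text in candidates:
        for product_id in partition.get(text, []):
            if product_id not in results:  # de-duplicate, keep candidate order
                results.append(product_id)
    return results


def fake_decode(prompt: str) -> List[str]:
    """Stub decoder returning fixed candidates; a real one would run on the GPU."""
    return ["oat milk", "almond milk", "unknown item"]


index: Index = {
    "retailer_a": {"oat milk": ["p1", "p2"], "almond milk": ["p2", "p3"]},
    "retailer_b": {"oat milk": ["p9"]},
}
ads = retrieve_ads("milk alternatives", "retailer_a", {"locale": "en-US"}, fake_decode, index)
```

Passing the decoder in as a callable keeps the shell logic testable without a GPU. In this sketch, candidates that match nothing in the retailer's partition are dropped.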
Caveats
- This is a stub page capturing Triton as a referenced ML serving runtime. Its full architecture has not been deeply ingested on the wiki.
- The specific Triton features Instacart uses (ensembles, business logic scripting, response cache) are not disclosed.
- Multi-tenancy, capacity-allocation, and GPU-SKU details for Instacart's deployment are not disclosed.
Seen in
- sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart — runtime layer for Instacart's generative ads retrieval GPU stack.
Related
- systems/tensorrt-llm — primary inference engine Triton hosts for LLM workloads.
- systems/instacart-generative-ads-retrieval — production user.
- systems/instacart-griffin-2 — Instacart's ML platform host.
- patterns/gpu-serving-stack-tensorrt-llm-triton — the canonical pattern.