SYSTEM Cited by 1 source

NVIDIA Triton Inference Server

Definition

NVIDIA Triton Inference Server is an open-source ML serving runtime for production inference. Triton sits above inference backends (TensorRT, TensorRT-LLM, ONNX Runtime, PyTorch, TensorFlow, and custom Python backends) and exposes the models they run as gRPC / HTTP endpoints, adding batching, scheduling, ensembles, model versioning, and multi-model concurrency primitives.
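
As a rough illustration of how model versioning surfaces on the HTTP side, the sketch below builds inference paths in the form used by the KServe v2 inference protocol that Triton implements. The model name `ads_decoder` is a hypothetical placeholder, not something the source names. Omitting the version leaves the choice to the server's version policy for that model.

```go
package main

import (
	"fmt"
	"net/url"
)

// inferPath returns the HTTP path for an inference request under the
// KServe v2 protocol. An empty version omits the /versions/ segment,
// so the server picks a version according to the model's version policy.
func inferPath(model, version string) string {
	p := "/v2/models/" + url.PathEscape(model)
	if version != "" {
		p += "/versions/" + url.PathEscape(version)
	}
	return p + "/infer"
}

func main() {
	// "ads_decoder" is a hypothetical model name used only for illustration.
	fmt.Println(inferPath("ads_decoder", ""))  // /v2/models/ads_decoder/infer
	fmt.Println(inferPath("ads_decoder", "2")) // /v2/models/ads_decoder/versions/2/infer
}
```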

Why it shows up on the wiki

Instacart discloses Triton as the serving-runtime layer of its generative ads retrieval GPU stack:

"This new system leverages TensorRT-LLM for high-performance inference and is deployed on Nvidia's Triton Inference Server."

The full Instacart serving stack:

┌────────────────────────────────────────────────────┐
│  Go-native service shell (Griffin 2.0)              │
│  ├─ feature fetch / prompt assembly                 │
│  ├─ HTTP/gRPC client to Triton                      │
│  └─ retailer-partitioned index lookup post-decode   │
└─────────────────────────┬──────────────────────────┘
┌────────────────────────────────────────────────────┐
│  NVIDIA Triton Inference Server                     │
│  ├─ batching / scheduling / model versioning        │
│  └─ TensorRT-LLM backend (compiled decoder)         │
└─────────────────────────┬──────────────────────────┘
                  GPU (NVIDIA, SKU undisclosed)

The Go-native service shell handles incoming requests, feature fetching, prompt assembly, and post-decode index lookup; Triton hosts the TensorRT-LLM compiled decoder; the GPU runs the beam-search autoregressive decoding itself.
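
To make the shell-to-Triton hop concrete, here is a minimal Go sketch of how a client might prepare an HTTP inference request in the KServe v2 JSON format that Triton accepts. Everything specific in it is an assumption for illustration: the source does not say whether Instacart uses HTTP or gRPC, and the model name `ads_decoder`, the tensor name `input_ids`, the token IDs, and the base URL are hypothetical. A real TensorRT-LLM deployment would likely expect further input tensors than the single one shown here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// inferInput is one input tensor in a KServe v2 JSON inference request.
type inferInput struct {
	Name     string  `json:"name"`
	Shape    []int64 `json:"shape"`
	Datatype string  `json:"datatype"`
	Data     []int32 `json:"data"`
}

// inferRequest is the top-level request body.
type inferRequest struct {
	Inputs []inferInput `json:"inputs"`
}

// marshalInferBody encodes token IDs as a single [1, N] INT32 tensor.
// The tensor name "input_ids" is an assumption, not taken from the source.
func marshalInferBody(tokenIDs []int32) ([]byte, error) {
	return json.Marshal(inferRequest{Inputs: []inferInput{{
		Name:     "input_ids",
		Shape:    []int64{1, int64(len(tokenIDs))},
		Datatype: "INT32",
		Data:     tokenIDs,
	}}})
}

// newInferRequest builds (but does not send) the POST request.
// The model name is assumed to be URL-safe.
func newInferRequest(baseURL, model string, tokenIDs []int32) (*http.Request, error) {
	body, err := marshalInferBody(tokenIDs)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/v2/models/"+model+"/infer", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Port 8000 is Triton's default HTTP port; host and model are placeholders.
	req, err := newInferRequest("http://localhost:8000", "ads_decoder", []int32{101, 7, 42})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Keeping request construction separate from sending means the body and path can be checked without a running server; in a service, the prepared request would be passed to an `http.Client` with a timeout suited to the decoding latency budget.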

Caveats

  • This is a stub page that records Triton as a referenced ML serving runtime. The wiki has not yet covered its architecture in depth.
  • The specific Triton features Instacart uses (ensembles, business logic scripting, response cache) are not disclosed.
  • Multi-tenancy, capacity-allocation, and GPU-SKU details of Instacart's deployment are not disclosed.

Seen in
