CONCEPT Cited by 1 source
Saturation point (inference latency vs token count)¶
Definition¶
The saturation point is the token-count threshold on a specific (model, inference engine, GPU) triple at which transformer inference latency transitions from the approximately-flat memory-bound regime (dominated by fixed per-request overheads) to the approximately-linear compute-bound regime (dominated by FLOPs). Below the saturation point, adding tokens to a batch adds very little latency; above it, latency scales roughly linearly with total token count (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
The flat-then-linear curve¶
Voyage AI's 2025-12-18 profiling of query inference on voyage-3 running on NVIDIA A100 reports "a clear pattern: latency is approximately flat up to a threshold (saturation point) and then becomes approximately linear." For voyage-3 + vLLM + A100 the threshold is ~600 tokens.
Two physical explanations for the shape:
- Below the saturation point, per-request fixed costs dominate: GPU kernel launches, scheduling overhead, memory movement, attention-mask setup, final pooling / normalisation for embeddings. These costs are nearly independent of sequence length. Adding tokens barely moves latency.
- Above the saturation point, the GPU's compute units are fully occupied and additional tokens each take a roughly equal slice of compute time. Latency starts scaling linearly with token count — the compute-bound regime.
The elbow between the two regimes is the saturation point.
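The flat-then-linear shape can be captured as a simple piecewise model. This is an illustrative sketch, not Voyage AI's measured curve: `base_ms` and `slope_ms_per_token` are hypothetical values, and the 600-token default echoes the voyage-3 + vLLM + A100 figure.

```python
# Hypothetical piecewise model of the flat-then-linear latency curve.
# base_ms and slope_ms_per_token are illustrative, not measured values.
def latency_ms(tokens: int, sat_point: int = 600,
               base_ms: float = 8.0, slope_ms_per_token: float = 0.02) -> float:
    """Approximate inference latency for a batch totalling `tokens` tokens."""
    if tokens <= sat_point:
        # Memory-bound regime: fixed per-request overheads dominate,
        # so latency is roughly independent of token count.
        return base_ms
    # Compute-bound regime: each extra token takes a roughly equal
    # slice of compute time, so latency grows linearly past the elbow.
    return base_ms + slope_ms_per_token * (tokens - sat_point)
```

Under this model, a 100-token and a 600-token batch cost the same, while a 1200-token batch pays for the 600 tokens past the elbow.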
Why it matters for batching¶
Setting the optimal batch size for token-count-based batching at the saturation point produces the best latency / throughput trade-off:
- Throughput — at the saturation point the GPU's compute is fully used; the batch produces maximum work per unit time. Going above adds linearly-scaling latency without any further throughput gain.
- Latency — at the saturation point the batch is still at the end of the flat region; incremental latency per request is near zero. Going below leaves MFU on the table without meaningfully improving latency.
- Predictability — batches near the saturation point have tight latency bounds regardless of sequence mix, so tail latency is stable.
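A minimal sketch of token-count-based batching under these assumptions: requests (represented here by their token lengths) are greedily packed into batches whose total token count stays at or below the saturation point. The function name and the greedy first-fit policy are illustrative, not Voyage AI's implementation.

```python
def pack_by_token_count(request_lengths: list[int],
                        max_tokens: int = 600) -> list[list[int]]:
    """Greedily pack requests into batches whose total token count stays
    at or below the saturation point (hypothetical helper; max_tokens
    echoes the illustrative voyage-3 + vLLM + A100 figure)."""
    batches: list[list[int]] = []
    current: list[int] = []
    total = 0
    for n in request_lengths:
        # Close the current batch if adding this request would push the
        # batch past the saturation point into the compute-bound regime.
        if current and total + n > max_tokens:
            batches.append(current)
            current, total = [], 0
        current.append(n)  # an over-long single request still gets its own batch
        total += n
    if current:
        batches.append(current)
    return batches
```

For example, `pack_by_token_count([100, 200, 350, 50, 600])` returns `[[100, 200], [350, 50], [600]]`: each batch sits at the end of the flat region, so per-request incremental latency stays near zero.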
Hardware + model + engine specificity¶
Voyage AI is explicit that the saturation point is specific to the (model, inference engine, GPU) triple, not a universal constant:
"The threshold (saturation point) depends on factors like the model architecture, inference engines, and GPU."
Each lever moves the elbow:
- Model architecture — more parameters, deeper transformer, larger embedding dimension, or different attention patterns change the FLOPs and memory traffic per token: more FLOPs per token saturates the same GPU at fewer tokens (elbow shifts left), while higher memory-bandwidth pressure extends the flat region (elbow shifts right).
- Inference engine — vLLM / SGLang / TGI / TensorRT-LLM / Triton all have different kernel fusions, scheduling costs, and per-request overheads; elbow shifts accordingly.
- GPU — each successive data-center generation (A100 → H100 → H200 → B200) raises compute and memory bandwidth; L4 / L40S / A10G sit at lower compute with different memory characteristics. Each shifts the elbow.
Practical consequence: saturation point must be re-profiled for each new (model, engine, GPU) triple. Voyage AI's 600-token number is illustrative, not transferable.
Relation to MFU and throughput¶
Voyage AI also profiles model FLOPs utilisation (MFU) and throughput vs token count. Both quantities scale approximately linearly with token count up to the saturation point; beyond it, MFU plateaus near 1.0 and throughput grows more slowly. Most query inferences live deep in the memory-bound zone, "far away from the saturation point." Combining short requests via batching moves them up the linear ramp toward the saturation point, recovering MFU and throughput that sequential serving left on the table.
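MFU is achieved model FLOPs per second divided by the hardware's peak FLOPs per second. The sketch below uses the standard ~2·N FLOPs-per-token forward-pass approximation for an N-parameter model (attention FLOPs ignored); the example parameter count and throughput are hypothetical, and 312 TFLOP/s is the A100's published BF16 peak.

```python
def mfu(tokens_per_s: float, params: float, peak_flops: float) -> float:
    """Model FLOPs utilisation: achieved model FLOPs/s over peak hardware
    FLOPs/s, using the ~2*params FLOPs-per-token forward-pass
    approximation (attention FLOPs ignored)."""
    return tokens_per_s * 2 * params / peak_flops

# Hypothetical example: a 1B-parameter embedding model serving
# 10k tokens/s on an A100 (~312 TFLOP/s BF16 peak).
print(mfu(1e4, 1e9, 312e12))  # ≈ 0.064 — deep in the memory-bound zone
```

Batching short requests raises `tokens_per_s` for the same hardware, which is exactly how the linear ramp toward the saturation point recovers MFU.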
How to find it empirically¶
Voyage AI's implied procedure: sweep synthetic super-sequences of lengths T ∈ {10, 50, 100, 200, 400, 600, 800, 1200, 1600, 2000}, measure end-to-end inference latency on the target model + engine + GPU, and plot latency(T). The elbow where the slope transitions from ~0 to its asymptotic linear slope is the saturation point. Repeat whenever any lever changes.
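The sweep can be sketched as follows. Everything here is an assumption-laden scaffold: `run_inference(T)` stands in for one forward pass over a synthetic T-token super-sequence on the target (model, engine, GPU) triple, and the crude threshold-based elbow detector is a placeholder for whatever change-point method you prefer.

```python
import time

def find_saturation_point(run_inference,
                          lengths=(10, 50, 100, 200, 400,
                                   600, 800, 1200, 1600, 2000),
                          reps=5):
    """Sweep synthetic sequence lengths, record median latency per length,
    and return the (length, latency) pairs plus a crude elbow estimate.

    `run_inference(T)` is assumed to run one forward pass on a synthetic
    T-token sequence on the target (model, engine, GPU) triple.
    """
    points = []
    for T in lengths:
        samples = []
        for _ in range(reps):
            t0 = time.perf_counter()
            run_inference(T)
            samples.append(time.perf_counter() - t0)
        points.append((T, sorted(samples)[reps // 2]))  # median latency
    # Crude elbow: first length whose latency exceeds the flat-region
    # baseline (latency at the shortest length) by, say, 50%.
    baseline = points[0][1]
    elbow = next((T for T, lat in points if lat > 1.5 * baseline), None)
    return points, elbow
```

In practice, fitting two line segments to latency(T) and taking their intersection gives a more robust elbow than a fixed threshold, but the structure of the sweep is the same. Re-run the whole procedure whenever the model, engine, or GPU changes.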
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki instance; voyage-3 + vLLM + A100 saturation point ~600 tokens; headline result of aligning optimal batch size with the saturation point (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).