CONCEPT Cited by 1 source
GPU kernel utilization¶
Definition¶
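GPU kernel utilization is the percentage of wall-clock time during which the GPU is actively executing compute kernels rather than sitting idle. Concretely, in a profiled trace of an inference run, kernel utilisation is the time during which at least one kernel is executing, divided by the wall-clock duration of the traced window.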
It is the right saturation metric for chunked inference pipelines because it captures all sources of GPU idle — including time spent waiting on D2H transfers, host I/O, or the main thread launching the next batch — not just whether the kernels themselves are FLOP-saturated.
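As a minimal sketch of the definition, assuming a profiler trace has already been reduced to `(start, end)` kernel intervals in seconds (a hypothetical format, not any particular profiler's export), the metric can be computed by merging overlapping intervals so that concurrent kernels are not double counted:

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; hypothetical trace format


def kernel_utilization(kernels: Iterable[Interval], window: Interval) -> float:
    """Fraction of `window` during which at least one kernel is executing."""
    w_start, w_end = window
    if w_end <= w_start:
        raise ValueError("window must have positive duration")

    # Clip each kernel to the window and drop kernels entirely outside it.
    clipped = sorted(
        (max(s, w_start), min(e, w_end))
        for s, e in kernels
        if e > w_start and s < w_end
    )

    busy = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap before this kernel: close the previous busy span.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping (concurrent) kernels extend the current busy span.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start

    return busy / (w_end - w_start)
```

For example, kernels at `(0, 4)`, `(3, 6)` and `(8, 10)` in a window of `(0, 10)` give 8 busy seconds out of 10, so the function returns 0.8. The two-second gap between 6 and 8 is exactly the kind of idle time the metric is meant to expose.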
Why kernel utilisation is distinct from MFU¶
MFU (Model FLOPs Utilization) asks: "when the GPU is computing, how close is it to peak FLOPs?"
GPU kernel utilisation asks: "how much of wall-clock is the GPU even computing?"
Both can be 99% in a perfectly-tuned system. They diverge when:
- Kernel utilisation = 99%, MFU = 30%. GPU is always busy but the kernels themselves are memory-bound, attention-bandwidth-bound, or have low arithmetic intensity. Compute is happening but not optimally.
- Kernel utilisation = 82%, MFU = 95%. When kernels run, they saturate the FLOPs — but the GPU is idle 18% of wall-clock because of D2H stalls, host-I/O serialisation, or main-thread bottlenecks. This is the failure mode the patterns/asynchronous-frame-generation-pipeline addresses.
The two metrics measure orthogonal failure modes: kernel utilisation is about scheduling/overlap, MFU is about kernel-level compute efficiency. Both must be high for the inference pipeline to be efficient.
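One way to see why both must be high: if MFU is read literally as "efficiency while kernels are running", as defined above, then the fraction of peak FLOPs delivered over wall-clock is roughly the product of the two. This is an illustrative simplification rather than a formula from the source, and the function name is hypothetical:

```python
def effective_peak_fraction(kernel_utilisation: float, mfu_while_busy: float) -> float:
    """Approximate fraction of peak FLOPs delivered over wall-clock time.

    Assumes MFU is measured only over the time kernels are executing,
    so the two factors are independent and simply multiply.
    """
    for name, value in (("kernel_utilisation", kernel_utilisation),
                        ("mfu_while_busy", mfu_while_busy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return kernel_utilisation * mfu_while_busy


# The two divergent cases from the list above.
always_busy_but_inefficient = effective_peak_fraction(0.99, 0.30)
efficient_but_often_idle = effective_peak_fraction(0.82, 0.95)
```

Under this reading the first case delivers about 0.30 of peak and the second about 0.78, so neither metric alone reveals how much of the hardware is actually being used.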
Wiki-attested datapoint — Synthesia VAE decoder on G7e¶
On a g7e.2xlarge instance running the Hugging Face Diffusers Wan 2.2 14B VAE decoder:
| Pipeline | GPU kernel utilization (steady state, two consecutive chunks) |
|---|---|
| Synchronous frame generation | 82% |
| Asynchronous frame generation | 99.9% |
The 82% baseline is not because the kernels themselves are inefficient — it's because the GPU sits idle between chunks while the CPU thread drains decoded frames to disk over a synchronous D2H copy + file write. The 99.9% post-optimisation figure comes from overlapping those drains with the next chunk's compute via dual CUDA streams + pinned host buffers + a dedicated worker thread + double buffering.
Translation to user-visible metrics: 8.2% decode-latency reduction and a Real Time Factor improvement from 3.21 → 2.95.
Where the 18% of idle GPU time goes (synchronous baseline)¶
Profiling the synchronous Wan 2.2 14B VAE decoder run shows:
- CPU thread writing chunk N to disk: this is the dominant cause of idle GPU time. While the main Python thread is busy with file I/O, it is not launching CUDA kernels for chunk N+1.
- D2H copy itself: even at full PCIe bandwidth, copying a full chunk's worth of pixels takes a non-trivial amount of time.
- Single CUDA stream serialisation: by default, PyTorch schedules everything on one CUDA stream per device, so D2H copies serialise behind compute kernels.
Removing these one at a time corresponds to:
- Worker thread → main thread free to launch kernels.
- Pinned memory → D2H runs as an asynchronous DMA on the GPU's copy engine, without the driver bouncing the data through an intermediate staging buffer.
- Dual CUDA streams → compute and copy run concurrently on the GPU's separate hardware engines.
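The effect of overlapping drains with compute can be illustrated with a toy timing model. This is not CUDA code and not the source's implementation: it treats each chunk as a fixed compute time followed by a fixed drain time (D2H copy plus file write), and the function name and the 0.82 s / 0.18 s figures are hypothetical values chosen only to reproduce an 82% baseline.

```python
from typing import Tuple


def simulate_pipeline(compute_s: float, drain_s: float,
                      chunks: int, host_buffers: int) -> Tuple[float, float]:
    """Toy timeline: returns (wall_clock_s, kernel_utilisation).

    host_buffers=1 models the synchronous baseline (the GPU must wait for
    the previous chunk's drain); host_buffers=2 models double buffering.
    """
    if chunks < 1 or host_buffers < 1:
        raise ValueError("chunks and host_buffers must be >= 1")

    gpu_free = 0.0      # when the GPU can start the next chunk's kernels
    worker_free = 0.0   # when the single drain worker is next available
    buffer_free = [0.0] * host_buffers  # when each host buffer is reusable
    busy = 0.0

    for i in range(chunks):
        slot = i % host_buffers
        # Compute needs the GPU and a free host buffer to drain into afterwards.
        start = max(gpu_free, buffer_free[slot])
        gpu_free = start + compute_s
        busy += compute_s
        # The drain starts once compute is done and the worker has finished
        # the previous chunk's drain.
        drain_end = max(gpu_free, worker_free) + drain_s
        worker_free = drain_end
        buffer_free[slot] = drain_end

    wall_clock = worker_free  # the run ends when the last chunk is on disk
    return wall_clock, busy / wall_clock


sync_wall, sync_util = simulate_pipeline(0.82, 0.18, chunks=100, host_buffers=1)
async_wall, async_util = simulate_pipeline(0.82, 0.18, chunks=100, host_buffers=2)
```

With these illustrative numbers the single-buffer run sits at about 82% utilisation, while the double-buffered run exceeds 99% because only the final drain is left exposed. The model also shows the limit of the technique: if the drain takes longer than the compute, the worker becomes the bottleneck and utilisation drops again even with two buffers.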
Generalisation¶
The kernel-utilisation metric is the right diagnostic anywhere chunk-by-chunk inference produces output that has to leave the GPU between chunks:
- Latent-diffusion video (the wiki-attested case).
- Image generation at batch granularity (decode batch N's pixels while batch N+1 runs).
- Streaming generative audio at audio-frame granularity.
- LLM batch serving at the post-processing boundary; see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-serving canonicalisation.
For workloads where the kernel-utilisation gap is small (single-shot inference with no per-chunk D2H), the metric collapses toward MFU and ceases to be informative.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation as a distinct named metric. 82% → 99.9% via the Asynchronous Frame Generation Pipeline on the unoptimised Wan 2.2 14B Diffusers VAE decoder.
Related¶
- concepts/model-flops-utilization / concepts/mfu-model-flops-utilization — orthogonal metric about kernel-level FLOP efficiency.
- concepts/cuda-throughput-budget — per-workload throughput profile (QPS per GPU) — depends on both kernel utilisation and MFU being high.
- concepts/cpu-utilization-vs-saturation — the broader utilisation-vs-saturation framing this concept inherits from.
- concepts/device-to-host-transfer — the dominant cause of the kernel-utilisation gap for chunked inference.
- concepts/cuda-stream — primitive used to close the gap.
- concepts/async-cpu-gpu-pipelined-scheduling — same shape at LLM-serving altitude.
- patterns/asynchronous-frame-generation-pipeline — the pattern that lifts kernel utilisation from 82% to 99.9%.