CONCEPT Cited by 1 source

GPU kernel utilization¶

Definition¶

GPU kernel utilization is the fraction of wall-clock time during which the GPU is executing at least one compute kernel rather than sitting idle, usually reported as a percentage. Concretely, in a profiled trace of an inference run, kernel utilisation is:

kernel_utilization = (time_with_kernels_running)
                   / (wall_clock_time)

It is the right saturation metric for chunked inference pipelines because it captures every source of GPU idle time (waiting on D2H transfers, host I/O, or the main thread launching the next batch), none of which shows up in a metric that only asks whether the kernels themselves are FLOP-saturated.
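As a minimal sketch of the ratio above, assume a profiler trace has already been reduced to one `(start, end)` timestamp pair per kernel. The `kernel_utilization` helper and the interval format are illustrative and not part of any profiler's API. Overlapping kernels (for example on different streams) are merged so that concurrent execution is not counted twice.

```python
def kernel_utilization(kernel_intervals, wall_start, wall_end):
    """Fraction of [wall_start, wall_end] during which at least one kernel runs.

    kernel_intervals is a hypothetical list of (start, end) timestamps in any
    consistent time unit, e.g. microseconds taken from a profiler trace.
    """
    if wall_end <= wall_start:
        raise ValueError("wall-clock window must have positive length")

    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(kernel_intervals):
        # Clip each kernel to the measurement window.
        start, end = max(start, wall_start), min(end, wall_end)
        if end <= start:
            continue
        if cur_end is None or start > cur_end:
            # Gap before this kernel: close the previous busy span.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current busy span: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (wall_end - wall_start)


# Two overlapping kernels (0-40, 30-50) and one later kernel (60-92)
# in a 100-unit window give 82 busy units.
print(kernel_utilization([(0, 40), (30, 50), (60, 92)], 0, 100))  # 0.82
```

The gaps between merged spans are exactly the idle time the metric is meant to expose, whatever their cause on the host side.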

Why kernel utilisation is distinct from MFU¶

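MFU (Model FLOPs Utilization), in the kernel-level sense used on this page, asks: "when the GPU is computing, how close is it to peak FLOPs?" (MFU measured over end-to-end wall-clock time would also absorb idle time, so it could never exceed kernel utilisation.)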

GPU kernel utilisation asks: "how much of wall-clock is the GPU even computing?"

Both can approach 99% in a perfectly tuned system. They diverge when:

Kernel utilisation = 99%, MFU = 30%. The GPU is always busy, but the kernels themselves are memory-bound, attention-bandwidth-bound, or have low arithmetic intensity. Compute is happening, but not efficiently.
Kernel utilisation = 82%, MFU = 95%. When kernels run, they saturate the FLOPs, but the GPU is idle 18% of wall-clock time because of D2H stalls, host-I/O serialisation, or main-thread bottlenecks. This is the failure mode that patterns/asynchronous-frame-generation-pipeline addresses.

The two metrics measure orthogonal failure modes: kernel utilisation is about scheduling/overlap, MFU is about kernel-level compute efficiency. Both must be high for the inference pipeline to be efficient.
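One way to see why both must be high: if MFU is measured only over the time kernels are running, as this page frames it, then the fraction of peak FLOPs delivered over the whole wall-clock window is the product of the two metrics. The sketch below works through the two divergent cases listed above; the function name is illustrative.

```python
def end_to_end_flop_fraction(kernel_utilization, kernel_level_mfu):
    """Fraction of peak FLOPs delivered over the full wall-clock window.

    Assumes kernel_level_mfu is measured only while kernels are running,
    so idle time enters solely through kernel_utilization.
    """
    return kernel_utilization * kernel_level_mfu


cases = [
    ("always busy, inefficient kernels", 0.99, 0.30),
    ("efficient kernels, idle gaps", 0.82, 0.95),
]
for label, ku, mfu in cases:
    print(f"{label}: {end_to_end_flop_fraction(ku, mfu):.1%}")
# always busy, inefficient kernels: 29.7%
# efficient kernels, idle gaps: 77.9%
```

Neither metric alone predicts the product, which is why a pipeline can look healthy on one and still waste much of the hardware.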

Wiki-attested datapoint — Synthesia VAE decoder on G7e¶

Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.

On a g7e.2xlarge instance running the Hugging Face Diffusers Wan 2.2 14B VAE decoder:

Pipeline	GPU kernel utilization (steady state, two consecutive chunks)
Synchronous frame generation	82%
Asynchronous frame generation	99.9%

The 82% baseline is not because the kernels themselves are inefficient — it's because the GPU sits idle between chunks while the CPU thread drains decoded frames to disk over a synchronous D2H copy + file write. The 99.9% post-optimisation figure comes from overlapping those drains with the next chunk's compute via dual CUDA streams + pinned host buffers + a dedicated worker thread + double buffering.
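The effect of overlapping can be illustrated with a toy timeline model. This is a sketch under stated assumptions, not a reconstruction of the measured run: the source does not give per-chunk timings, so the 82 ms of compute and 18 ms of drain per chunk below are made-up values chosen only to reproduce an 82% baseline, and the model assumes exactly two host buffers.

```python
def pipeline_utilization(compute_ms, drain_ms, n_chunks, overlapped):
    """GPU-busy fraction of a chunked pipeline under a simple timeline model.

    compute_ms: GPU time to decode one chunk.
    drain_ms:   time to copy one chunk to the host and write it to disk.
    """
    if overlapped:
        # Double-buffered: draining chunk N runs alongside computing chunk N+1,
        # so each steady-state step costs max(compute, drain). Only the first
        # compute and the last drain are not hidden.
        wall = compute_ms + (n_chunks - 1) * max(compute_ms, drain_ms) + drain_ms
    else:
        # Synchronous: the GPU idles during every drain.
        wall = n_chunks * (compute_ms + drain_ms)
    return n_chunks * compute_ms / wall


print(f"{pipeline_utilization(82, 18, 1000, overlapped=False):.1%}")  # 82.0%
print(f"{pipeline_utilization(82, 18, 1000, overlapped=True):.1%}")   # 100.0%
```

In this model the overlapped pipeline stays compute-bound as long as the drain is shorter than the compute step; if the drain were the longer of the two, utilisation would be capped at `compute_ms / drain_ms` regardless of overlap.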

Translation to user-visible metrics: an 8.2% decode-latency reduction and a Real Time Factor improvement from 3.21 to 2.95.

Where the 18% of idle GPU time goes (synchronous baseline)¶

Profiling the synchronous Wan 2.2 14B VAE decoder run shows:

CPU thread writing chunk N to disk: this is the dominant cause of idle GPU time. While the main Python thread is busy with file I/O, it is not launching CUDA kernels for chunk N+1.
D2H copy itself: even at full PCIe bandwidth, copying a full chunk's worth of decoded pixels takes non-trivial time, and a synchronous copy blocks the host thread for its duration.
Single CUDA stream serialisation: by default, PyTorch schedules everything on one CUDA stream per device, so D2H copies serialise behind compute kernels.

Removing these one at a time corresponds to:

Worker thread → main thread free to launch kernels.
Pinned memory → D2H runs as a direct DMA transfer on the GPU's copy engine, without bouncing through a staging buffer, and can be issued asynchronously.
Dual CUDA streams → compute and copy run concurrently on the GPU's separate hardware engines.
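The three fixes above can be combined roughly as follows. This is a hedged PyTorch sketch, not the implementation from the source: `decode_chunks_async`, `decode_fn`, and `write_fn` are hypothetical names, a single GPU is assumed, and error handling in the worker thread is omitted. The PyTorch import is guarded only so the file can be imported on machines without PyTorch.

```python
import queue
import threading

try:
    import torch
except ImportError:  # allows importing this sketch without PyTorch installed
    torch = None


def decode_chunks_async(decode_fn, latent_chunks, write_fn):
    """Overlap each chunk's D2H copy and disk write with the next chunk's compute.

    decode_fn(latent) -> CUDA tensor and write_fn(cpu_tensor) -> None are
    hypothetical caller-supplied callables. write_fn must finish with the
    tensor before returning, because the pinned buffer is reused.
    """
    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()      # second stream, used only for D2H copies
    pinned = [None, None]                  # double buffer in pinned host memory
    free_slots = queue.Queue()
    for slot in range(len(pinned)):
        free_slots.put(slot)
    pending = queue.Queue()

    def drain():
        # Worker thread: waits for each copy to land, then does the file I/O.
        while True:
            item = pending.get()
            if item is None:
                return
            slot, copy_done = item
            copy_done.synchronize()
            write_fn(pinned[slot])
            free_slots.put(slot)

    worker = threading.Thread(target=drain)
    worker.start()
    try:
        for latent in latent_chunks:
            frames = decode_fn(latent)     # kernels queued on the compute stream
            slot = free_slots.get()        # blocks only if both buffers are still draining
            buf = pinned[slot]
            if buf is None or buf.shape != frames.shape or buf.dtype != frames.dtype:
                buf = torch.empty(frames.shape, dtype=frames.dtype, pin_memory=True)
                pinned[slot] = buf
            copy_stream.wait_stream(compute_stream)   # copy starts after the decode finishes
            with torch.cuda.stream(copy_stream):
                buf.copy_(frames, non_blocking=True)  # async D2H into pinned memory
            frames.record_stream(copy_stream)  # keep frames' memory alive until the copy ends
            copy_done = torch.cuda.Event()
            copy_done.record(copy_stream)
            pending.put((slot, copy_done))
    finally:
        pending.put(None)                  # tell the worker to stop
        worker.join()
```

The main thread never waits on a copy or a file write; it only blocks when both pinned buffers are still being drained, which is the back-pressure that double buffering provides.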

Generalisation¶

The kernel-utilisation metric is the right diagnostic anywhere chunk-by-chunk inference produces output that has to leave the GPU between chunks:

Latent-diffusion video (the wiki-attested case).
Image generation at batch granularity (decode batch N's pixels while batch N+1 runs).
Streaming generative audio at audio-frame granularity.
LLM batch serving at the post-processing boundary (see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-serving canonicalisation).

For workloads where the kernel-utilisation gap is small (single-shot inference with no per-chunk D2H), kernel utilisation sits near 100% and ceases to be informative; MFU is then the metric that separates efficient runs from inefficient ones.

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation as a distinct named metric. 82% → 99.9% via the Asynchronous Frame Generation Pipeline on the unoptimised Wan 2.2 14B Diffusers VAE decoder.

Related¶

concepts/model-flops-utilization / concepts/mfu-model-flops-utilization — orthogonal metric about kernel-level FLOP efficiency.
concepts/cuda-throughput-budget — per-workload throughput profile (QPS per GPU) — depends on both kernel utilisation and MFU being high.
concepts/cpu-utilization-vs-saturation — the broader utilisation-vs-saturation framing this concept inherits from.
concepts/device-to-host-transfer — the dominant cause of the kernel-utilisation gap for chunked inference.
concepts/cuda-stream — primitive used to close the gap.
concepts/async-cpu-gpu-pipelined-scheduling — same shape at LLM-serving altitude.
patterns/asynchronous-frame-generation-pipeline — the pattern that lifts kernel utilisation from 82% to 99.9%.