CONCEPT Cited by 1 source

GPU kernel utilization

Definition

GPU kernel utilization is the percentage of wall-clock time during which the GPU is actively executing compute kernels rather than sitting idle. Concretely, in a profiled trace of an inference run, kernel utilisation is:

kernel_utilization = time_with_kernels_running / wall_clock_time
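
As a minimal sketch of how this ratio could be computed from a trace, assume the profiler export has already been reduced to a list of (start, end) kernel intervals. The function names `busy_time` and `kernel_utilization` and the sample intervals are illustrative, not taken from the source. Overlapping kernels (for example on concurrent streams) are merged so that busy time is never double-counted.

```python
from typing import Iterable, List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in any consistent time unit


def busy_time(kernels: Iterable[Interval]) -> float:
    """Total time covered by at least one kernel (union of the intervals)."""
    total = 0.0
    cur_start: Optional[float] = None
    cur_end: Optional[float] = None
    for start, end in sorted(kernels):
        if cur_end is None or start > cur_end:
            # Gap before this kernel: close the previous busy span, if any.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current busy span: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def kernel_utilization(
    kernels: Iterable[Interval], window_start: float, window_end: float
) -> float:
    """time_with_kernels_running / wall_clock_time over the given window."""
    wall = window_end - window_start
    if wall <= 0:
        raise ValueError("window must have positive duration")
    # Clip each kernel to the window and drop kernels entirely outside it.
    clipped: List[Interval] = [
        (max(s, window_start), min(e, window_end))
        for s, e in kernels
        if e > window_start and s < window_end
    ]
    return busy_time(clipped) / wall


# Hypothetical trace: two overlapping kernels, an idle gap, then one more kernel.
example = [(0.0, 40.0), (30.0, 82.0), (100.0, 182.0)]
print(kernel_utilization(example, 0.0, 200.0))  # 0.82
```

The window should cover steady state only (for instance two consecutive chunks), otherwise warm-up and teardown idle time will pull the figure down.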

It is the right saturation metric for chunked inference pipelines because it captures all sources of GPU idle — including time spent waiting on D2H transfers, host I/O, or the main thread launching the next batch — not just whether the kernels themselves are FLOP-saturated.

Why kernel utilisation is distinct from MFU

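MFU (Model FLOPs Utilization), in the sense used on this page, asks: "when the GPU is computing, how close is it to peak FLOPs?" (MFU as usually reported divides achieved model FLOPs by peak FLOPs over the whole wall-clock interval, so it also falls when the GPU idles; the comparison below uses the while-computing sense.)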

GPU kernel utilisation asks: "how much of wall-clock is the GPU even computing?"

Both can be 99% in a perfectly tuned system. They diverge when:

  • Kernel utilisation = 99%, MFU = 30%. The GPU is always busy, but the kernels themselves are memory-bound, attention-bandwidth-bound, or otherwise low in arithmetic intensity. Compute is happening but not optimally.
  • Kernel utilisation = 82%, MFU = 95%. When kernels run, they saturate the FLOPs, but the GPU is idle 18% of wall-clock time because of D2H stalls, host-I/O serialisation, or main-thread bottlenecks. This is the failure mode that patterns/asynchronous-frame-generation-pipeline addresses.

The two metrics measure orthogonal failure modes: kernel utilisation is about scheduling/overlap, MFU is about kernel-level compute efficiency. Both must be high for the inference pipeline to be efficient.
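
A short worked example of why both must be high: if MFU is measured only over the time kernels are running (the sense used in this section), then the share of peak FLOPs delivered over wall-clock time is the product of the two metrics. The helper name `effective_flops_fraction` is hypothetical, and the inputs are the two illustrative cases above, not measurements.

```python
def effective_flops_fraction(kernel_utilization: float, mfu_while_busy: float) -> float:
    """Fraction of peak FLOPs delivered over wall-clock time.

    Assumes mfu_while_busy is achieved FLOPs / peak FLOPs measured only
    over the time the GPU is executing kernels.
    """
    for name, value in (("kernel_utilization", kernel_utilization),
                        ("mfu_while_busy", mfu_while_busy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return kernel_utilization * mfu_while_busy


# Case 1: GPU always busy, but kernels are far from peak FLOPs.
always_busy_but_inefficient = effective_flops_fraction(0.99, 0.30)

# Case 2: kernels are efficient, but the GPU idles 18% of the time.
efficient_but_stalled = effective_flops_fraction(0.82, 0.95)

print(f"{always_busy_but_inefficient:.3f} {efficient_but_stalled:.3f}")
```

Under that assumption the first case delivers roughly 30% of peak and the second roughly 78%, so a high value on one metric cannot compensate for a low value on the other.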

Wiki-attested datapoint — Synthesia VAE decoder on G7e

Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.

On a g7e.2xlarge instance running the Hugging Face Diffusers Wan 2.2 14B VAE decoder:

Pipeline                        GPU kernel utilization (steady state, two consecutive chunks)
Synchronous frame generation    82%
Asynchronous frame generation   99.9%

The 82% baseline is not because the kernels themselves are inefficient — it's because the GPU sits idle between chunks while the CPU thread drains decoded frames to disk over a synchronous D2H copy + file write. The 99.9% post-optimisation figure comes from overlapping those drains with the next chunk's compute via dual CUDA streams + pinned host buffers + a dedicated worker thread + double buffering.
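
The effect of overlapping drains with compute can be illustrated with a toy timeline model. This is a sketch under stated assumptions, not a model of the Synthesia measurements: each chunk is assumed to take a fixed `compute` time on the GPU and a fixed `drain` time (D2H copy plus file write) on the host, and the values 82 and 18 are hypothetical time units chosen only so that the synchronous case reproduces an 82% figure. The function name `simulate` is illustrative.

```python
from typing import List, Tuple


def simulate(n_chunks: int, compute: float, drain: float, overlap: bool) -> Tuple[float, float]:
    """Return (gpu_busy_time, wall_clock_time) for a toy chunked pipeline."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    gpu_free = 0.0    # earliest time the GPU may start the next chunk
    drain_free = 0.0  # earliest time the drain path may start the next drain
    drain_done: List[float] = []  # completion time of each chunk's drain
    for i in range(n_chunks):
        start = gpu_free
        if overlap and i >= 2:
            # Double buffering: chunk i reuses the buffer of chunk i - 2,
            # so it must wait until that chunk has been fully drained.
            start = max(start, drain_done[i - 2])
        end = start + compute
        d_end = max(end, drain_free) + drain
        drain_done.append(d_end)
        drain_free = d_end
        # Synchronous: the next chunk cannot launch until the drain finishes.
        # Overlapped: the next chunk can launch as soon as compute finishes.
        gpu_free = end if overlap else d_end
    return n_chunks * compute, drain_done[-1]


for label, overlap in (("synchronous", False), ("asynchronous", True)):
    busy, wall = simulate(n_chunks=100, compute=82.0, drain=18.0, overlap=overlap)
    print(f"{label}: utilisation = {busy / wall:.4f}")
```

In this model the only idle time left in the overlapped case is the final chunk's drain, so utilisation approaches 100% as the number of chunks grows. If `drain` exceeds `compute`, the buffer-reuse wait becomes the bottleneck and the GPU idles again.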

Translation to user-visible metrics: an 8.2% decode-latency reduction and a Real Time Factor improvement from 3.21 to 2.95.
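
As a quick arithmetic check, the two figures are roughly consistent with each other. The sketch below assumes that a lower Real Time Factor is better and computes the relative reduction implied by the two RTF values; `relative_reduction` is an illustrative helper name.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


rtf_reduction = relative_reduction(3.21, 2.95)
print(f"{rtf_reduction:.1%}")  # 8.1%
```

The small difference from the quoted 8.2% is within what rounding the RTF values to two decimals would explain, although the two metrics need not cover exactly the same work.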

Where the 18% of idle GPU time goes (synchronous baseline)

Profiling the synchronous Wan 2.2 14B VAE decoder run shows:

  • CPU thread writing chunk N to disk — this is the dominant cause of idle GPU. While the main Python thread is busy with file I/O, it is not launching CUDA kernels for chunk N+1.
  • D2H copy itself: even at full PCIe bandwidth, copying a full chunk's worth of pixels takes non-trivial time.
  • Single CUDA stream serialisation — by default, PyTorch schedules everything on one CUDA stream per device, so D2H copies serialise behind compute kernels.

Removing these one at a time corresponds to:

  • Worker thread → main thread free to launch kernels.
  • Pinned memory → D2H runs on the GPU's copy engine directly into page-locked host memory, without bouncing through an intermediate staging buffer.
  • Dual CUDA streams → compute and copy run concurrently on the GPU's separate hardware engines.
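
The host-side part of this structure (worker thread plus a fixed pool of reusable buffers) can be sketched with the standard library alone. This is an illustrative skeleton, not the source's implementation: `run_pipeline`, `decode` and `write` are hypothetical names, and the GPU-specific pieces are left to the callbacks. In a PyTorch version one would expect `decode` to launch kernels on a compute stream and issue the device-to-host copy into a pinned buffer with `non_blocking=True` on a second `torch.cuda.Stream`, with the worker synchronising on an event before writing; that mapping is an assumption.

```python
import queue
import threading
from typing import Any, Callable, List, Optional, Tuple


def run_pipeline(
    n_chunks: int,
    decode: Callable[[int], Any],
    write: Callable[[int, Any], None],
    n_buffers: int = 2,
) -> None:
    """Decode chunks on the calling thread; drain them on a worker thread."""
    buffers: List[Any] = [None] * n_buffers
    free: "queue.Queue[int]" = queue.Queue()  # buffer slots safe to overwrite
    ready: "queue.Queue[Optional[Tuple[int, int]]]" = queue.Queue()
    for slot in range(n_buffers):
        free.put(slot)
    errors: List[BaseException] = []

    def drain() -> None:
        while True:
            item = ready.get()
            if item is None:  # sentinel: no more chunks
                return
            index, slot = item
            try:
                if not errors:  # after a failure, only recycle buffers
                    write(index, buffers[slot])
            except Exception as exc:
                errors.append(exc)
            finally:
                free.put(slot)  # hand the slot back to the main thread

    worker = threading.Thread(target=drain, name="drain-worker")
    worker.start()
    try:
        for index in range(n_chunks):
            # Blocks only when every buffer is still being drained.
            slot = free.get()
            buffers[slot] = decode(index)
            ready.put((index, slot))
    finally:
        ready.put(None)
        worker.join()
    if errors:
        raise errors[0]


# Usage with stand-in callbacks: "decode" builds a list, "write" records it.
written: List[Tuple[int, List[int]]] = []
run_pipeline(5, lambda i: [i] * 3, lambda i, data: written.append((i, list(data))))
print(written)
```

With `n_buffers=2` the main thread can be at most two chunks ahead of the drain, which is the double-buffering bound: chunk N+1 is produced while chunk N is written, and chunk N+2 waits until chunk N's buffer has been released.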

Generalisation

The kernel-utilisation metric is the right diagnostic anywhere chunk-by-chunk inference produces output that has to leave the GPU between chunks:

  • Latent-diffusion video (the wiki-attested case).
  • Image generation at batch granularity (decode batch N's pixels while batch N+1 runs).
  • Streaming generative audio at audio-frame granularity.
  • LLM batch serving at the post-processing boundary; see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-serving canonicalisation.

For workloads where the kernel-utilisation gap is small (single-shot inference with no per-chunk D2H), the metric sits near 100% and ceases to be informative; kernel-level efficiency measures such as MFU are then the ones to watch.

Seen in
