CONCEPT Cited by 1 source
Device-to-host (D2H) transfer¶
Definition¶
A device-to-host transfer (D2H copy) is the operation of moving bytes from GPU memory (VRAM) to CPU-accessible host memory (RAM). It is the inverse of a host-to-device (H2D) transfer, which moves bytes the other way at model-load time or when uploading inputs.
Modern NVIDIA GPUs service D2H transfers with dedicated hardware copy engines that are physically separate from the SM compute units. This separation is what lets compute kernels and D2H copies run concurrently, but only if the host code issues them on two different CUDA streams and copies into pinned host buffers. By default, both operations are issued on the single default CUDA stream and the copy stages through a pageable bounce buffer, so they serialise and the parallelism the hardware would otherwise allow is lost.
Why D2H is an inference bottleneck for chunked generative video¶
In a chunked generative video pipeline (latent diffusion + chunked VAE decoding), each chunk is decoded on the GPU, copied back to host memory with a D2H transfer, and then written to a file or further processed on the host.
Without overlap, the chunk-N D2H + host-I/O steps gate the chunk-N+1 GPU step, so the GPU stalls for the whole D2H + host-I/O window. AWS / Synthesia's measured baseline shows the GPU sitting idle ~18% of wall-clock time on the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder for exactly this reason. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
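To make the stall concrete, the serial baseline can be modelled as a simple ratio of compute time to total per-chunk time. This is only an illustrative sketch: the function name and the per-chunk timings are hypothetical, chosen so that the idle share matches the ~18% figure above, and they are not measurements from the source.

```python
def serial_gpu_busy_fraction(compute_ms: float, d2h_ms: float, host_io_ms: float) -> float:
    """Fraction of wall-clock time the GPU computes when every step runs back to back."""
    per_chunk_ms = compute_ms + d2h_ms + host_io_ms
    return compute_ms / per_chunk_ms


# Hypothetical per-chunk timings (illustrative only, not measured values).
busy = serial_gpu_busy_fraction(compute_ms=82.0, d2h_ms=10.0, host_io_ms=8.0)
idle = 1.0 - busy
print(f"busy={busy:.0%} idle={idle:.0%}")  # prints: busy=82% idle=18%
```

In this model the idle share depends only on the ratio of transfer plus host I/O time to compute time, which is why shrinking the decoder kernels alone would make the stall proportionally worse.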
The standard fix has three pieces, all related to D2H:
- Issue D2H on a dedicated Copy Stream rather than the default Compute Stream — so D2H runs on the GPU's copy engine concurrently with the next chunk's compute on the SMs.
- Use pinned (page-locked) host buffers as the D2H destination so the GPU's copy engine can DMA bytes directly without staging through a pageable bounce buffer (which would force serialisation with the default stream).
- Double-buffer so the chunk N D2H and the chunk N+1 compute can target distinct memory regions without aliasing.
Together, these three changes lift kernel utilisation from 82% (the complement of the ~18% idle time in the baseline above) to 99.9% on the wiki-attested benchmark.
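The three pieces above can be sketched in PyTorch, which is assumed here only because Hugging Face Diffusers is built on it. This is a minimal sketch and not the implementation from the source: `decode_chunks_overlapped`, `decode_fn`, and `write_fn` are hypothetical names, and every decoded chunk is assumed to have the same shape and dtype.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is not installed
    torch = None


def decode_chunks_overlapped(latent_chunks, decode_fn, write_fn):
    """Decode chunk i on the GPU while chunk i-1 is copied out and written.

    decode_fn (hypothetical): CUDA latent tensor -> CUDA frame tensor.
    write_fn (hypothetical): consumes a CPU tensor; it must not keep a
    reference, because the pinned buffer is reused two chunks later.
    """
    if torch is None or not torch.cuda.is_available():
        raise RuntimeError("this sketch needs PyTorch with a CUDA device")

    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()   # piece 1: dedicated copy stream
    host_bufs = [None, None]            # pieces 2 and 3: two pinned host buffers
    pending = None                      # (event, buffer) of the previous chunk

    for i, latent in enumerate(latent_chunks):
        slot = i % 2
        frames = decode_fn(latent)      # kernels are enqueued asynchronously
        if host_bufs[slot] is None:
            host_bufs[slot] = torch.empty(
                frames.shape, dtype=frames.dtype, device="cpu", pin_memory=True
            )

        # Cross-stream barrier: the copy must not start before the decode ends.
        copy_stream.wait_stream(compute_stream)
        with torch.cuda.stream(copy_stream):
            host_bufs[slot].copy_(frames, non_blocking=True)
            # Tell the caching allocator that frames is still in use on copy_stream.
            frames.record_stream(copy_stream)
            copy_done = torch.cuda.Event()
            copy_done.record(copy_stream)

        # While the GPU works on chunk i, the host drains chunk i-1.
        if pending is not None:
            pending[0].synchronize()
            write_fn(pending[1])
        pending = (copy_done, host_bufs[slot])

    if pending is not None:             # drain the final chunk
        pending[0].synchronize()
        write_fn(pending[1])
```

Double buffering is explicit on the host side only. On the device side this sketch relies on the decoder allocating a fresh output tensor per chunk and on `record_stream` to keep that memory from being recycled while the copy is still in flight.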
Why "memory-efficient inference" forces this bottleneck¶
Holding the entire decoded video in VRAM until done would avoid per-chunk D2H — but breaks the scale-to-arbitrarily-long-video invariant. Once the pipeline is committed to chunked decoding for memory-efficiency reasons, the per-chunk D2H + storage commit becomes the structural bottleneck. The fix is therefore not "do less D2H" but "overlap D2H with compute".
This is a clean instance of the broader pattern that memory-efficiency optimisations create new serialisation points that must then be optimised back out via overlap or batching.
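The "overlap rather than eliminate" argument can be checked with a small timeline model. This is a simplified sketch under stated assumptions: `simulate_pipeline` is a hypothetical helper, one transfer path handles both the D2H copy and host I/O, and the timings are illustrative rather than measured. With one buffer the loop degenerates to the serial baseline; with two, the next chunk's compute can start while the previous chunk is still leaving the GPU.

```python
def simulate_pipeline(num_chunks, compute_ms, transfer_ms, num_buffers):
    """Return (wall_clock_ms, gpu_busy_fraction) for a chunked decode loop.

    transfer_ms covers the D2H copy plus host I/O for one chunk. A chunk's
    compute may start only when an output buffer is free, i.e. when the
    transfer issued num_buffers chunks earlier has finished.
    Requires num_chunks >= 1.
    """
    transfer_end = []      # completion time of each chunk's transfer
    compute_free = 0.0     # when the compute stream is next idle
    transfer_free = 0.0    # when the transfer path is next idle
    for i in range(num_chunks):
        start = compute_free
        if i >= num_buffers:
            start = max(start, transfer_end[i - num_buffers])
        compute_free = start + compute_ms
        transfer_start = max(compute_free, transfer_free)
        transfer_free = transfer_start + transfer_ms
        transfer_end.append(transfer_free)
    wall_clock_ms = transfer_end[-1]
    return wall_clock_ms, num_chunks * compute_ms / wall_clock_ms


# Hypothetical timings: 82 ms of compute and 18 ms of transfer per chunk.
serial = simulate_pipeline(100, 82.0, 18.0, num_buffers=1)
overlapped = simulate_pipeline(100, 82.0, 18.0, num_buffers=2)
print(f"serial: {serial[1]:.1%} busy, overlapped: {overlapped[1]:.1%} busy")
```

In this model the amount of D2H work is identical in both runs; only the final chunk's transfer remains exposed once it is overlapped. The model also shows the limit of the fix: if `transfer_ms` exceeds `compute_ms`, the transfer path becomes the bottleneck and overlap can no longer hide it fully.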
D2H vs H2D — asymmetric importance¶
H2D (model load + input upload) is typically a one-time or infrequent event in inference: model weights load once at startup, inputs upload once per request. D2H is per-output: every chunk, every frame, every batch produces output that has to leave the GPU. That asymmetry is why D2H is more often the bottleneck than H2D for steady-state inference.
Generalisation¶
D2H bottlenecks recur across:
- VAE decoders in latent-diffusion video and image pipelines (wiki-attested).
- LLM batch serving post-processing: decoded tokens have to leave the GPU before a downstream stage runs; see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-altitude canonicalisation.
- Streaming generative audio at audio-frame granularity.
- Graphics readback: see concepts/synchronous-vs-asynchronous-readback for the graphics-API altitude (WebGL gl.readPixels vs WebGPU mapAsync).
The same primitives — dual streams, pinned memory, double buffering, cross-stream barriers — apply at every altitude.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. "Once a chunk has been decoded and processed on the GPU, the corresponding pixel frames must be transferred back to host (CPU) memory with a D2H transfer, so they can be written to a file or further processed." The D2H step is named as the binding bottleneck for the chunked VAE-decoder pipeline; the Asynchronous Frame Generation Pipeline overlaps it with the next chunk's compute.
Related¶
- concepts/synchronous-vs-asynchronous-readback — same sync/async axis at graphics-API altitude.
- concepts/pinned-memory — required for fully-async D2H DMA.
- concepts/cuda-stream — primitive that lets D2H run on the copy engine concurrently with compute.
- concepts/gpu-kernel-utilization — saturation metric whose gap is dominated by D2H stalls in the synchronous baseline.
- concepts/async-cpu-gpu-pipelined-scheduling — same overlap shape at LLM-batch-serving altitude.
- patterns/asynchronous-frame-generation-pipeline — full fix pattern.
- patterns/dual-cuda-stream-compute-and-copy-overlap — the CUDA-stream half of the fix.