CONCEPT Cited by 1 source

Device-to-host (D2H) transfer

Definition

A device-to-host transfer (D2H copy) is the operation of moving bytes from GPU memory (VRAM) to CPU-accessible host memory (RAM). It is the inverse of a host-to-device (H2D) transfer, which moves bytes the other way at model-load time or when uploading inputs.

Modern NVIDIA GPUs have dedicated hardware copy engines that are physically separate from the SM compute units. This separation is what lets compute kernels and D2H copies run concurrently, but only if the host code issues them on separate CUDA streams and copies into pinned host buffers. By default, both operations are queued on the single default CUDA stream and the copy is staged through a pageable bounce buffer, which serialises them and forfeits the parallelism the hardware would otherwise allow.

Why D2H is an inference bottleneck for chunked generative video

In a chunked generative video pipeline (latent diffusion + chunked VAE decoding):

[GPU decode chunk N] → [D2H copy chunk N] → [host I/O chunk N]
                                            → [GPU decode chunk N+1]

Without overlap, the chunk-N D2H + host-I/O steps gate the chunk-N+1 GPU step. The GPU stalls during the D2H + host-I/O window. AWS / Synthesia's measured baseline shows the GPU sitting idle ~18% of wall-clock time on the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder for exactly this reason. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)

The standard fix has three pieces, all related to D2H:

  1. Issue D2H on a dedicated Copy Stream rather than the default Compute Stream — so D2H runs on the GPU's copy engine concurrently with the next chunk's compute on the SMs.
  2. Use pinned (page-locked) host buffers as the D2H destination so the GPU's copy engine can DMA bytes directly without staging through a pageable bounce buffer (which would force serialisation with the default stream).
  3. Double-buffer so the chunk N D2H and the chunk N+1 compute can target distinct memory regions without aliasing.
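The three pieces above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the pipeline from the cited source: `decode_overlapped`, `decode_chunk`, and `consume` are hypothetical names, and the sketch assumes every decoded chunk has the same shape and dtype. The `torch.cuda` calls used (`Stream`, `Event`, `wait_event`, `record_stream`, pinned `torch.empty`, `copy_(..., non_blocking=True)`) are standard PyTorch APIs.

```python
try:
    import torch  # optional dependency; running the sketch needs a CUDA build of PyTorch
except ImportError:
    torch = None


def decode_overlapped(decode_chunk, latent_chunks, consume):
    """Overlap each chunk's D2H copy with the next chunk's decode.

    decode_chunk(latent) -> CUDA tensor  (hypothetical decoder callable)
    consume(host_tensor) -> None         (hypothetical host-side writer)
    Assumes every decoded chunk has the same shape and dtype.
    """
    compute = torch.cuda.current_stream()
    copy = torch.cuda.Stream()   # piece 1: dedicated copy stream
    pinned = [None, None]        # pieces 2 and 3: two pinned host buffers
    copied = [None, None]        # per-buffer event: "D2H into this buffer finished"
    count = 0

    for i, latent in enumerate(latent_chunks):
        slot = i % 2
        frames = decode_chunk(latent)   # queued on the compute stream
        decoded = torch.cuda.Event()
        decoded.record(compute)

        if copied[slot] is not None:
            # Chunk i-2 used this buffer: hand it to the host before overwriting it.
            # consume() must finish with the buffer (or copy it) before returning.
            copied[slot].synchronize()
            consume(pinned[slot])
        else:
            pinned[slot] = torch.empty(frames.shape, dtype=frames.dtype, pin_memory=True)

        copy.wait_event(decoded)        # cross-stream barrier: copy waits for decode i only
        with torch.cuda.stream(copy):
            pinned[slot].copy_(frames, non_blocking=True)
        frames.record_stream(copy)      # stop the allocator recycling frames mid-copy
        copied[slot] = torch.cuda.Event()
        copied[slot].record(copy)
        count = i + 1

    # Drain the last (up to two) chunks in order.
    for j in range(max(0, count - 2), count):
        copied[j % 2].synchronize()
        consume(pinned[j % 2])
```

In this sketch `consume` runs on the calling thread, so a slow writer still delays the next `decode_chunk` call; moving host I/O to a worker thread would be a further step that is not shown here.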

Together, these three changes lift kernel utilisation from 82% (the ~18% idle baseline above) to 99.9% in the AWS / Synthesia benchmark.
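A toy timing model can make the direction of that improvement concrete. The sketch below is an assumption-laden illustration, not a reproduction of the source's measurement: `gpu_utilisation` is a hypothetical helper, and the per-chunk durations (82 ms of decode, 18 ms of D2H plus host I/O) are invented so that the single-buffer case lands on the 82% baseline. It models one compute engine, one drain path, and a configurable number of output buffers.

```python
def gpu_utilisation(n_chunks, compute_ms, drain_ms, n_buffers):
    """Fraction of wall-clock time the compute engine is busy.

    n_buffers=1 models the serial pipeline: the next decode cannot start
    until the single output buffer has been drained (D2H + host I/O).
    n_buffers=2 models double buffering with an overlapped drain.
    """
    gpu_free = 0.0                      # when the compute engine is next free
    drain_free = 0.0                    # when the copy/host path is next free
    buffer_free = [0.0] * n_buffers     # when each output buffer is drained

    for i in range(n_chunks):
        slot = i % n_buffers
        start = max(gpu_free, buffer_free[slot])   # wait for a free buffer
        gpu_free = start + compute_ms
        drain_start = max(gpu_free, drain_free)    # drain starts after decode
        drain_free = drain_start + drain_ms
        buffer_free[slot] = drain_free

    return n_chunks * compute_ms / drain_free      # busy time / total wall time


if __name__ == "__main__":
    # Hypothetical durations, chosen only to mirror the 82% baseline.
    print(gpu_utilisation(1000, 82, 18, n_buffers=1))
    print(gpu_utilisation(1000, 82, 18, n_buffers=2))
```

In this model the overlapped case leaves only the final chunk's drain exposed, so utilisation approaches 100% as the chunk count grows. That holds only while the drain is shorter than the decode; if it is longer, the pipeline becomes drain-bound and overlap alone cannot hide it.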

Why "memory-efficient inference" forces this bottleneck

Holding the entire decoded video in VRAM until done would avoid per-chunk D2H — but breaks the scale-to-arbitrarily-long-video invariant. Once the pipeline is committed to chunked decoding for memory-efficiency reasons, the per-chunk D2H + storage commit becomes the structural bottleneck. The fix is therefore not "do less D2H" but "overlap D2H with compute".
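A back-of-the-envelope calculation shows why holding all decoded frames in VRAM does not scale. The resolution, frame rate, channel count, and 2-byte (fp16) element size below are illustrative assumptions, not figures from the cited source, and `decoded_bytes` is a hypothetical helper.

```python
def decoded_bytes(frames, height=720, width=1280, channels=3, bytes_per_value=2):
    """Raw size of decoded pixel frames (all parameters are assumed values)."""
    return frames * height * width * channels * bytes_per_value


GIB = 2 ** 30
FPS = 24  # assumed frame rate

chunk = decoded_bytes(16)               # one hypothetical 16-frame chunk
minute = decoded_bytes(60 * FPS)        # one minute of video
ten_minutes = decoded_bytes(600 * FPS)  # ten minutes of video

if __name__ == "__main__":
    print(f"chunk:       {chunk / 2**20:.1f} MiB")
    print(f"one minute:  {minute / GIB:.1f} GiB")
    print(f"ten minutes: {ten_minutes / GIB:.1f} GiB")
```

Under these assumptions a single chunk is tens of MiB while the full output grows linearly with duration into many GiB, which is why per-chunk D2H is accepted and then hidden behind compute instead of being avoided.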

This is a clean instance of the broader pattern that memory-efficiency optimisations create new serialisation points that must then be optimised back out via overlap or batching.

D2H vs H2D — asymmetric importance

H2D (model load + input upload) is typically a one-time or infrequent event in inference: model weights load once at startup, inputs upload once per request. D2H is per-output: every chunk, every frame, every batch produces output that has to leave the GPU. That asymmetry is why D2H is more often the bottleneck than H2D for steady-state inference.

Generalisation

D2H bottlenecks recur across:

The same primitives — dual streams, pinned memory, double buffering, cross-stream barriers — apply at every altitude.

Seen in

  • sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. "Once a chunk has been decoded and processed on the GPU, the corresponding pixel frames must be transferred back to host (CPU) memory with a D2H transfer, so they can be written to a file or further processed." The D2H step is named as the binding bottleneck for the chunked VAE-decoder pipeline; the Asynchronous Frame Generation Pipeline overlaps it with the next chunk's compute.